What I Learned at Work this Week: Algos Never Stop

Mike Diaz
7 min readMay 29, 2022

--

Pexels

A good friend of mine recently graduated from coding boot camp and is now dealing with everyone’s favorite part about being an engineer: interviewing. She’s asked me for help with a couple of algorithm problems and I think I’ve been more of an impediment than anything, so it seems as good a time as any to brush up on my knowledge. Today we’ll review a problem that came up in my workplace and the merits of three possible solutions. (NOTE for loyal readers: I most commonly write in JavaScript, but today I’ll be using Python).

K Most Frequent Elements

Here’s the prompt: Given a non-empty array of integers, return the k most frequent elements. Note that k is always valid, 1 ≤ k ≤ number of unique elements.

Example 1:

Input: arr = [3, 4, 4, 4, 7, 7], k = 2
Output: [4, 7]
// The two most frequent elements are 4 and 7.

Example 2:

Input: arr = [3], k = 1
Output: [3]

The first option for solving this is inefficient, but it’s probably what I would have done if I got this question (and I bet it would have taken me most of the interview time to get there). We can iterate through the given array and count each number by using a dictionary where the keys are numbers from the array and the values are how many times they appear:

input_array = [3, 4, 4, 5, 5, 5]
tracking_dict = {}
for num in input_array:
if num not in tracking_dict:
tracking_dict[num] = 1
else:
tracking_dict[num] += 1

Running this code will result in a dictionary that looks like this:

{
3: 1,
4: 2,
5: 3
}

The number 3 occurs once in the array, 4 occurs twice, and 5 three times. Now if we place these in order, we can figure out the top k numbers:

descending_tuples = sorted(
tracking_dict.items(),
key=lambda item: -item[1]
)

Sorted is a Python function that can accept a collection (in this case a dictionary) and, in the case of a dictionary, return the keys and values as a list of tuples. The first element in the tuple is the key from the dictionary item and the second is the value. The second argument we pass is how the elements should be sorted. It’s a lambda, which in this case means a function on a single line. This lambda is instructing our code to sort based on the second element in the tuple, but that would give us an ascending list where the smallest value comes first. To reverse that, we just put a minus sign before the value.

Once we have the sorted array, we can iterate through it and add the first k values to an array, then return that array. To make this work, let’s assume k = 2:

answer = []
counter = 0
k = 2
for tuple in descending_tuples:
answer.append(tuple[0])
counter += 1
if counter == k:
break
return answer
# [5, 4]

Awesome — that wasn’t so bad! But this particular response might not impress our interviewer because of its time complexity: O(n log n). That’s because the sort function we used implements a divide and conquer approach to organizing our data. Per Joseph Trettevik:

Any algorithm that repeatedly divides a set of data in half and then processes those halves independently with a sub algorithm that has a time complexity of O(N), will have an overall time complexity of O(N log N). Examples include Merge sort, Heap sort, and Quick sort.

There’s nothing I can do about how sort works under the hood, but it sounds like there might be a better option for us in this case. What’s another way we can solve the problem?

heapq

A heap is a data structure that is often used for priority queues. It organizes a collection just like sort, but it comes with a bonus functionality that can save us some time. Our code starts the same way as before by building a dictionary for the numbers and how often they occur. In Python, we can use a heap by importing heapq:

import heapqinput_array = [3, 4, 4, 5, 5, 5]
k = 2
tracking_dict = {}
for num in input_array:
if num not in tracking_dict:
tracking_dict[num] = 1
else:
tracking_dict[num] += 1
return heapq.nlargest(k, tracking_dict.keys(), key=tracking_dict.get)

We’re invoking the function nlargest, which will actually pick out a certain number of elements based on our input. The first argument we pass is the number of elements we want to return (k), the second is an iterable of values that we want to sort. In this case, we’re looking at the keys of our dictionary because that’s what we want to get back. If we left it at two arguments, we’d end up with the two largest keys, but we actually want to sort based on the values associated with those keys, so we need a third argument. Like with sort, we use key= to pass a function that will be used to evaluate our collection. I was confused about how tracking_dict.get would work at first, but it made more sense after reading this Stack Overflow question. Remember that were iterating through the keys in our dictionary, so by passing tracking_dict.get as our sorting function, we’ll be passing the keys as arguments and sorting based on the results:

# 3: tracking_dict.get(3) is 1
# 4: tracking_dict.get(4) is 2, which is larger than 1, so 4 moves to the front
# 5: tracking_dict.get(5) is 3, so 5 moves to the front
# since we passed 2 as our first argument to nlargest, we'll get the 2 largest results: 5 and 4.

This method will improve our efficiency to O(n log k). As we saw in the original prompt, n and k could be equal, but k will never be larger than n. Since it could be smaller (practically speaking it usually is), that’s an upgrade.

The most efficient method

I think that optimizing with a heap is tricky because you have to know about heaps and understand that they’ll stop sorting after we get k results. The third method for solving this problem, however, doesn’t resort to advanced or obscure functionality to once again improve on our efficiency. It starts by creating our standard counting dictionary as well as an array of buckets:

input_array = [3, 4, 4, 5, 5, 5]
counter = {}
frequency_buckets = [[] for i in range(len(input_array)+1)]
for i in input_array:
counter[i] = 1 + counter.get(i, 0)

The syntax for frequency_buckets simply creates an array one element longer than the length of our input array. We’re populating it with a series of empty arrays. Originally, I didn’t understand why these had to be arrays, but it will become clear once we work through the problem. What we also see here is a clever way to populate our counting dictionary:

for i in input_array:
counter[i] = 1 + counter.get(i, 0)

Rather than using if statements, we use get, which allows for a “default value” argument if the key does not exist in the dictionary. So if we’re adding a key for the first time, the value will be set to 1 because counter.get will return 0 rather than a “key does not exist” error.

Right now, frequency_buckets looks like this:

[[],[],[],[],[],[],[]]

Let’s populate it:

for key, val in counter.items():
frequency_buckets[val].append(key)

Remember that counter.items() will return everything from our counter dictionary as an array of tuples. Here we’re iterating through that array and updating frequency_buckets so that the element whose index matches a number’s frequency sees that number appended to its array. If you’re confused, here’s the example laid out:

# counter == {3:1, 4:2, 5:3}
# counter.items() == [(3,1), (4,2), (5,3)]
# frequency_buckets becomes [[], [3], [4], [5], [], [], []]

3 occurs once in the input_array and is placed in index 1’s bucket. 4 occurs twice and is at index 2. 5 goes to index 3. We use arrays inside of frequency_buckets just in case there are numbers that appear at the same frequency as each other:

input_array = [1,4,4,4,4,8,8,8,8]
frequency_buckets == [[],[1],[],[],[4,8],[],[],[],[],[],[]]

Now all we have to do is traverse frequency_buckets backward and add values until we’ve reached the number of values we’ve been asked for:

ans = []
for i in range(len(frequency_buckets)-1, 0, -1):
for n in frequency_buckets[i]:
ans.append(n)
if len(ans) == k:
return ans

We use range to step backward — the first argument is where we’re going to start in the array, the second is where we’ll finish, and the third is the size of our step (negative meaning we go backward). Once our answer array is the proper length, we return it and end the function. This improves our efficiency to O(n) because we never have to compare values to each other to sort them. We make several loops, but those loops run in direct relation to the length of our input array, rather than in exponential relation.

Algos

There’s not much to say about algos that I haven’t written before. I don’t like them, I don’t use them in my day-to-day work, and if my job eventually does become focused on mathematical efficiency quirks, I’ll probably like it a lot less. For most of us, they require a lot of patience and a good teacher/partner to help us work through things we don’t understand. Even I had to gloss over some concepts here because I can’t confidently explain n log n or exponential relationships. Just remember that if you ever get discouraged, anyone can struggle with even the most basic algos and they’re only one piece of the puzzle to being a successful engineer.

Sources

--

--

Mike Diaz
Mike Diaz

No responses yet