I had a couple of days off work this week, which means that I didn’t have quite as much time to expose myself to something new. So I’ve added an asterisk to my title because I learned enough last week to cover two blog posts. Can you believe it?
A big part of my job is writing queries to pull data into reports for clients. They’ll want to know things like how many messages we sent for them or how many subscribers they gained each week, which we can determine with pretty straightforward SQL. But last week, I was asked to generate a report that pulled in event counts based on the appearance of certain UTM params in a URL. SQL isn’t great for parsing strings, so someone else on my team had written a Python script for this report. So if I were going to generate a report that met the client’s new needs, I’d have to read, understand, and alter that script.
name and main
Loyal readers will know that understanding a Python file isn’t always an easy task for me. Since I’ve got some experience now, I know that I can often trace the logic step-by-step by jumping to the bottom of the file and looking for a main() invocation. And indeed I found one wrapped in the oft-seen but rarely understood:
if __name__ == '__main__':
Thanks to this article from Free Code Camp, I was reminded that this condition’s purpose is to run a script’s logic if it is the main module being run by an interpreter. If, on the other hand, it’s being imported into a different module, __name__ will not be set to __main__ and the condition will not be met. In my case, the logic below the condition creates a parser that we use to collect arguments that may be passed when someone runs the script. What does that mean? Here’s an example that might be familiar:
git commit -m "first commit"
When using Git, we often run the commit command to save our work. But commits require comments, so we add an argument of -m followed by our comment, in quotes. The script I’m working with accepts an argument of -d followed by a company domain name. That way, we can specify which company we want to generate a report for.
main()
Once I knew what would cause the invocation of main, I could scroll a little higher and see what that entailed. I broke the function into three parts:
- Part 1: Collect the data. Right up front there were three variables declared as counter, join_data, and click_data. They were the results of functions defined elsewhere in the file, such as get_join_data. The counter variable was defined differently:
counter = collections.defaultdict(int)
Collections is an imported module that helps us generate different types of collections like, say, a dictionary! The defaultdict method generates an empty dictionary with a special feature: if we try to reference a key that isn’t contained in the dictionary, it’ll return the default value rather than giving us an error. In this case, the default value is int, a constructor that returns 0.
- Part 2: Count collected data. We invoke a function that doesn’t create a new variable, but does accept our counter variable as an argument:
count_in_file(counter, [row['url'] for row in join_data])
We know that counter is currently an empty dictionary that where errant key references will default to 0. The second argument is a loop that returns a list of URLs from join_data. We still haven’t looked at the function that builds join_data, but since we see for and in here, we know it’s an iterable object. It looks like each “row” in join_data is a dictionary that has a url key, hence our expectation that we’re going to end up with a list of URLs. We’ll look more closely at the execution of count_in_file when we have to make a change to it.
- Part 3: Put the data into a CSV. The script uses open to create a file object and Python’s csv module to create an object that can write to that file. As we might expect, we use a loop to iterate through our data and write it to the CSV one row at a time:
for key, val in sorted(counter.items(), key=lambda item: item[1], reverse=True):
csv_writer.writerow([key, val])
We’re going to iterate through the results of sorted, which is a built-in Python function that can accept up to three arguments. The first argument is the sequence we want to sort. In this case, that’s the result of counter.items(). The Python items method returns a collection of a dictionary’s key value pairs as tuples. We still don’t know what counter looks like, but we’re going to sort the key value pairs somehow.
The next argument in sorted is the key, or A Function to execute to decide the order. As we iterate through our collection, one item will be compared to the next. Since our collection is a collection of tuples, it will be helpful to specify which element within the tuple we want to be the sorting key. We see that our key is noted as key=lambda item: item[1]. Lambda sounds scary, but really it just means that we’re going to write a one-line function, like we would with => in JavaScript. So this means that we’re going to use the second element of our tuple for sorting.
The third argument lets us reverse our list if so desired. In this case, we’ve set it to True, so we’re going to have the highest numbers (or latest letters in the alphabet) first in our collection. We then use our csv_writer to add the key and value to a CSV, in the sorted order.
count_in_file
We saw earlier that a function called count_in_file appears to update the counter dictionary that we define at the beginning of our main method. And then we saw that when we ultimately write a CSV to report on the data we’ve collected, we iterate through the items in counter. So if I want to update the report, I can start with count_in_file.
Most of the function’s responsibility is to parse and split a URL string as it searches for a specific param. Once that’s done, we run:
counter[url or 'EMPTY'] += 1
Our counter’s keys are URLs (or ‘empty’ if we don’t get a URL) and its values are the number of times we find that URL. We’re counting how many “click” and “join” events have the same URLs, but I was asked to add a new column that counted the number of “impressions” associated with these same URLs. This probably sounds confusing if you don’t have domain knowledge, so just consider the request to be generic: we’ve got elements that look like this:
{ 'url_1': 100 }
This works great when we only have to count one thing about our URLs, but now I’m being asked to count click/joins AND impressions. If we want two values associated with one key, we can change our data type from an integer to some sort of collection. I chose list and wrote this:
if not counter[url]:
counter[url] = [0, 0]
counter[url or 'EMPTY'][index] += 1
The first time we come across any URL, I set the default value of that key to a list with two elements of 0. Then, in every instance, I add to one of those elements depending on the index that I’ve provided. Where do we get index? It’s passed as an additional argument when I invoke the function:
count_in_file(counter, 1, [row['url'] for row in impression_data])
So index 0 is for join/click data and index 1 is for impression data. The last thing I have to do to make this work is alter the csv_writer execution:
for key, val in sorted(counter.items(), key=lambda item: item[1], reverse=True):
csv_writer.writerow([key, val[0], val[1]])
When we writerow, our URL (key) is still the first column. Since val is now a list, we make its 0th and 1st indexes our next two columns.
The Full Package
The solution ended up being pretty simple here, but that doesn’t mean this was an easy fix. I had to understand every step of the script and dig into some tricky Python syntax before I could determine how to alter this report. This post didn’t even get into the database queries for get_join_data, get_click_data, and what I wrote for get_impression_data. If you’re interested in learning more about that, this earlier post might be helpful. And this week especially, I encourage confused readers to click through to the Sources if there’s any lingering confusion. I glossed over the functionality of some of these methods, but W3 and Programiz provide a ton of helpful examples. Check them out!
Sources
- Python if __name__ == __main__ Explained with Code Examples, Goran Aviani, Free Code Camp
- Collections — Container datatypes, Python Docs
- Defaultdict in Python, GeeksforGeeks
- python defaultdict: 0 vs. int and [] vs list, Stack Overflow
- Python sorted() Function, W3 Schools
- Python Dictionary items(), Programiz