Stop me if you’ve heard this one before: at my job, we have a database table with about 35 columns that is referenced by a 2,500-line Python script. We update the table multiple times each day and the script runs multiple times each night. There is one person who understands the script pretty well and everyone else mostly does their best not to break it. When this person is on vacation, we pray that the script will work without incident.
It doesn’t take a software engineer to recognize that this is a tenuous situation and, to credit the members of my team, a lot of them have been spending time writing PRs for this script to try and become better acquainted with it. In an effort to join this group, I set a goal for myself to document as much of it as I can, which means I’m once again reading Python this weekend.
For in if
Last weekend, I wrote another blog about decoding a Python script. In that entry, I learned that following a script’s logic often means starting at the very bottom, so I gave that another shot and found a function called get_pg_company_configs.
To give a little context, the configs we’re searching for are settings for transmitting data to an SFTP (SSH File Transfer Protocol) server. The configs are arranged in a database where each company can have its own company name, credentials, sharing settings, and more. First we use psycopg2 to connect to Postgres and pull all the configs from our DB, then we filter them depending on the arguments provided in the script invocation. I got stuck trying to read this line:
[config for config in all_configs if config['id'] in args.etl_id]
If our args contain etl_id, we return a list based on this logic. What’s being said here? Let’s look at the line again, this time paying attention to the keywords for, in, and if:
[config for config in all_configs if config['id'] in args.etl_id]
This wasn’t really easy for me to look up because all I had were the keywords. After searching various combinations of “for, in, if,” I finally found a Stack Overflow question with syntax like what I saw here. In the response, I learned that this is an example of list comprehension, which was much easier to search for.
A list comprehension is a one-line way to build a new list from an existing iterable, optionally filtering its elements or transforming them along the way. One-liners are nifty, but I personally find them difficult to read. Let’s rewrite this logic more traditionally:
matching_configs = []
for config in all_configs:
    if config['id'] in args.etl_id:
        matching_configs.append(config)
return matching_configs
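To convince myself that the two forms are equivalent, here is a small sketch that runs both on made-up data. The sample configs and the SimpleNamespace standing in for the parsed arguments are assumptions for illustration, not the real table or script. Note that the loop version has to append every match to a list; returning inside the loop would hand back only the first match.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the rows pulled from Postgres and the parsed CLI args.
all_configs = [
    {"id": 1, "company": "Acme"},
    {"id": 2, "company": "Globex"},
    {"id": 3, "company": "Initech"},
]
args = SimpleNamespace(etl_id=[1, 3])


def filter_with_loop(configs, etl_ids):
    """Collect every config whose id is in etl_ids, using a plain loop."""
    matches = []
    for config in configs:
        if config["id"] in etl_ids:
            matches.append(config)
    return matches


def filter_with_comprehension(configs, etl_ids):
    """Same filter, written as a list comprehension."""
    return [config for config in configs if config["id"] in etl_ids]


loop_result = filter_with_loop(all_configs, args.etl_id)
comprehension_result = filter_with_comprehension(all_configs, args.etl_id)

# Both approaches keep the configs with ids 1 and 3, in their original order.
print(loop_result == comprehension_result)
```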
Part of what makes list comprehension syntax hard to read is that the expression that ends up in the result, rather than the loop or the conditional, comes first. When I’m reading this, I skip over the first config and jump to the for loop that iterates through all_configs (for config in all_configs). The logic continues left-to-right by adding a conditional (if config['id'] in args.etl_id). So for each config in all_configs, we’re going to execute some logic if its id value is found within the etl_id list provided when the script is invoked.
To find out what logic will be executed, we go all the way back to the beginning of our line and find config. This is shorthand for adding the config into a list because our whole line is encased in square brackets. We have some flexibility in that part of the list comprehension. If we wanted to return only the ID of the config, we could write:
[config['id'] for config in all_configs if config['id'] in args.etl_id]
If config had a value property that was an integer, we could run a calculation on it as part of our loop. We could return half of every config’s value:
[config['value']/2 for config in all_configs if config['id'] in args.etl_id]
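Here is a quick sketch of those two variations running on made-up data. The value field and the etl_ids list are hypothetical, just like in the examples above.

```python
# Hypothetical configs with an integer "value" field, for illustration only.
all_configs = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 7},
    {"id": 3, "value": 4},
]
etl_ids = [1, 2]

# Keep only the id of each matching config.
ids = [config["id"] for config in all_configs if config["id"] in etl_ids]

# Halve the value of each matching config.
halves = [config["value"] / 2 for config in all_configs if config["id"] in etl_ids]

print(ids)     # [1, 2]
print(halves)  # [5.0, 3.5]
```

One thing I noticed while trying this: in Python 3, / always produces a float, even when the number divides evenly. If we wanted whole numbers, // would do floor division instead.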
Breaking this down proved really helpful to me since I came across a very similar expression a few lines later:
if any(curr_hostname in host for host in default_ec2_hostnames):
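In this case, we were using the result as a conditional for running additional logic. any iterates through whatever it is given and returns True as soon as it finds a truthy element. Since there are no square brackets here, this is strictly speaking a generator expression rather than a list comprehension: the syntax is the same, but the values are produced one at a time instead of being collected into a list. How are those values produced?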
curr_hostname in host for host in default_ec2_hostnames
curr_hostname is defined earlier in our code, but host is not. We can see that host is the named variable that we’ve chosen as we iterate through default_ec2_hostnames. So we’re checking to see if curr_hostname is in host. But I was confused because I know that curr_hostname is a string. So is it possible that host is a list?
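Part of my confusion is that in works on both lists and strings, but it means something different for each. A small sketch with made-up values shows the difference:

```python
# Made-up values for illustration; these are not the real hostnames.
curr_hostname = "web-01"

host_as_string = "web-01.example.internal"
host_as_list = ["web-01.example.internal"]

# On a string, `in` checks whether the left side is a substring.
print(curr_hostname in host_as_string)  # True

# On a list, `in` checks whether any element equals the left side exactly.
print(curr_hostname in host_as_list)    # False
```

So the same expression can be valid either way, and I needed to know the type of host to know which check was happening.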
I learned that our default_ec2_hostnames were populated by an imported EC2 method called describe_instances. I Googled that and found this documentation, which describes the resulting object. Reviewing my code again, I saw that default_ec2_hostnames was populated by a property called PrivateDnsName. In the document, I saw an example where that value looks like this:
"ip-10-0-0-157.us-east-2.compute.internal"
curr_hostname is generated by running os.uname().nodename. With this information, my best guess is that nodename ends up being a substring of the hostname, so we’re using in not to check a list, but to check for a string within our larger hostname property.
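To test that guess, here is a sketch of the check with sample data. The hostnames and the short nodename are assumptions based on the format in the AWS example, and is_default_host is a hypothetical helper name, not something from our script.

```python
# Sample hostnames in the PrivateDnsName format; the values are made up.
default_ec2_hostnames = [
    "ip-10-0-0-12.us-east-2.compute.internal",
    "ip-10-0-0-157.us-east-2.compute.internal",
]

# Assumed shape of os.uname().nodename on one of these machines.
curr_hostname = "ip-10-0-0-157"


def is_default_host(hostname, known_hostnames):
    """Return True if hostname is a substring of any known hostname."""
    return any(hostname in host for host in known_hostnames)


print(is_default_host(curr_hostname, default_ec2_hostnames))   # True
print(is_default_host("ip-10-0-0-99", default_ec2_hostnames))  # False
```

One caveat with substring matching: a shorter name like "ip-10-0-0-1" would also return True here, because it appears inside both sample hostnames. That may be fine for our script, but it is worth knowing when reading the condition.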
A Single Step
This is a shorter entry than some others, but it didn’t take much less time to write. I had to not just dig through a layer of Python syntax but also check the logic of unfamiliar imported libraries to determine the data types of my variables. It takes patience to make it through someone else’s code, and sometimes the results will be simpler than we realized. Ultimately, we’re still improving our comprehension, one line at a time.
Sources
- psycopg
- Using List Comprehensions Instead of map and filter, O’Reilly
- describe-instances, docs.aws.amazon.com
- Python | os.uname() method, GeeksforGeeks