I’ve only worked as a Solutions Engineer, but this may be a universal experience in the world of tech. We’re faced with a problem because of a decision made by the architects of our systems. Maybe they weren’t considering a use case that is now more important to our workflow, or maybe they couldn’t get something to work, pushed out an MVP, and it stuck. But whatever the reason, we now have an event populating our database with an empty column:
ID | EVENT_TYPE | USER_ID | REQUEST_URL | CREATED
1 | SIGNUP | 4444 | <null> | 2022-01-01 00:24:15
2 | SIGNUP | 22186 | <null> | 2022-01-01 12:13:55
This is a problem because we use this table to generate a report for a client and they want to know the REQUEST_URL associated with all of their SIGNUP events. There are two paths we can walk here:
- We can investigate why REQUEST_URL is being dropped and fix the code. After all, the event is being generated by some webpage, so it should have an associated URL, right?
- We can try to use other data to determine the most likely REQUEST_URL after the fact. We can hack our way through.
I think Option 1 is better. We should at least figure out why this value is null (maybe there’s a good reason) and see if we can get it to start populating. In a perfect world, we’d go with 1, but I ran into some resistance when I tried to make this happen. I work for a big company and I don’t know which engineer wrote the code that populates this table or whether they still work here. I don’t know the name of their team or their PM. I don’t even know where the code lives in our codebase to try and understand how it works and offer a solution. And while in that perfect world I’d have time to figure all this out, write up my own PR, and have a conversation with someone about how it can fit into their module’s framework, our client wants a report ASAP and parsing my way through code that someone else already understands isn’t going to get us there. So after our initial investigation got us nowhere, we went with Option 2.
The Hack
To give a little more context to our semi-hypothetical situation, let’s say our client uses us for a coupon-sending service. The ideal flow is that a user:
- Goes to the client’s site
- Sees our pop-up that says “sign up and we’ll send you a coupon!”
- Shares their email address to complete the sign-up.
Step 3 generates an event which is saved to our database. We know that these events are missing data for some reason, so we can look elsewhere in the flow to try and find it. Step 2 also generates an event, and it stands to reason that, in most cases, if the same user was served a pop-up within 5 minutes of signing up, they probably share the same REQUEST_URL. If we and the client are comfortable with this small leap of faith, we can fill in our data with a query like this (to keep the query simple, we match impressions and signups from the same day):
SELECT s.ID,
       s.USER_ID,
       i.REQUEST_URL
FROM signups s
JOIN impressions i
  ON s.USER_ID = i.USER_ID
 AND DATE(s.CREATED) = DATE(i.CREATED);
Even this simple solution is not without issues. Besides the fact that there’s a margin for error in assuming that the REQUEST_URL from an IMPRESSION will accurately match the SIGNUP, this will likely make our query a lot heavier. Many more users will see a pop-up than will eventually sign up, so this query puts more strain on our system. In my case, we limited the date range so that we weren’t pulling every IMPRESSION ever, but it still took over a minute to run. Still, if those were my only issues, I probably wouldn’t be writing a blog post about this case. So what was the problem?
Pandas
At my company, we store a bunch of different data in different tables and databases. When we have to traverse multiple tables in the same DB, we can use a query like what we wrote above. But if the data is in two different DBs, simple SQL isn’t going to work for us. In my case, I had to match USER_ID in one table to VISITOR_ID in another, but one user can have multiple visitor_ids, so we had to pull those out of a join/junction/link table in another DB and find the corresponding values. It’s confusing, so, long story short: we had to JOIN tables from different DBs. Is that even possible?
I was soon told that, yes, it is possible, and all it requires is importing a library called pandas. On its homepage, pandas describes itself as “a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.” I definitely needed to manipulate some data and, luckily for me, my queries were already being run inside of a Python script that had pandas imported:
import pandas as pd
After a bit of research, I learned that the pandas merge method was exactly what I was looking for. This method can accept two collections of data and perform what is essentially a SQL JOIN on them. I don’t have much experience running SQL inside of Python, but I know enough to realize that we’re going to have to pay attention to how our data is organized. Running queries in a SQL console, my results always come in the form of a table. In Python, my results could be a dictionary, tuple, list, list of dictionaries, or even a more generic iterable that cannot be indexed. What data type did I return when running the query and what type would merge be looking for?
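To ground what merge does before we worry about data types, here’s a minimal sketch with made-up values (not our real schema):

import pandas as pd

# Toy stand-ins for our two event tables
signups = pd.DataFrame([
    {'ID': 1, 'USER_ID': 4444, 'REQUEST_URL': None},
    {'ID': 2, 'USER_ID': 22186, 'REQUEST_URL': None},
])
impressions = pd.DataFrame([
    {'user_id': 4444, 'request_url': 'https://client.example/sale'},
])

# Roughly: SELECT ... FROM signups s LEFT JOIN impressions i ON s.USER_ID = i.user_id
joined = pd.merge(signups, impressions, how='left', left_on='USER_ID', right_on='user_id')
print(joined)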
The first pull used Snowflake. My company has a custom-written series of Snowflake utils, but if you want to do this from scratch, you can check out this post from George Pipis or the Snowflake Python Connector documentation.
with create_snowflake_context(
    environment=ENVIRONMENT,
    database=DB_NAME,
    schema='EVENTS',
    warehouse='MY_WH_NAME',
) as snowflake_context, \
        snowflake_context.cursor(DictCursor) as cursor:
    cursor.execute(SIGNUP_EVENT_DATA_QUERY, params)
    signup_query_rows = cursor.fetchall()
If you’re unfamiliar with the syntax, the convention in Python is to open a DB connection inside a with statement (a context manager) so it gets cleaned up automatically when we’re done. The create_snowflake_context function accepts a series of arguments (it likely wraps the Python Connector but has credentials built in so we don’t have to put those in our code over and over again) and returns a context object we call snowflake_context. Still inside the same with statement, we create a cursor from that context, which is what we use to execute queries against the DB. The DictCursor class is an import that makes each fetched row accessible by column name.
The with statement ends at the colon, and now that we’re inside the block with a defined cursor, we can start making queries. We use execute to run a previously defined query (in the form of a string) called SIGNUP_EVENT_DATA_QUERY. We pass params as a second argument to declare company IDs, date ranges, or whatever other variables we want to insert into our query. The execute method returns the cursor itself, which can’t be read directly; we have to use its fetch methods or loop over it. We want to see all the results of the query, so we assign the return value of cursor.fetchall() to a variable. Because we created the cursor with DictCursor, fetchall returns a list of dictionaries, one per row, keyed by column name. This will be important to remember later.
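If you don’t have a wrapper like create_snowflake_context, the same pull with the plain Snowflake Python Connector looks roughly like this. This is a sketch, not our actual code: the account, credentials, and query text are placeholders.

import snowflake.connector
from snowflake.connector import DictCursor

# Placeholder query; ours was defined elsewhere as SIGNUP_EVENT_DATA_QUERY
SIGNUP_EVENT_DATA_QUERY = """
    SELECT ID, USER_ID, REQUEST_URL, CREATED
    FROM SIGNUPS
    WHERE COMPANY_ID = %(company_id)s AND CREATED >= %(start_date)s
"""

conn = snowflake.connector.connect(
    account='my_account',        # placeholder credentials
    user='my_user',
    password='my_password',
    warehouse='MY_WH_NAME',
    database='MY_DB',
    schema='EVENTS',
)
try:
    cursor = conn.cursor(DictCursor)
    cursor.execute(SIGNUP_EVENT_DATA_QUERY, {'company_id': 123, 'start_date': '2022-01-01'})
    signup_query_rows = cursor.fetchall()  # a list of dicts, thanks to DictCursor
finally:
    conn.close()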
The next piece of data came from MySQL, so we can use MySQL Connector to access that DB. Keep in mind that this with is nested inside the previous one, so everything we set up and fetched above is still in scope:
with MySQLConnection(
    environment=ENVIRONMENT,
    host_key=MYSQL_HOST_KEY,
    mode='r',
) as my_conn, my_conn.cursor() as cursor:
    cursor.execute(
        IMPRESSION_EVENT_DATA_QUERY,
        {'company_id': 123, 'date': '2022-01-01'},
    )
    impression_query_rows = cursor.fetchall()
The syntax here is very similar to what we saw before. We pass arguments about the database (mode='r' because we only need to read), then we create a cursor and run execute. This time we define the second argument inline rather than passing an established variable, and we can see that its keys are field names and its values are what we want to insert into the query. We run fetchall again, so we’ve got another list of rows. Let’s merge them together!
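Again, MySQLConnection is an in-house wrapper. With the plain mysql-connector-python package, a roughly equivalent pull looks like this (hostname, credentials, and query are placeholders). Note the dictionary=True flag, which makes fetchall return rows keyed by column name rather than plain tuples, which will matter in a moment:

import mysql.connector

# Placeholder query; ours was defined elsewhere as IMPRESSION_EVENT_DATA_QUERY
IMPRESSION_EVENT_DATA_QUERY = """
    SELECT user_id, request_url, created
    FROM impressions
    WHERE company_id = %(company_id)s AND created >= %(date)s
"""

conn = mysql.connector.connect(
    host='db.example.internal',  # placeholder connection details
    user='report_reader',
    password='not_a_real_password',
    database='events',
)
try:
    cursor = conn.cursor(dictionary=True)  # rows come back as dicts keyed by column name
    cursor.execute(
        IMPRESSION_EVENT_DATA_QUERY,
        {'company_id': 123, 'date': '2022-01-01'},
    )
    impression_query_rows = cursor.fetchall()
finally:
    conn.close()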
DataFrames
If you skipped ahead and read the documentation, you probably noticed that pandas’ merge method is associated with a DataFrame (hence the documentation being called pandas.DataFrame.merge). If we try to pass plain lists as arguments here, the method will fail. Fortunately, as long as each row carries its column names (dictionaries or named tuples do; plain tuples would need an explicit columns= argument), pandas makes it easy to transform a list into a DataFrame:
pandas_df_from_signup_data = pd.DataFrame(signup_query_rows)
pandas_df_from_impression_data = pd.DataFrame(impression_query_rows)
Once that was set, I could finally use merge:
signup_impression_result = pd.merge(
    pandas_df_from_signup_data,
    pandas_df_from_impression_data,
    how='left',
    left_on='USER_ID',
    right_on='user_id',
)
Just like in a SQL join, we name the two tables (or rather, DataFrames) we want to merge and declare whether it’s going to be a LEFT JOIN, RIGHT JOIN, INNER, OUTER, etc. (the how parameter). You might see some cases where on is simply used as an argument instead of left_on and right_on. That works when the columns we’re merging on have exactly the same names, but it didn’t work for me because column names here are case-sensitive. Since the data was coming from two different DBs, one had column names in all caps and the other was all lowercase. The syntax here means “match USER_ID from the left table with user_id from the right.”
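Another option, which is not what we did but a hypothetical alternative, is to normalize the column names up front so a plain on= works. Note that the two overlapping request URL columns would then collide and come out suffixed as _x and _y, which is part of why keeping the names distinct suited us:

# Uppercase the impression DataFrame column names so both sides match exactly
pandas_df_from_impression_data.columns = [
    col.upper() for col in pandas_df_from_impression_data.columns
]

signup_impression_result = pd.merge(
    pandas_df_from_signup_data,
    pandas_df_from_impression_data,
    how='left',
    on='USER_ID',
)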
After a ton of trial and error, my coworker and I got a DataFrame that contained all the data. But we still had to transform it a few more times. First, we had to rename our columns so they would be properly read later in the code:
signup_impression_result.rename(
    columns={
        "REQUEST_URL": "INCORRECT_REQUEST_URL",
        "request_url": "REQUEST_URL",
    },
    inplace=True,
)
Remember that we were editing an existing script which expected to find data in already-specified columns of our results. Our results had one column called REQUEST_URL, which was populated with the null from the original table, and another called request_url, which had the value we wanted to use. The script was set to look for REQUEST_URL, so we swapped the names to point it in the right direction. inplace is an optional parameter that determines whether rename returns a new DataFrame or modifies the existing one. We only wanted to change our existing DataFrame, so we set it to True, meaning this call returns nothing.
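The equivalent without inplace (just an alternative, not what ended up in our script) reassigns the DataFrame that rename returns:

signup_impression_result = signup_impression_result.rename(
    columns={
        "REQUEST_URL": "INCORRECT_REQUEST_URL",
        "request_url": "REQUEST_URL",
    },
)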
The Joy of Data Types
This project started as a crash course in pandas, but ended up relying heavily on our ability to determine what data types were being passed around in an existing script. After we shaped our DataFrame, we had to turn it back into a list of dictionaries using DataFrame’s to_dict method. Even after we returned that list, we found that our script was failing because this part of the code was expected to feed the results out iteratively through a yield statement. We literally had to write this just to make it match (the 'records' argument determines the format of the resulting list):
rows = signup_impression_result.to_dict('records')
while rows:
    yield rows
    rows = False
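For reference, here’s what the 'records' orientation produces, shown with toy values rather than our real data:

import pandas as pd

df = pd.DataFrame({'USER_ID': [4444, 22186], 'REQUEST_URL': ['/sale', '/home']})
print(df.to_dict('records'))
# [{'USER_ID': 4444, 'REQUEST_URL': '/sale'}, {'USER_ID': 22186, 'REQUEST_URL': '/home'}]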
That run-once while loop seems silly, but we prioritized matching the existing logic so that it would be easier to understand for someone who was already familiar with the script. Our jobs are about making choices and we’ll often have to compromise to get results. If we can solve a problem while maintaining readability and some level of efficiency, we can walk away happy.