What I Learned at Work this Week: Strategically Merge DataFrames

Mike Diaz
5 min readAug 27, 2023

--

Photo by cottonbro studio: https://www.pexels.com/photo/woman-in-white-dress-choosing-tarot-cards-6995906/

Thanks to my experience with the ETL job I maintain, I’ve sort of become “the scripting person” on my team. On one hand, it’s a good thing because I’m good at it and I know what I’m doing. But on the other hand, it’s important to grow and I should try to step out of my comfort zone more often. In the coming weeks and months, I’m hoping to do that and, if you’re a loyal reader, I hope it means you see more Java content in that time. This past week, however, was all Python for me, and I still managed to learn something new.

The Job

For the uninitiated, my company sends text or email messages on behalf of retailers encouraging customers to shop. Usually the message will say something like “use this code to save 20% today!” but there are a lot of customization options, including sending a message based on the price or stock of an item. For example, we may want to send a message to a subset of subscribers who have viewed a certain product telling them “the price of this product has dropped by $20, now is a good time to shop!”

To do that, we have to have accurate, up-to-date inventory information on the products our client is offering. We have a product that ingests this information, but the client must send us a file in ndjson format for it to work. As you might expect, some companies are better than others at organizing and formatting their data, so we’re often asked what level of flexibility we have for the ingestion.

My task this week was to take data in the format the client is able to provide and transform it into the ndjson object we seek. This client has a products API that will return general product data, price data, and stock data depending on the endpoint used. So my script would need to hit all three endpoints, merge the data on SKU (short for stock keeping unit, like a product_id), and drop an ndjson object into s3.

The Problem

It took some doing to connect to the API (credit to the client, they have a few layers of security so I had to fetch a rotating auth token and then fetch our company’s assigned creds from where we stored them in Secrets Manager), but ultimately I was able to put the file together. Afterward, the client told us about one more wrinkle:

The catalog we get from the API returns a large group of products, some of which are no longer sold by the client. We want to specify which products are “sellable,” but there’s no field or property in the returned data that reflects this. Instead, if we add a parameter to our API call, we can get a return that includes only sellable products. So if this call gave us all products:

http://api.us.company.com/us/products

Then this one will give us only the sellable products:

http://api.us.company.com/us/products?isSellable=true

We know that we’ll have to make at least two calls to the API: one to collect the products that are sellable and a second to collect those that aren’t. But there’s no parameter that will return only unsellable products, our two options are “all products” or “only sellable products.” The problem is: how can we create a collection that indicates which products are sellable and which aren’t?

Pandas Time

Let’s start with the assumption that we’ve run our two API calls and converted the results into two DataFrames. It’s possible to merge the DataFrames like we would do a JOIN in SQL, but I opted not to go down this road because there was a lot of data in both objects and I really only wanted to put one label from the right onto the left. Instead, I did something that might have been needlessly complicated, but felt more efficient.

First, I extracted a list of SKUs from the smaller DataFrame. This represents every product that is “sellable.”

sellable_products_skus = pd.DataFrame(sellable_products_json['products'])[
'sku'
].tolist()

sellable_products_json is the raw return from the API, which has a top-level key of products. We want to pull the associated value because that’s what has the product data. Using pd.DataFrame() transforms the json into a DataFrame and running ['sku'] on the DataFrame returns just the values from the “column” called SKU. Finally, tolist() turns it into a Python list.

My idea here is to say “if a SKU is in this list, mark it as sellable. If not, mark that false.” Here’s the line that does that with the larger collection:

all_products_df['is_sellable'] = all_products_df.apply(
lambda x: 'true' if x['sku'] in sellable_products_skus else 'false',
axis=1,
)

On the left side, we’re adding a new column called is_sellable. To determine the value in each row, we use the pandas apply function, which applies a given function across an axis of a DataFrame. The function we’re passing is a lambda, also known as an anonymous function. The first part of the lambda, before the colon, is a variable representing the arguments passed. In our case, the argument is going to be each row in the DataFrame, so we can see after the if in our statement, that we’re checking the sku value for the row:

if x['sku'] in sellable_products_skus

This is the crux of our function: for each row, is the SKU contained in the list we previously built, which should contain all the SKUs for sellable products. If yes, we’ll populate the string value 'true', otherwise, 'false'. I made them strings because that ended up being easier to write to the DataFrame.

Note that there’s a second argument in apply, where we pass axis=1. This is an optional argument, but if we don’t pass it, it will default to 0 instead of 1. When 0, we’re operating on a vertical axis, so the function will apply to each column. By switching to 1, we apply the function to each row, which is what we want. Tada, we’ve got a new pseduo-boolean column!

Algorithms

This exercise reminded me of doing interview practice because there multiple possible approaches to this open-ended problem. Honestly, I’m not sure if what I came up with was the most efficient solution, but I like it because it’s readable and…it works! I’m curious if my readers would prefer the merge or an alternate option. Feel free to comment with feedback! We’re all learning together.

Sources

--

--