What I Learned at Work this Week: A SQL Matching Query

Photo by Sebastian Sørensen from Pexels

On Friday afternoon, a coworker and I were trying to better understand one of our company’s products: an integration that triggers conditional logic based on the product ID that a user has interacted with. So if someone browses onto a shirt, they’ll get one type of message and if they view pants, they’ll receive something else. The category is determined by that aforementioned product ID.

The critical part on our side is making sure that we collect and “correct” product ID. There are a lot of different values on a webpage, but only one set will match the number used for categorization on the client’s side (or, more accurately, the third party that handles inventory categorization’s side). If our script doesn’t capture the matching value, the product won’t be properly categorized and won’t trigger any message at all.

We ended up working together to write some SQL that gave us helpful information about how to improve our match rate, but the query wasn’t all that complicated. So, to flesh out this entry, I’ll provide a bit of detail on the process that brought us to the coded solution.

In problem solving, the first step is often the toughest. As engineers, we learn to parse and manipulate information in a bunch of different ways, but what do we do when we don’t have the information handy yet? My coworker and I started with a Looker dashboard. Basically, it said this:

Match rate: 2%

That came along with the prompt “this percentage looks low. Can you raise it?” Okay, awesome.

So the first question is: how is this dashboard populated? Where does this data come from? I was very fortunate because my coworker had already been investigating and identified someone on the team that developed this product. Here’s one thing I learned this week (it seems obvious so forgive me): Looker just uses SQL queries to populate a lot of its charts and dashboards.

If you’re not familiar, Looker is a web-based visual data platform

My company is on the larger side, so tracking down the person who knows the answer to your question isn’t always easy. We had the benefit of receiving a presentation when this particular product was rolled out, so going to the person who gave that presentation was a good starting point. Ultimately, he was able to give us a SQL query that would generate a match rate like the one we were being asked to improve. Once we had that, we could break down the data.

Here’s a simplified version of what we received:

SELECT event_table.company_id,
TO_DATE(event_datetime) AS curr_day,
count_if(cat.product_origin_id = event_table.product_sku) / count(*)
FROM product_view_events event_table
LEFT JOIN (SELECT company_id, product_id
FROM catalog_product_ids
GROUP BY company_id, product_origin_id) cat
ON cat.company_id = event_table.company_id AND cat.product_origin_id = event_table.product_sku
WHERE TO_DATE(event_datetime) >= '2021-10-07'
AND TO_DATE(event_datetime) < '2021-10-08'
AND event_table.company_id = :company_id
GROUP BY event_table.company_id, curr_day;

This is much easier to read in an IDE, so don’t worry if you’re having trouble following it. In any case, the important thing here is that we got the table names. Now I can write queries of my own to start exploring the data!

Since we’re here, let’s take a look at what this query is actually doing:

After SELECT, we should expect to see the columns our table will ultimately return. In this case, we’re going to get a company_id, a date (AS curr_day), and the result of a mathematical equation. It’s a little tough to read because it’s split between two lines but if we start from the beginning, we see a count_if based on two product ID values being equal. This makes sense since our whole system is built on matching.

If the values from two different tables match, then we want to add to our count. The number we end up with will be divided by (indicated by the slash /) the total number of rows (count(*)). If it’s not yet clear, we’re calculating the percentage of rows where our collected product ID equals that found in the catalog written by the third party.

We’re able to compare these two values because we JOIN the two tables together when the company ID and product IDs are identical. But that’s where things get tricky. The purpose of this table is to tell us when those values don’t match. So how could we have a matching condition as we’re building the table?

When I first learned SQL, I had a lot of trouble with JOINs. I didn’t understand how the tables would fit together and, worse of all, there were a bunch of different kinds of JOIN commands:

Image credit of Zahin Rahman on LearnSQL.com

We know that JOIN is used to temporarily merge two tables into one as part of a query. For example, if my tables looked like this:

TABLE_ONE                        TABLE_TWO
ID | PRODUCT_ID | NAME ID | PRODUCT_ID | PRICE
1 | 1AD3 | RED SHIRT 1 | 1AD3 | 21
2 | 223T | GREEN SHIRT 2 | 223T | 15
3 | R652 | BLUE PANTS 3 | CCCC | 65

and I wanted a result that included both NAME and PRICE, I could JOIN the two tables like this:

SELECT PRODUCT_ID,
one.NAME,
two.PRICE
FROM TABLE_ONE one JOIN TABLE_TWO two
ON one.PRODUCT_ID = two.PRODUCT_ID

I would get a result like this:

PRODUCT_ID | one.NAME    | two.PRICE
1AD3 | RED SHIRT | 21
223T | GREEN SHIRT | 15

So we’ve connected name and price, but we’ve lost some data points. What about products R652 and CCCC? Since we used JOIN, the rows that didn’t have a matching PRODUCT_ID in the other table were removed from our final result. That’s what the INNER JOIN label means in the graphic above — we’re just getting that little blue slice here (if we use JOIN without any preposition, it’ll default to INNER JOIN).

But, in our original query, we want to find rows where there isn’t a product ID match. And that’s why we use LEFT JOIN instead of just JOIN. That’s going to populate a result with one row for every row in the “left” table in the join (aka the table that comes before we type our JOIN). Let’s look at the first part of that query again:

SELECT event_table.company_id,
TO_DATE(event_datetime) AS curr_day,
count_if(cat.product_origin_id = event_table.product_sku) / count(*)
FROM product_view_events event_table
LEFT JOIN (SELECT company_id, product_id
FROM catalog_product_ids
GROUP BY company_id, product_origin_id) cat
ON cat.company_id = event_table.company_id AND cat.product_origin_id = event_table.product_sku

The table before LEFT JOIN is product_view_events, which we have given the alias event_table. This table marks every single product-related event which might trigger a message send. Our goal is for 100% of these events to capture a product ID that matches one found in the catalog_product_ids table. But if our captured ID does not have a match, it will still generate a row in our resulting table with a column for cat.product_origin_id, but that field will be null.

This is part of what makes SQL tricky. A lot of this is happening behind the scenes, so we never actually get to see that table with the null values. But it’s important because on line 3, we are counting where cat.product_origin_id is equal to event_table.product_sku. Because of our LEFT JOIN, we’ll have rows in our joined table where the former is null and the latter isn’t, which will give us the feedback we need to get a match rate!

This query is a good example of the value we can get from data being missing. Once I understood that a null result gave me the information I needed, I could seek those out and write other queries to more specifically show me when this was happening. Experienced engineers are successful because they know the tech, the language, and the pipelines, but they also have the benefit of understanding how certain problems can be solved. Hopefully we got a little better at that today.

Solutions Engineer