What I Learned at Work this Week: Filtering SQL Results with Partition
There’s something unique about writing a beautiful SQL query only to be told that the data…doesn’t look right. WELL THAT’S WHAT THE TABLE SAYS, I say to my computer screen before taking a deep breath and writing a measured response thanking the reviewer for their feedback and promising to look into the discrepancy. It’s a good practice, because 95% of the time I’ve made a grouping mistake or I have to adjust input timezones to make things right. This week, however, was one of the rare cases when my query was accurately reading the DB, but that DB had weird/confusing data.
Here’s the situation, as usual trying to make my domain-specific terminology as accessible as possible: Users of my company’s product can “subscribe” to receive “messages” to their phone by interacting with a “creative,” aka a pop-up that appears when they visit one of our client sites. As you might expect, subscribers, messages, and creatives all have IDs, and often they’ll correspond with each other. If a user signed up via creative 123, then we’d expect them to receive message 456. So I was asked to write a report that showed performance of various creative/message pairings over time. Simplified, the results looked something like this:
creative_id | message_id | purchases | revenue
123 | 456 | 45 | 5,011
888 | 555 | 202 | 45,268
123 | 555 | 1 | 99
We can see the connection here: creative 123 and message 456, and then 888 with 555. But what about the last row? It looks like a small number of people who signed up with creative 123 ended up getting message 555. Because one of those people went on to make a purchase, the data ended up in my report. I checked carefully and found the row in the DB that tracked this particular purchase and could find no evidence that it was generated in error, so the report’s data was accurate. But this is unexpected behavior and it’s confusing for the client. It’s not useful data for the client and they’d rather ignore it. But…how can I ask SQL to ignore real data?
Filter Rules
There are lots of ways to hide or filter data in SQL. Working with a partner, we first thought about setting a minimum on number of purchases before a row could be displayed. This caused a problem because the small number of outliers was relative to the total number of creative interactions. If a creative was live for a week, it might have one rogue purchase. But if it was active for a year, it might have 15. I can’t set a static value to ignore all numbers at that size or we’ll actually be filtering out useful data.
Our next thought was to group the rows by creative ID. Depending on what the client wanted, we could add up the purchase and revenue numbers or just take the max. But if we did that, what message ID would we display? We could use MAX() to pick the greater ID, or since we were working in MYSQL, just leave that column out of the GROUP BY and let abstract SQL logic sort it out, but what we really needed was the ability to pick the ID of the row with the higher revenue. Can SQL do that?
Partition
Per thisPointer, there are a bunch of different ways to get this done. A trick that uses ROW_NUMBER was my favorite.
Let’s say my example table was called creative_message_revenue. We could add ROW_NUMBER to a query and generate a new column that ranks each row:
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY creative_id
ORDER BY revenue DESC
) AS rank_in_creative
FROM creative_message_revenue;
The expected results:
creative_id | message_id | purchases | revenue | rank_in_creative
123 | 456 | 45 | 5,011 | 1
888 | 555 | 202 | 45,268 | 1
123 | 555 | 1 | 99 | 2
As you might expect, ROW_NUMBER generates a number for each row. We use OVER to determine how the numbers should be assigned. ORDER BY is responsible for assigning the logic behind how the rows are counted. If we made this the only argument, each row would have its own number as the count increases based on ORDER BY. But in this case, we want to create a ranking among rows that share the same creative ID, so we add a PARTITION argument. When we partition a table, we are creating a series of smaller tables on which we can run independent operations. Here we’re partitioning out our results into groups with common creative IDs. That’s why the results show two rows with the same rank_in_creative value — they were part of different partitions.
You might already see what we’re going to do here. The rows we want to keep are the ones with a rank of 1, because they have the largest value for revenue. That third row is an outlier which we’d like to ignore, which can be filtered out as part of a larger query:
SELECT creative_id,
message_id,
purchases,
revenue
FROM (
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY creative_id
ORDER BY revenue DESC
) AS rank_in_creative
FROM creative_message_revenue
) ranked_results
WHERE ranked_results.rank_in_creative = 1;
Rather than using GROUP BY to consolidate our data, we used PARTITION and ROW_NUMBER to help rank it and pick from the top.
Path of least resistance
As always, I went on a journey to end up with this neat resolution. We had to investigate why the data wasn’t what we expected and we checked for other DBs that might be reliable. We wondered if there was a join table we could use to associate creatives and messages in a different way (there wasn’t). I got on a call with a teammate who was working directly with the company to ask how important that outlier data was, which is when we determined that it could just be discarded. I worked with someone else to brainstorm SQL solutions and, when we got stuck, called it a day and walked away for a while. Even today when I was trying to code out my solution, I realized that ROW_NUMBER isn’t available in the version of MYSQL that I’m using and I had to transfer the whole query to Snowflake (which in fairness wasn’t that bad). Just another example of a solution that’s part code, part communication, and part ingenuity.
Sources
- MySQL select row with max value for each group, thisPointer
- ROW_NUMBER, Snowflake documentation