If you know someone who, like me, spent a long time working in customer service, you might notice that they don’t instinctively say “yes” when asked for a favor. People who have been yelled at for breaking a “promise” learn to agree to things in slightly less affirmative terms. For example, when someone asked “can you cancel my order?” I wouldn’t say “sure,” but instead “let’s take a look.” Other popular examples are “this should work” and “I’ll give it a shot!”
The instinct to hedge on even the simplest of tasks is still useful for me today. As a Solutions Engineer, I’m often presented with straightforward requests that should take me less than 30 minutes to complete. But at least once a week, one of those requests turns into a much bigger project or investigation.
This week, I was asked to turn a Looker dashboard into a client report. Looker is internal, so the data can’t be accessed by our clients unless it’s also generated in their UI. No problem — just take the SQL that generates the dashboard and plug it into a file that interacts with the UI report service. So I looked up the SQL.
What I found were four separate queries that somehow merge into one table. Three of the queries were straightforward, but the fourth included about six columns that I didn’t see in the final dashboard. In fact, there were columns in that dashboard table that I didn’t see after running any of the individual queries. When I asked around for an explanation of what was missing, so that I could actually build the table from the UI, I was advised to check out the underlying SQL. Not so simple after all.
Collecting Data
I thought to myself “I know where all this data lives, I can write my own query to put it together!” The goal was a table with one row per message, one column for each number of days since the message went out, and a count of clicks in each cell.
In SQL terms, this seems pretty straightforward. We’ll query from a table that tracks each click and group by Message Name and Message Date. If you want to follow along at home, we can create a table with sample data. I’ll be using Snowflake syntax, but at this point it should be pretty standard across SQL variations:
CREATE TEMPORARY TABLE temp1 (message_name string, message_send_datetime datetime, click_datetime datetime);
INSERT INTO temp1 (message_name, message_send_datetime, click_datetime)
VALUES ('Welcome', '2021-12-01 12:00:00.000', '2021-12-01 12:00:05.000'),
('Welcome', '2021-12-01 12:00:00.000', '2021-12-01 12:04:15.000'),
('Welcome', '2021-12-01 12:00:00.000', '2021-12-02 14:59:59.000'),
('Welcome', '2021-12-01 12:00:00.000', '2021-12-19 17:43:05.000'),
('Coupon Reminder', '2021-12-05 12:00:00.000', '2021-12-05 12:00:05.000'),
('Coupon Reminder', '2021-12-05 12:00:00.000', '2021-12-05 15:15:13.000'),
('Coupon Reminder', '2021-12-05 12:00:00.000', '2021-12-06 09:12:10.000');
We want to aggregate this data, so we’re going to write a query with a GROUP BY. It could look something like this:
SELECT message_name,
TO_DATE(click_datetime),
COUNT(TO_DATE(click_datetime))
FROM temp1
GROUP BY message_name, TO_DATE(click_datetime);
The GROUP BY is saying that we want to combine rows where the message_name and the date portion of click_datetime are the same. The third column shows how many clicks we got for each message on each date. Here are my results:
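MESSAGE_NAME     TO_DATE(CLICK_DATETIME)  COUNT(TO_DATE(CLICK_DATETIME))
Welcome          2021-12-01                2
Welcome          2021-12-02                1
Welcome          2021-12-19                1
Coupon Reminder  2021-12-05                2
Coupon Reminder  2021-12-06                1

(Your row order may differ, since we haven’t added an ORDER BY.)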
This contains the data that we want, but it’s not formatted the way the client is expecting it. First of all, they’re not looking for the click date; they’re looking for the number of days between the original message and the click. Let’s update our query to reflect that:
SELECT message_name,
DATEDIFF(DAY, TO_DATE(message_send_datetime), TO_DATE(click_datetime)) AS "Days since send",
COUNT(TO_DATE(click_datetime))
FROM temp1
GROUP BY message_name, TO_DATE(click_datetime), TO_DATE(message_send_datetime);
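With our sample data, this returns:

MESSAGE_NAME     Days since send  COUNT(TO_DATE(CLICK_DATETIME))
Welcome          0                2
Welcome          1                1
Welcome          18               1
Coupon Reminder  0                2
Coupon Reminder  1                1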
DATEDIFF is a function that returns the amount of time between two dates. The first argument is the unit of time we want to count (in this case day, but we can use month, year, hour, minute, or second). For example, DATEDIFF(DAY, TO_DATE('2021-12-01'), TO_DATE('2021-12-19')) returns 18. We’ve finally got numbers in that column instead of dates, but our table still isn’t formatted like the Looker dashboard. How can we transform our rows into columns?
A Pivot Table
I had learned about pivot tables years ago when I thought that learning more about Microsoft Excel would help me advance my career (I wish I had been studying JavaScript then). As it turns out, a pivot table is exactly what we need here, and they’re available in some flavors of SQL!
I was originally writing this query in MySQL, but that language doesn’t actually support the PIVOT function. I was fortunate in that my data had a read replica table in Snowflake, so I transitioned over there. If we want to apply the PIVOT function in Snowflake, we just add it onto the end of our existing query.
I say “just add it,” but this actually took me a lot of trial and error. I finally got something that worked:
SELECT * FROM (
SELECT message_name,
DATEDIFF(DAY,
TO_DATE(message_send_datetime),
TO_DATE(click_datetime)) AS "Days since send",
COUNT(TO_DATE(click_datetime)) AS "Clicks"
FROM temp1
GROUP BY message_name,
TO_DATE(click_datetime),
TO_DATE(message_send_datetime)
)
PIVOT (MAX("Clicks") FOR "Days since send" IN (0, 1, 18));
First of all, we don’t just add it to the end: I had to wrap my original query in a subquery to match the examples I found online, which all started with SELECT * FROM. A slightly altered version of my original query sits inside the subquery. The only difference is that I gave my third column an alias so that it’s easier to reference. This small change made a big difference, because Snowflake had trouble processing my query when I tried to reference the value by repeating the whole expression (typing COUNT(TO_DATE(click_datetime)) instead of the alias).
Outside of the subquery, we add PIVOT, which accepts an argument broken into three parts:
- An aggregate function like SUM, COUNT, or MAX. The function should reference the value we want to present in our new columns. Remember that we’re taking data from rows and pivoting it into columns. My subquery already counted up all the clicks, so I chose MAX("Clicks") because it’s an aggregate that will simply select the only value that exists for each message and day. SUM also works, but COUNT would return 1 instead of 2 for Day 0, because there’s only one row per message and day, even though the actual value of the "Clicks" column is 2.
- Next we add FOR and follow it with the value we want to appear as our new column header. In our original table, we had column headers that described the data, like “Days since send.” Now, we want the header to actually be a value from our table with the values positioned underneath. I chose “Days since send” for this value because I want my column headers to be the number of days, then the values to be the number of clicks.
- Finally, we must add IN and pass the values that will become our column headers. This requires that we know those values ahead of time, so this option is decidedly not dynamic. That’s a real limitation I ran into in practice, but for the sake of this example we can pass in 0, 1, and 18, because those are the only days that had clicks.
Here’s our resulting table:
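MESSAGE_NAME     0  1  18
Welcome          2  1  1
Coupon Reminder  2  1  NULL

The NULL under “18” reflects that Coupon Reminder had no clicks 18 days after its send.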
This table contains the same data as our original, but it uses its space more efficiently, and some would argue it’s easier to read. We can more easily compare the performance of the two messages over time, so long as we understand that column “0” means “0 days after send,” and so on. That’s something we could address by further refining the query.
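For example, if I’ve got the Snowflake syntax right, you can rename the pivot’s output columns by attaching an alias list to the PIVOT itself. Here’s a sketch (the alias list has to name every column in the pivot’s output, including message_name):

SELECT * FROM (
SELECT message_name,
DATEDIFF(DAY,
TO_DATE(message_send_datetime),
TO_DATE(click_datetime)) AS "Days since send",
COUNT(TO_DATE(click_datetime)) AS "Clicks"
FROM temp1
GROUP BY message_name,
TO_DATE(click_datetime),
TO_DATE(message_send_datetime)
)
PIVOT (MAX("Clicks") FOR "Days since send" IN (0, 1, 18))
AS p (message_name, "Day 0", "Day 1", "Day 18");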
A Dynamic Query
We learned something by writing a pivot table, but it turns out it wasn’t very helpful in our case. We had to switch to Snowflake to write it, and then once we did, the syntax was complicated and restrictive. We’re trying to pull data on events occurring over a multi-week period; we can’t be expected to know exactly which columns we’ll need every time we run the query.
If we add more arguments to the IN clause in the PIVOT, we’ll get more columns even when they don’t contain data. For example, this is what’ll happen if we add 2, since we had 0 clicks 2 days after our sends:
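MESSAGE_NAME     0  1  2     18
Welcome          2  1  NULL  1
Coupon Reminder  2  1  NULL  NULL

The new “2” column renders for every row, but it contains nothing except NULLs.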
What we really want is a table that only includes a column when there’s actually data for it; after all, our original query only returns rows that have data. In researching fixes for my PIVOT errors, I found an alternative that solves a bunch of our problems (though it still isn’t perfect). If we use CASE inside our aggregates, we can build each column ourselves and count only the clicks that belong in it:
SELECT
message_name,
COUNT(CASE
WHEN DATEDIFF(DAY, TO_DATE(message_send_datetime), TO_DATE(click_datetime)) = 0
THEN TO_DATE(click_datetime) END) AS "Day 0",
COUNT(CASE
WHEN DATEDIFF(DAY, TO_DATE(message_send_datetime), TO_DATE(click_datetime)) = 1
THEN TO_DATE(click_datetime) END) AS "Day 1",
COUNT(CASE
WHEN DATEDIFF(DAY, TO_DATE(message_send_datetime), TO_DATE(click_datetime)) = 2
THEN TO_DATE(click_datetime) END) AS "Day 2"
FROM temp1
GROUP BY message_name;
If we run this query, we get a similar result, but “Day 2” shows 0 instead of NULL, since COUNT returns 0 when no rows match the CASE. It makes for a much cleaner table, though a messier query. Here’s the result with our sample data:
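MESSAGE_NAME     Day 0  Day 1  Day 2
Welcome          2      1      0
Coupon Reminder  2      1      0

Notice that the Welcome message’s day-18 click has disappeared entirely: we only wrote columns for days 0 through 2, so clicks outside that window are silently dropped.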
Imperfect Answers
The conditional aggregation solution isn’t perfect either because if we want to track messages over, say, a 30-day period, our query will be needlessly long and inefficient. It’s a real shame because our original query, which didn’t do any pivoting, contains all the data neatly. It just doesn’t match exactly what the client originally asked for.
I was working through this problem with a couple of teammates, and one of them suggested checking how important the rows-versus-columns question really was to the client. Would they be happy with the more traditional report? It turns out that they were just fine with it! It’s important to put in the effort and try to work through problems when they arise, but it’s equally important to know when a challenge can be overcome by changing the parameters. The best solution is often the simplest.