What I Learned at Work this Week: Counting Based on Substrings in PostgreSQL

This customer is a churn risk and they’d really like this report ready by the end of the week, can anybody pick up this ticket? Sometimes it’s our turn to say yes to something even though we’d rather not volunteer. If you’ve ever been there, congrats on making it out on the other side.

At work this week, I was the one who grabbed the tricky ticket and thanks to support from some SQL and DB experts on my team, I got the report done. In retrospect, it wasn’t anything too revolutionary, but it might be worth breaking things down for anyone else who comes across a similar situation in the future. This week, we’ll look at one way to parse, organize, group, and count substrings with PostgreSQL.

I think this post is a good opportunity to write some commands that we can use to build our own sample Postgres table at home. It’s something that I constantly see in tutorials but never practice because I’ve already got established tables in the workplace. I don’t want to use those today, so we’ll instead make our own table. Everything I’m writing today works in SQLite, so if you want to follow along but you have a PostgreSQL interface set up, you can use the SQLite option on repl.it.

First, we’ll create a table:

CREATE TABLE events (
host_id integer,
type text,
request text,
date date
);

My table is named “events” and it has four columns: host_id, type, request and date. The data types are listed next to the column names. Next, we can add some rows to the table:

INSERT INTO events (host_id, type, request, date) VALUES (101, 'WELCOME', 'url=https%3A%2F%2Fwww.fakedomain.com%2Fcheckout%2F&ss_ref=android-app%3A%2F%2Fcom.google.android.gm%2F&t=n&a=78948156165548446512354548984561.pu123con456-1&pv=1', '2021–04–11'), (211, 'THANK YOU', 'url=https%3A%2F%2Fwww.fakedomain.com%2Falthomepage%2F&ss_ref=android-app%3A%2F%2Fcom.google.android.gm%2F&t=n&a=78948156165548446512354548984561.pu123con456-1&pv=1', '2021–04–12'), (101, 'WELCOME', ‘url=https%3A%2F%2Fwww.fakedomain.com%2F&ss_ref=randomref%2F&t=n&a=78948156165548446512354548984561.pu123con456-1&pv=1', '2021–04–13'), 
(101, 'WELCOME', 'url=https%3A%2F%2Fwww.fakedomain.com%2Falthomepage%2F&ss_ref=android-app%3A%2F%2Fcom.google.android.gm%2F&t=n&a=78948156165548446512354548984561.pu123con456-1&pv=1', '2021-04-11');

It’s all coming back to me — this isn’t so bad! Our command starts by declaring which table we are going to INSERT INTO and then takes arguments for the name and order of the values we’ll be declaring. After VALUES, each parenthetical includes, in order, our input for host_id, type, request, and date. What’s up with my requests, you might ask? We’ll get to that later on, but their complexity was part of the problem I faced this week, so I wanted to partially replicate that here.

We can run a simple query on our table to confirm that our commands worked:

SELECT * FROM events;

If you’re on repl.it, the results won’t be super easy to read, but they should be there:

Our client wanted to know how frequently certain events happened on certain pages. They have a homepage (fakedomain.com) and one other page (fakedomain.com/althomepage) that they wanted to put into one category, and then all their other pages in another category. But here’s our first issue — we have a value in our table called “request” which contains a URL, but it’s been encoded and it also includes extra information that isn’t useful to us.

It’s common to want to count how frequently something occurs with SQL. This generally means we’ll end up using COUNT and GROUP BY, which ended up being true in my case. I started out by writing a query like this (note that this didn’t get me all the way there):

SELECT host_id,
type,
COUNT(request),
date
FROM events
WHERE date between ‘2021–04–11’ AND ‘2021–04–13’
AND type = ‘WELCOME’
GROUP BY host_id, type, date;

So we’re pulling each column from our table, but we’re using COUNT on request in hope of returning the number of times a certain type of request is made. We use WHERE to filter by the parameters the client provided and then we use GROUP BY to collapse our results based on these parameters. So if the host_id, type, and date are the same for two rows, they will ultimately be returned as one single row. Thanks to the COUNT in our SELECT section, that one row will contain a number telling us how many requests were found among that group. And here are my results:

The results are in the right shape, but something is missing — we’re supposed to qualify based on the URL found in the request. We grouped together two events with completely different request values! Since I originally chose to exclude request from my GROUP BY arguments, the request did not factor that value in with the grouping. But if we add it in, we know that we’ll be grouping by rows with identical requests. And none of our two requests are the same. What we’re trying to do is group rows based on one part of our request. How can we do that?

This task was especially challenging because I was trying to identify a URL in an http request and URL encoding became one of my major hurdles. If you’re unfamiliar with all the %2F and %3A in my request strings, those are symbols that represent characters that have alternate meanings in an http request (%2F is a forward slash / and %3A is a colon :). Encoding occurs because certain http requests require that non-alphanumeric characters be replaced with ASCII representations, which helps with parsing query parameters (though it does of course make the entire request longer/larger). Common URL characters like a colon or backslash fall into this category, which is why we see their ASCII representations all over my example.

Once I got that straight, I had to look for a pattern that could tell me what URL the request was referencing. As we established earlier, a URL ending in fakedomain.com or /althomepage should fall into one category, and everything else should fall into another. I realized that I could identify the end of the URL (or any other param) by searching for an ampersand (the & character). So if a request looked like this:

https%3A%2F%2Fwww.fakedomain.com%2F&ss_ref=randomref%2F&t=n&a=78948156165548446512354548984561.pu123con456-1&pv=1

That translates to:

https://www.fakedomain.com/&ss_ref=randomref/&t=n&a=78948156165548446512354548984561.pu123con456-1&pv=1

And so we can see that our url is https://www.fakedomain.com/ and, following the pattern, our other params are ss_ref, t, a, and pv. I realized that I could identify my desired URLs by searching for a substring of www.fakedomain.com%2F& because the & would indicate that the parameter had ended and therefore the URL had no additional paths. Getting back to SQL, we can use a CASE statement to conditionally return a value. Take a look at this query:

SELECT host_id,
type,
CASE
WHEN request
LIKE '%www.fakedomain.com\%2F&%'
ESCAPE '\'
THEN 'homepage'
WHEN request
LIKE '%www.fakedomain.com\%2Falthomepage\%2F&%'
ESCAPE '\'
THEN 'homepage'
ELSE ‘non-homepage’ END,
date
FROM events;

We know what everything before CASE and after END means — we’re selecting values from the events table. But once we hit CASE, we’re setting a condition to change our return value once we pull data from the row. So WHEN the value of the request column is LIKE the passed argument, we’ll return ‘homepage’ instead of the whole request. Let’s extract that argument:

'%www.fakedomain.com\%2F&%' ESCAPE '\'

Immediately after LIKE, we see a string that starts and ends with a %. Those symbols act as wildcards, indicating that we’re matching a substring that contains any characters before or after it. So if our string is abc=www.fakedomain…, it’ll match. Likewise for any combination of characters at the end. The key is that it has to include the string we defined in between the % symbols: www.fakedomain.com\%2F&. We know that we’re looking for the domain, the ASCII slash (%2F), and an ampersand all together. But there’s one more wrinkle: the encoded slash includes a % itself, which already has a meaning in the context of our LIKE statement, so we have to add an ESCAPE character to indicate that we’re looking for a literal % and not a wildcard. In PostgreSQL, the backslash is the default escape character, but SQLite doesn’t have a default, so we have to define it. That’s why we add ESCAPE ‘\’ after our LIKE string.

We set two conditions because we have two different strings that would indicate a ‘homepage.’ Anything else falls into the second category, so we just add an ELSE, give it a value, and END our CASE statement. That all fits into the SELECT arguments, so we can move on to looking for date and then end our request with FROM events. Here are the results:

Using CASE helped us build a new table with request_page values that are easier to read and categorize, but they’re all still separate rows. We want to COUNT how frequently an event of the same ID, type, and date happened on our page categories. So could we just add back our COUNT and GROUP BY?

SELECT host_id,
type,
count(CASE
WHEN request
LIKE '%www.fakedomain.com\%2F&%'
ESCAPE '\'
THEN 'homepage'
WHEN request
LIKE '%www.fakedomain.com\%2Falthomepage\%2F&%'
ESCAPE '\'
THEN 'homepage'
ELSE 'non-homepage' END) AS request_page,
date
FROM events
GROUP BY host_id, type, request_page, date;

We want to GROUP BY request_page because we want to collapse rows that have the same request_page value, but there’s a problem here. We can’t include an aggregate function in a GROUP BY statement — running this query will result in an error. What we really need is one column to group and one to count. But if we know whether the row is ‘homepage’ or ‘non-homepage’, it doesn’t really matter what we’re counting. We just want to know that the row exists. So we can alter our query slightly (I’m also removing date from our GROUP BY because our dataset doesn’t include any events with the same host_id, type_ request_page, and date to actually group):

SELECT host_id,
type,
CASE
WHEN request
LIKE '%www.fakedomain.com\%2F&%'
ESCAPE '\'
THEN 'homepage'
WHEN request
LIKE '%www.fakedomain.com\%2Falthomepage\%2F&%'
ESCAPE '\'
THEN 'homepage'
ELSE 'non-homepage' END AS request_page,
count(*)
FROM events
GROUP BY host_id, type, request_page;

If you run this query, you should see that it’s worked to group together two two WELCOME events even though one was on the true homepage and the other was on the alt homepage. It worked!

This is really exciting because, when I originally wrote this query, I thought that I would have to write a subquery to rename the request field and query off of that for the COUNT. Here’s what it looked like:

SELECT parsedURLs.host_id,
parsedURLs.type,
count(parsedURLs.request_page),
parsedURLs.date
FROM (SELECT host_id,
type,
CASE
WHEN request
LIKE '%www.fakedomain.com\%2F&%'
ESCAPE '\'
THEN 'homepage'
WHEN request
LIKE '%www.fakedomain.com\%2Falthomepage\%2F&%'
ESCAPE '\'
THEN 'homepage'
ELSE 'non-homepage' END AS request_page,
date
FROM events) AS parsedURLs
GROUP BY parsedURLs.host_id, parsedURLs.type, parsedURLs.date;

Quite a bit harder to read and more labor-intensive to boot. I’m glad that we had the chance to revisit this problem so soon after I had written my original query. It’s great to be able to come up with a quick answer in the moment, but there’s also a benefit to breaking things down and take it step-by-step. And hopefully this process will better equip us to come up with an efficient solution on an earlier iteration.

Solutions Engineer