What I Learned at Work this Week: More Difficult SQL Parsing

Mike Diaz
6 min readDec 18, 2021

--

Photo by Jess Bailey Designs from Pexels

This week, I got a request to write a SQL report that counts the number of times users sign up for my company’s service from different pages on a website. Before I started writing, I spent about 15 minutes searching…because I could have sworn I had written a report like that before. It turns out that I had; and I had even written a blog post about it!

That post also used a gif. gifs are fun!

So the good news was that I found my old work. The less-good news is that my use case was a bit more specific this time. Instead of a binary case where the URL source is either the homepage or any other page, I was being asked to track a variety of UTM sources and UTM mediums. This time, I’d have to parse my string to find substrings between two characters. I’d have to learn something new.

UTM Params

Before starting to code, it’s always helpful to understand why we’re being asked to write something. In this case, a client wanted to know the UTM source and UTM medium associated with sign up events. What does that mean?

Here’s a really good resource from Buffer. They explain that UTM stands for Urchin tracking module, named after Urchin Software Corporation, a company acquired by Google in 2005 that laid the groundwork for Google Analytics. UTM parameters can be added to URLs to provide information about how visitors were driven to a website. This is useful for understand what type of marketing is working for a company. Here’s the example provided by the Buffer article:

http://www.example.com/landing-page?utm_source=google&utm_medium=email&utm_campaign=march2012

We have our standard page address, followed by a question mark, then utm_source=, utm_medium=, and utm_campaign=. What do each of these represent?

  • UTM Source: The website that directed a visitor to our site. If someone saw our ad on Facebook, it should say facebook. If someone searched for our site on Google, it’ll say google.
  • UTM Medium: The type of referrer. Twitter and Facebook are social media sites, so if that’s the source, we might want our Medium to say “social.” If someone clicks on a link in a promotional email, we can set it to say “email.”
  • UTM Campaign: This should be the most specific tag, where we can note the specific promotional push that drove traffic to our site. Looking at the Buffer example, that person must have come from an ad sent in March of 2012.

What I didn’t realize until reading this article is that Google Analytics users have a lot of autonomy when it comes to their UTM settings. They can choose the values they want to append for source, medium, and campaign. As long as we stay consistent and choose something that makes sense to us, we’ll be in good shape.

Now that we better understand why this information might be useful to the client, it’s time to write the report. How can we pull these params out of the URL?

SQL String Parsing

Let’s take a look at one of these URL strings:

https://fakedomain.fake/creative?v=a4bdd118&r=https%3A%2F%2Fwww.fakedomain.com%2Fmens-designer-clothing%3Fintctr%3Dhp_g_mens&w=390&h=663&id=41f5d5f488af4ef197330bf1bea08f2f&l=https%3A%2F%2Fwww.fakedomain.com%2Flong-sleeve-logo-tee%2F105519.html%3Fcolor%3D2026&ss_ref=https%3A%2F%2Fwww.fakedomain.com%2F%3Futm_source%3Dpjn%26utm_medium%3Daffiliate%26utm_campaign%3D242132%26cvosrc%3Daffiliate.pepperjam.242132%26clickId%3D3807140982&t=n&pv=2

You might notice that this isn’t technically a URL string. Besides the fact that part of it is encoded, it also has three different https references. This is actually a network request that includes an encoded URL in it. Fortunately for us, the UTM Source and Medium only appear once, so we don’t have to worry about any duplication or grabbing the wrong values.

Code-wise, we’d like to identify the space in this string that says “utm_source” and get the value after that, but before we start something new. In SQL, we can use SUBSTRING to do that:

SUBSTRING(string, start, length)

Now I’ll say that the person who wrote this function certainly had a reason for the way it works. They are almost certainly a better programmer than I and they very well may have written a function that works for the majority of use cases. But this function is a problem for me because I don’t know the length of the string I am trying to grab. My UTM Source could be f-a-c-e-b-o-o-k (8) or t-w-i-t-t-e-r (7) or, like it is in this case p-j-n (3). To figure out the third argument here, we’ll have to get clever. But first, let’s figure out the other two.

Our first argument is simply going to be a reference to our string. The second is the starting index of our substring. In order to determine where to start, we’ll have to use another SQL function: CHARINDEX. This function accepts two arguments:

  1. The substring we’re searching for
  2. The string to pull it from
CHARINDEX('utm_source', REQUEST_URL)

In our case, this returns an index of around 311. If we actually count it out, we’ll see that this is where our argument string starts (note that this index starts at 1, not 0 like we’re accustomed to from other languages). This is critical because in SUBSTRING, we want to pass an argument of where the string utm_source ends, since that’s where our value is going to begin. To illustrate, if we combine the two functions to create a column:

SELECT SUBSTRING(REQUEST_URL,
CHARINDEX('utm_source', REQUEST_URL)
) FROM SIGN_UP_EVENTS;

Our result will be something like this (our target string in bold):

utm_source%3Dpjn%26utm_medium%3Daffiliate%26utm_campaign%3D242132%26cvosrc%3Daffiliate.pepperjam.242132%26clickId%3D3807140982&t=n&pv=2

All we want to pull is pjn, so we’ll have to add a third argument to get the length of our substring right. Before that, we’ll have to adjust our starting point because we don’t want to include utm_source%3D (%3D is an encoded equals sign). To correct this, we’ll increase our starting index by 13 because that’s the length of utm_source%3D. Here’s our updated query and result:

SELECT SUBSTRING(REQUEST_URL,
CHARINDEX('utm_source', REQUEST_URL) + 13
) FROM SIGN_UP_EVENTS;
# Result:
pjn%26utm_medium%3Daffiliate%26utm_campaign%3D242132%26cvosrc%3Daffiliate.pepperjam.242132%26clickId%3D3807140982&t=n&pv=2

To isolate pjn, we could pass a third argument of 3, but that isn’t dynamic. To make this work in my case, I determined that my target string would consistently be followed by %26, the encoded version of &. I’d love to simply list the index of that character as my third argument, but that’s not a length, so here’s what I did:

SELECT SUBSTRING(REQUEST_URL,
CHARINDEX('utm_source', REQUEST_URL) + 13,
CHARINDEX('%26', SUBSTRING(REQUEST_URL,
CHARINDEX('utm_source', REQUEST_URL)) - 14)
FROM SIGN_UP_EVENTS;

To get the space between two substrings, I looked for the CHARINDEX of %26, but rather than using the base string, I used the substring that starts from utm_source. The index of the first instance of %26 after utm_source is 17. That’s because we have to account for the 13 characters in utm_source%3D, then 3 more characters from pjn before we finally hit 17 from the beginning of %26. To get the length of the characters between utm_source%3D and %26, we subtract 14: 13 characters up front and then 1 more to account for the index starting one space after the string we’re looking for.

Building a Solution

There are a few answers on Stack Overflow that do something similar to what I’ve done here, but it’s not always easy to find an explanation for why they work. This may be a pattern that we have to use again in the future, so I wanted to make sure it was written down with thorough reasoning behind the numbers. Feel free to give it a try at home! And don’t hesitate to comment if there’s a better way. I’m always looking for that leg up!

Sources

--

--

Mike Diaz
Mike Diaz

No responses yet