What I Learned at Work this Week: SQL Subselects

This week, there was some discussion in the office about the costs and benefits of subselects in SQL. As usual, the conclusion is that subselects are useful in certain situations, but unnecessary or detrimental to runtime in other situations. I had trouble following the conversation because, as you might have guessed, I wasn’t sure what we were talking about! We studied SQL for about two days during my Flatiron School course and I had retained very little of that knowledge. But subqueries are certainly useful and it’s always a good time to brush up on SQL!

SQL stands for Structured Query Language and we use it to query a database. When we write SQL, we ask a question that can best be answered by looking through a bunch of digital data that’s been organized into rows and columns:

SELECT * FROM purchases WHERE coupon_used=FALSE;

Even if your SQL is rusty like mine was or if you’ve never studied it, there’s a good chance that you can read this. We want to SELECT and display all columns (all comes from the asterisk, or wildcard *) FROM rows in a table called purchases WHERE a column called coupon_used has a value of FALSE. My hypothetical result might look like this:

Query and result — voila!

Now that we’re up to date on queries, the definition of a subquery (or nested query) is about what you’d expect: it’s a query that has another query inside of it. In our previous example, we pulled data for purchases that were made without coupons (coupon_used=FALSE). That was really straightforward because coupon_used is a column in the purchases table. But what if we wanted to filter purchases by a value that isn’t contained in this table?

Our purchases table references users with IDs, so it stands to reason that there’s another table called users that contains additional data about…users. Let’s say that we want to pull results from our purchases table for users who are part of our Premium Rewards Program. That’s a problem because premium_rewards isn’t a column in the purchases table. We can only query from a table using values from that same table!

These tales are making me dizzy.

We want to know about purchases made by users who fall into a certain group — it would be very helpful if we could isolate the user_id for each of those users and use it in a SELECT statement. Something like this:

SELECT * FROM purchases WHERE user_id IN (18, 19, 365);

The problem here is that we don’t know the user_ids of all users who have signed up for the Premium Rewards Program. Of course we could go over to the users table, query it for that information, write down the numbers, and then put them into the above query, but that’s pretty impractical. And of course if any new users sign up for the program, we’ll have to run the first query all over again before running the second one. A subquery will allow us to run both queries at the same time:

SELECT * FROM purchases WHERE user_id IN (SELECT id FROM users WHERE premium_rewards=TRUE);

The outer query will use the information from the inner query, which selects a list of ids. When all is said and done, we’ll just see the purchases associated with those ids.

The SQL savvy may have been waiting for this one, as using JOIN is a common way to cross-reference data across tables. Rather than making a query to two different tables, we can use JOIN to combine the two tables:

SELECT * FROM purchases p JOIN users u ON p.user_id=u.id WHERE premium_rewards=TRUE;

When we combine our two tables, we must establish how the various rows correspond. In this case, we’re joining the two tables by saying that rows match if the user_id value from purchases matches the id value from users. The p and u after purchases and users create variables that can be referenced later in the query.

We can imagine that using subqueries allows us to reference two different tables within the same query. Using JOIN is more like selecting from one larger table. In this example, the purchases table has columns for id, user_id, total, and coupon_used, but JOIN gives us access to a table with rows that look like this:

Not only do we see the data from our purchases table (p.id, etc), but we see the corresponding data from our users table so long as id from users matches user_id from purchases. At this point, it’s easy to filter by premium_rewards=TRUE because it’s just a column in our larger table.

Though we think of them differently, JOINs and subqueries often run similar logic behind the scenes, merging two tables together and then querying from the larger dataset. JOINs are often more efficient, but there are some situations where only subqueries will get the job done (usually when we’re using our query to aggregate data).

Since subqueries can put more stress our our systems than JOINs, we have to be thoughtful when we use them. In the workplace when dozens or hundreds of engineers could be running simultaneous SQL queries on sprawling databases, this risk is magnified. Hence the concern among my team at unnecessary use of subqueries. They’re not outright banned in our workplace, but we’ll often see refactor requests if they’re included in PRs.

If you want to better understand where subqueries are necessary or how they differ from JOINs, check out some of the resources I used to write this article. There’s a lot to learn!

Solutions Engineer