What I Learned at Work this Week: The Query Plan

Mike Diaz
5 min read · Feb 12, 2023


Photo by Suzy Hazelwood: https://www.pexels.com/photo/notebook-1226398/

At my job, I’m considered an authority on one specific product: our Events ETL (extract, transform, load) Report. I get a lot of questions about the report, so I check the Airflow logs for the job that runs its script probably a dozen times a day. Clients like to have their data as soon as possible, so the job runs overnight and I expect it to be finished by the time I start working in the morning. But two weeks ago, I found that the job was still running at 9am, and it wouldn’t finish until noon.

Finding a resolution was challenging, but it gave us the opportunity to look closely at the job’s runtime and efficiency. There were a few separate queries that took more than 10 or 20 minutes, and one that regularly took an hour or more. The incident helped my team get some attention from more senior engineers, one of whom pointed out a smart way to improve one of the offending queries.

The Query

In our report, events are often connected to visitors via a visitor_id. If we want to add more descriptive data to the rows, we can pull visitor_metadata from a separate table:

SELECT count(id)
FROM visitor_metadata vm
WHERE DATE(vm.created) = :startDate
ORDER BY created;

This is a pretty simple query, but it can run for a very long time because the visitor_metadata table is gigantic. We received feedback that it would run faster if we did this instead:

SELECT count(id)
FROM visitor_metadata vm
WHERE vm.created >= :startDate
AND vm.created < :startDate + INTERVAL 1 DAY
ORDER BY created;

We can see that the difference here is that we’re using >= and < to build a date range rather than converting each timestamp to a date and then using =. The conversion is necessary if we want to use = because vm.created is a timestamp and we’re giving it a date for comparison. But using the range saves us from having to run the conversion:

# timestamp           |   date
'2023-02-07 10:15:43' != '2023-02-07' # equals implies an exact match, which we don't have here
DATE('2023-02-07 10:15:43') = '2023-02-07' # Using DATE converts the timestamp, so these become equal
'2023-02-07 10:15:43' >= '2023-02-07' # Adding the > will make this True without conversion
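
If you want to sanity check those comparisons yourself, you can run them as plain expressions in a MySQL client. The literals below are just the sample timestamp and date from above, and the column aliases are only there for readability:

SELECT
  '2023-02-07 10:15:43' = '2023-02-07'             AS equal_without_conversion, # 0 (false)
  DATE('2023-02-07 10:15:43') = '2023-02-07'       AS equal_with_conversion,    # 1 (true)
  TIMESTAMP('2023-02-07 10:15:43') >= '2023-02-07' AS in_range;                 # 1 (true)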

A test run of the original query took more than 2 minutes to complete. After the suggested change, the query took less than 1 second. The engineer who came up with this idea explained that the DATE function was running on every single row in the table, while the range comparison lets MySQL evaluate the filter without converting each row first. Clearly the change in runtime meant he was on to something, but how did he know?

EXPLAIN

I’m sure there are a lot of different ways that this engineer could have figured out that using DATE here is inefficient. I didn’t think to ask, so when I was trying to understand it myself, I sought out MySQL’s EXPLAIN Output Format documentation. By simply running your query with EXPLAIN in front of it, you can see details about how it’s being run:

EXPLAIN SELECT count(id)
FROM visitor_metadata vm
WHERE DATE(vm.created) = :startDate
ORDER BY created;

This is the result of that command:
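
The original screenshot isn’t reproduced here, but reconstructed from the columns discussed below, the output looked roughly like this (shown in the vertical layout you’d get by ending the statement with \G):

           id: 1
  select_type: SIMPLE
        table: vm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 130199826
     filtered: 100.00
        Extra: Using where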

Let’s consult the documentation to understand what each of these columns represents:

  • select_type: This has to do with the type of query being run. Ours is SIMPLE because we’re using a SELECT without a UNION or subqueries.
  • table: In this case, we’re only hitting one table, visitor_metadata or vm. But if our query used JOIN to pull in multiple tables, or used subqueries, each table would have its own row in the EXPLAIN result.
  • partitions: Another case of our query being too simple for this to matter. MySQL’s documentation provides more info on partitions.
  • type: This specifically refers to the join type. The possible values range from system and const, which only have to read a single row thanks to indexing, all the way down to ALL. That’s right, ours is the worst one and, according to the docs, either “not good” or “very bad.” ALL refers to the rows that are read for the query: all of them. Because we’re filtering by date and we don’t index by date, this query has to read every single row in visitor_metadata to give us results. That’s bad.
  • possible_keys: Here we’d see a list of all possible indexes for the table. Knowing the available indexes can help us improve our query, but this table doesn’t have any.
  • key: This is the specific index that MySQL actually decided to use for our query. Again, it’s NULL because there are no indexes on the table (there’s a quick way to confirm that, sketched right after this list).
  • key_len: The length, in bytes, of the key we’re using. This can help us understand how the key is being used, since different key checks will be different lengths.
  • ref: This column shows which columns or constants are compared to the index.
  • rows: Finally, something we’ve used. The number here, 130,199,826, is how many rows MySQL estimates it will have to examine to run the query. We’re starting to get an idea of why this takes a long time.
  • filtered: If we can filter out some of the rows from the previous column, it’ll save us time. This value is a percentage: the estimated share of examined rows that will survive the filter. The value we see here, 100, means that zero rows are expected to be filtered out, and we’ll have to go through all 130 million rows for our query.
  • Extra: Finally, we have some extra information about how the query is run. There are a lot of possibilities for this column, but once again our simple query yields a simple answer — we’re Using where to filter out results.
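
That quick check mentioned in the key bullet is just MySQL’s standard SHOW INDEX statement, run against the table from our query; an empty result set confirms there’s nothing for possible_keys or key to report:

SHOW INDEX FROM visitor_metadata;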

Comparing the Queries

Now, for the moment of truth: what looks different if we run the new and (apparently) improved version of the query through EXPLAIN?

EXPLAIN SELECT count(id)
FROM visitor_metadata vm
WHERE vm.created >= :startDate
AND vm.created < :startDate + INTERVAL 1 DAY
ORDER BY created;
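
The screenshot of this second run isn’t reproduced either; per the comparison below, it matched the first run in every column except filtered, roughly:

           id: 1
  select_type: SIMPLE
        table: vm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 130199826
     filtered: 11.11
        Extra: Using where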

Everything is the same…except for one column! filtered went from 100 to 11.11, and we can probably guess why. If DATE has to run on every single row before the WHERE clause can kick in and start eliminating rows, we’re going to read and convert the full created value of every row in this table. Our new syntax filters much more efficiently, likely because a plain range comparison can reject a row without first converting its entire timestamp.
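
There’s one more payoff hinted at by the type and possible_keys columns: the range form of the predicate is also the one that could benefit from an index on created, since MySQL generally can’t use a plain index on a column once that column is wrapped in a function like DATE(). We haven’t actually added an index to this table, and the index name below is made up, but the idea would look something like this:

CREATE INDEX idx_vm_created ON visitor_metadata (created);

EXPLAIN SELECT count(id)
FROM visitor_metadata vm
WHERE vm.created >= :startDate
AND vm.created < :startDate + INTERVAL 1 DAY;
# With the index in place, we'd expect type to improve from ALL to range,
# with possible_keys and key showing idx_vm_created.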

I’ve already started looking at other queries to see if I can improve their performance with the same change. Beyond that, I feel prepared to analyze even more complex queries and really speed up this job. It took a frustrating incident to get there, but now we’re all better for it, and our script is running better than ever.
