What I Learned at Work this Week: SQL Table Partitioning
The tech team at my workplace had a great learning opportunity this week, as one of our Data Engineers gave a presentation on DB Best Practices. Of course I tried my best to follow along, but my inexperience betrayed me at times. Once I gathered up the courage to follow up and ask the presenter for clarification on one subject, the presentation started to make a lot more sense. But when he asked me if a certain table was partitioned, I got lost again. Blog time!
What is a partitioned table?
Data is incredibly valuable and important to a business, which means businesses will usually collect as much of it as possible. Tables with millions of rows tracking transactions, sign-ups, or impressions are common, but they can still be labor-intensive to query. DB architects work to balance table specificity with usefulness, but they can’t account for every possible query. Table partitioning allows us to split up existing tables to help save our lookups from being too bloated. For example, a partitioned table can turn a query of every transaction ever (millions) to every transaction this year (hundreds of thousands).
A giant table
Let’s consider a “transactions” table. For simplicity’s sake, let’s assume it looks like this:
Partitioning has the potential to help (at least) two things: preventing human error/oversight/confusion and improving query speed. Make note that a partitioned table isn’t necessarily faster to query, but it gives us additional parameters that can help us parse less data and improve performance. In any case, a table with four rows isn’t going to need much optimization for human or SQL readers, so we can use this as an example and imagine a similar table with 40 million rows.
There may be some instances where we would want to query every row in our table, like if we wanted to know our total all-time sales numbers. The query would look like this:
SELECT SUM(amount) FROM transactions;
This query will inevitably read every single row from our table because we’re asking for information from every row of our table. But what if we wanted to generate monthly reports? For October, maybe we’d change our query to this:
SELECT SUM(amount) FROM transactions WHERE MONTH(timestamp) = 10;
This is a different story — we don’t want data from the first row, but our table is still going to read the first row to check if its month is 10. That’s inefficient in a table with 4 rows, but when our table gets up to 40 million, it’s a real problem.
Using Table Partitioning
When we use table partitioning, we take one table and assign its rows into groups called partitions. The rows are not duplicated or placed into a new table, but remain in their original placement with a bit of extra information associated with them that declares which partition they are a part of.
Table partitioning works by looking at a partition column — the column in a table that will be compared among rows. In our example, we would want to use date as our partition column since it contains the data that we’d like to organize by. This column may also be called the partition key. If we choose a partition column that is often used as a filter in queries of this table, SQL server will be able to access only the relevant partitions, which can improve performance. This is called partition elimination.
Table partitioning requires a partition function and a partition scheme. The partition function indicates how we would like our data to be divided. I want to divide my transactions data by month, so I’ll write a partition function that looks like this:
CREATE PARTITION FUNCTION pf_TransactionsByMonth (DATE)
AS RANGE RIGHT FOR VALUES
(‘2020–08–01’, ‘2020–09–01’, ‘2020–10–01’);
We write the CREATE PARTITION FUNCTION command and then name our function pf_TransactionsByMonth. The argument after declares what type of parameter we’ll be providing — it’s DATE in this case because we’re going to be looking at a column with dates in it.
Next we declare whether RANGE is representing an upper bound or lower bound. Since we used RIGHT, it means that we’re looking to make August 1st the lower bound of our partition, meaning everything we collect will be greater than August 1st while still being less than the next boundary value. Remember that, with dates, greater than refers to a date in the future and less than refers to a date in the past.
After declaring the configuration for our RANGE, we declare the values we want to partition by. My imaginary table only covers three months, so I’ve used those.
A partition scheme maps partitions to filegroups. We won’t get into what that means here, but we need a partition scheme to make this function work, so we’ll write a very simple, straightforward one:
CREATE PARTITION SCHEME ps_TransactionsByMonth
AS PARTITION pf_TransactionsByMonth
ALL TO ([Primary]);
Our first line gives our PARTITION SCHEME a name (I used the same convention as I did for the partition function, just changing the prefix from pf to ps). The second line references the partition function we’d like to use for this scheme. And the third line declares which partitions (we chose ALL) to map to which file group (we chose a group called Primary). This code would get more complicated if we wanted to send certain partitions into different filegroups. For example, if more recent partitions were queried more often, we could organize them strategically.
When we create our table, we can invoke our partition scheme, which in turn invokes our partition function and parses out our data.
How does this help?
I mentioned earlier that partitioning can help in two ways, the first of which is to prevent human error or oversight. If our August transaction was just a test and we didn’t want to start counting until October, we could use our partition scheme to keep that data in a different filegroup and prevent it from cluttering other queries. Likewise, if our October data wasn’t yet complete or audited because we’re in the middle of the month, we could use partitioning to exclude that.
Partitioning can certainly help optimize query efficiency, but it must be used in conjunction with the appropriate queries to be effective. If we search through a partitioned table without utilizing the partitions, it will still take a long time to query every row, and could actually take longer than normal in some cases. As with most coding methods, partitioning is a tool that’s only effective when used properly. If you’re up for the challenge, take a look at Kendra Little’s video about partitioning to learn why it’s not always the most effective tool for a job. My other sources for this week aren’t quite as difficult to understand, so feel free to use those to clarify the other points covered here.