After years of working with SQL and optimizing hundreds of poorly written queries, I decided to write this post to encourage use of what I found to be one of the SQL’s most underappreciated features – Common Table Expressions, often known as ‘WITH clause’.
What is a Common Table Expression
In short, CTEs are about taking parts of a SQL query, naming them, and referring to them later. Just like in this example.
WITH some_name AS ( SELECT column_1, max(column_2) as max_2 FROM table_1 GROUP BY column_1 ) SELECT t2.column_3, s.max_2 FROM table_2 t2 JOIN some_name s ON t2.column_1 = s.column_1
If you need more details, have a look here
1. Readability
How many times have you seen queries like this?
SELECT * FROM ( SELECT column_1, max(sum_1) as max_1 FROM ( SELECT column_1, column_2, sum(column_3) as sum_1 FROM table GROUP BY column_1, column_2 ) GROUP BY column_1 ) WHERE max_1 < 3
Obviously, somebody else wrote the query, you have to analyze/modify/fix it, and it is much longer and more complicated than the example above (a few extra joins, dozens of columns etc). So:
- For a complex query, it takes you some time before you even figure out where the logic starts
- You go crazy as you keep scrolling up and down the editor, because the column names are at the top, and a corresponding WHERE condition is at the bottom, 500 lines down.
Isn’t this version easier to read?
WITH q_1 AS ( SELECT column_1, column_2, sum(column_3) as sum_1 FROM table GROUP BY column_1, column_2 ), q_2 AS ( SELECT column_1, max(sum_1) as max_1 FROM q_1 GROUP BY column_1 ) SELECT * FROM q_2 WHERE max_1 < 3
2. Logical flow
In the rewritten query with CTEs the flow goes from the top to the bottom. Furthermore, instead of names like q_1 you can use more meaningful ones like t1_sum (in the first example, subqueries are not named at all). Together with proper commenting, your SQL code will be much better reflecting the business logic.
3. Less code
Let’s have a look the following example:
SELECT t1.col_1, t2.col_3, t3.col_3 FROM t1 JOIN (SELECT DISTINCT col_2, col_3 FROM xyz WHERE col_2 = 2 and col_3 < 15) t2 ON t1.col_2 = t2.col_2 JOIN (SELECT DISTINCT col_2, col_3 FROM xyz WHERE col_2 = 2 and col_3 > 22) t3 ON t1.col_2 = t3.col_3
You probably have noticed, that parts of the query repeat. These can be extracted into CTEs and rewritten like this:
WITH dist AS ( SELECT DISTINCT col_2, col_3 FROM xyz WHERE col_2 = 2 ), t2 AS ( SELECT * FROM dist WHERE col_3 < 15 ), t3 AS ( SELECT * FROM dist WHERE col_2 < 22 ) SELECT t1.col_1, t2.col_3, t3.col_3 FROM t1 JOIN t2 ON t1.col_2 = t2.col_2 JOIN t3 ON t1.col_2 = t3.col_3
In this particular example, the savings in code length might not be so spectacular, but as you start rewriting your queries, you’ll discover that many repeating parts can be eliminated. The authors of the database you’re working with probably also made some observation about repeating, so please read the next section carefully,
4. Performance and hints
Contrary to what many inexperienced users believe, a SQL statement is not executed ‘as a whole’, but it a series of operations like reads, joins, sorts, unions etc, executed most often sequentially, sometimes in parallel to each other, sometimes in a paralleled way. It depends on the specific database engine how CTEs are processed, but proper use of them is a very powerful tool when tuning SQL queries. I will provide 2 examples how Oracle can take advantage of them:
- If a CTE is referred to in the further part of the query more than once, it may get materialized. This means it is saved to a temporary table, so that the preceding operations are not repeated
- Using hints, like MATERIALIZE, it is possible to force saving part of a query to a temporary table. This, in turn, influences all the joins downstream, as the temporary table has no indexes, and is hash-joined whenever the join conditions permit
On the other hand, there are engines like Redshift, which do not materialize CTEs. That should not discourage you from using CTEs to structure your code.
5. CTE recursion
Another feature that is database specific is CTE recursion, or otherwise the ability for the query to reference itself, to generate new rows basing on rows so far. The implementation differs between the databases, but if your worked with CONNECT BY queries in Oracle and now moved to MS SQL server, you’re likely to run into them. Google for more.
No more 10 level subqueries
I hope my arguments were convincing, and will help you write cleaner, better SQL queries. Not only for yourself, but also for those who will be debugging or changing them in the future.