Originally posted here SQL do’s and don’ts
- Why bother?
- Patterns to watch out for
- SELECT *
- The use of DISTINCT
- UNION vs. UNION ALL
- Abuse of CASE WHEN
- References
Why bother?
When I first learned SQL, my mindset was mostly “Meh, if it runs, it’s done”. Over time and with some reality checks, I have gone to the opposite of the spectrum and learned to — for lack of a better word —‘scrutinize’ SQL queries, for 2 main reasons:
- Accuracy (wrong joins, wrong aggregation, etc.)
- Performance (imagine a production run that take hours to complete)
There are financial costs involved, and there are also other hidden costs in both scenarios. Each minute spent on debugging the problematic query is a minute you cannot take back, and it prevents your team from doing other meaningful work, not to mention — impatient data consumers is no fun, and each time this happens, the trust of the organization in data team dies a little.
So let’s dive in with some typical SQL patterns to watch out for.
Patterns to watch out for
- SELECT *
- DISTINCT, frequent misuses, and solutions
- UNION vs UNION ALL
- CASE WHEN, when to use and not to use it
There are many more, let’s focus on these first in this post.
SELECT *
Simply put, for performance sake, just no. For more ‘but why’, give this read¹ a go.
SELECT <only the columns you need> FROM table
The use of DISTINCT
DISTINCT is often misused
Here are some common cases that result in expensive queries, wrong result, or both.
using incorrect joins resulting in duplicated rows — so you use distinct
to solve the problem. This looks like this: SELECT DISTINCT * FROM table
or SELECT DISTINCT [a bunch of columns] FROM table
mistaking SCD type 2 table to be “duplicates” and apply distinct
to extract “unique” record. This results in data loss and misleading analysis.
combining SELECT DISTINCT
with GROUP BY
with aggregation… just because you can. The distinct
clause in this case is redundant and should be removed.
DISTINCT deduplicates records, but it is costly². So it should be used sparingly and with valid reasons (else you may end up with wrong result). Above cases are not valid use cases of distinct
. If you find yourself using any of the above, it suggests a second look at your table’s primary key, or review your use of joins (also related to misunderstanding of the primary keys and the relationship among tables in your dataset).
A note on hidden “distinct” execution — SELECT a FROM table GROUP BY a
. In this case, the query return similar output to SELECT DISTINCT a FROM table
but at the cost of “clarity of intent”. A reader might easily bypass the last GROUP BY
clause and treat it just as another regular SELECT
clause. Unless you have to use the HAVING
clause at the end, or there is some significant performance improvement, there is no valid reason to use this.
Valid cases for DISTINCT
There are very few valid use cases for distinct.
You try to find distinct values of a column1
# this works great SELECT DISTINCT status FROM sale_orders# An alternative, NOT faster, but add insights to your EDASELECT
status,
count(*)
FROM sale_orders
GROUP BY 1
You are deduplicating because there is actual duplications, like identical user records with same created timestamp
and everything. This is rare, and worth investigating the root cause to prevent more records like so.
You are counting the unique values of a column with COUNT(DISTINCT something)
Beware of using DISTINCT on SCD type 2 tables
In the case of SCD type 2³ , it’s important to look at specific question you try to answer, rather than just blanket distinct
and get a wrong result to your analysis (despite the fact that performance-wise, distinct
will beat some of the below queries) — performance should not be at the expense of accuracy.
Suppose we are having a user
table that stores all historical records per each user.
# This query returns all emails a user has used but distract the analysis from the fact that a user has changed their active email address over time:SELECT DISTINCT user_id, user_email FROM user# If you only care about latest active email of each user, use ROW_NUMBER() and QUALIFY instead, this will return only SUBSET of your historical records (active email only):SELECT
user_id,
user_email,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at DESC) as rn
FROM user
QUALIFY rn = 1# Depends on your business use case, sometimes it's best to keep all historical records. This will return FULL records, with an identifying column for the latest record:SELECT
user_id,
user_email,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at DESC) = 1 as is_latest_recordFROM user
UNION vs. UNION ALL
Using union
is similar to using distinct
in the process (it will scan all rows in your CTEs, stack them together, then perform de-duplications before returning final output, whereas union all
doesn’t do de-duplication and simply return all results). Whenever there is such distinct sorting⁴ involved, you want to make sure the use case is valid, and there is no better way of doing so.
Instead of UNION
, make sure deduplication logic is done upstream and is valid for the business use case, then use UNION ALL
instead.
Abuse of CASE WHEN
I have seen SQL script with a 1000+ lines of just CASE WHEN... THEN...AND...OR...
at worst, though to be fair some of it was due to linting. Digging a bit deeper, folks often use it for:
Mapping endless values
My re-enactment of an exemplary real-world abuse of CASE WHEN. The lack of linting is just 10% of the pain when you read these.
Aggregation at different granularities — and to avoid wrong aggregation in downstream tables / by unaware end-users, case when
is used.
Some other uses that will unnecessarily stress your computing warehouse, despite looking harmless with small record tables, but little wins here and there can go a long way when your business (and data) grows.
Combining aggregation with CASE WHEN and apply it as a filter condition. Instead of this — apply CTE to define the aggregation and use WHERE instead.
CASE WHEN is a handy function to apply branching in logics, if this then that. But imagine applying it onto a table with 1000 rows, and you have just 10 scenarios of when
, that’s already 10,000 iterations for your data warehouse to handle.
In real world, our tables are rarely just 1000 records, if at all, it will grow with time, to millions, at which point it has become 10,000,000 iterations. Not to mention when you don’t do a good enough job of data input quality control, you will always have to add more when
to your query to account for new creative inputs. Behold query that runs forever.
This means applying CASE WHEN
should be considered carefully, unless you want to sit on several hours of table transformation and hear end-user complaints on stale dashboards, or just go back and forth with warehouse re-sizing on infrastructure that doesn’t do auto-scaling.
For more deep dive into execution process behind CASE WHEN
you can check out this site⁵.
Some tips for CASE WHEN:
Only apply it when you have a limited, controlled list of branching logic. For instance you know that there will only be 3 distinct values of order_status
: drafted
, ordered & unpaid
, paid
. But for some reason your raw data is numerical 1,2,3
for each order status, and to make it human-friendly you want to map it to those 3 readable statuses, then case when
it away.
When it’s likely the branching logic will grow, for instance you have a list of business_partner_email
that you want to group into business_partner_domain
or some other categories for report purposes, there are a few options:
- use regex functions if there are exploitable patterns.
- or leverage business owner’s insights, empower them with a google sheet to maintain the mapping of categories, and include it in your daily transformation job and integrate the logic via
JOIN
instead of endlessCASE WHEN
When it’s a matter of aggregating at different granularities, ask yourself “Why?” first instead of marching forward. Most of the times it’s very likely due to a poorly thought-out model.
An example:
- You have
order
table, withorder_id
as primary key (PK). - You also have
order_line_item
table, withorder_line_item_id
as PK — representing each purchasing item in an order basket. The lowest granularity in this case isorder_line_item_id
- You want to have a
fact_orders
that store both purchasing item information and the order information as well — this suggests pulling from these 2 tables, and the PK will beorder_line_item_id
, then somehow you also want to include ancustomer_lifetime_order_value
column in there — which can only be calculated using:SUM(order_value) OVER(PARTITION BY customer_id)
. But then you realize it would duplicate this value per each row of order line item. Without thinking it through you might end up with something looking like this, because you want to avoid wrong aggregation downstream:
CASE WHEN ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY some_timestamp) = 1
THEN SUM(order_value) OVER(PARTITION BY customer_id)
ELSE 0
END as customer_lifetime_order_value
While this kind of logic isn’t wrong in this table, there are some inherent risks:
- an unaware developer of downstream tables might just use an average on this column to get
avg_lifetime_order_value
, so it will be inclusive of all the 0 and yield wrong result. Even with the awareness of all those zeros, they would end up working around that problem with another clunkyCASE WHEN
logic. Imagine every downstream layer has to apply this to get accurate result. - if you are thinking “If so I would not apply such case when logic and keep them all duplicated” to avoid wrong average downstream, then you face another potential scenario: innocent divisions, like so
SELECT
customer_id,
sum(customer_lifetime_order_value) / count(distinct order_id)
as avg_order_valueFROM fact_orders
GROUP BY 1
— in this case, you end up with very wrong result as the sum has already multiplied the lifetime order value by several times.
- unaware report user doing the same thing on visualization tool, even with the best of intention and documentations.
- ultimately, why add such unnecessary complications (plus query time)?
An alternative: isolate customer related attributes to form a fact_customers
table — which is better organization of data for different analysis use cases, with clearly defined primary keys of each table, and aggregations at that PK level. Not only is it neater, it also helps you avoid performance issues for the warehouse and the BI queries (say, your report users add in a few filters on customer attributes that are joined from another dim_customers
table). Imagine you are writing the above code for a retail business with 300 million rows of order line items generated each month — the cost would add up quickly.
Above are some selected patterns that are pretty common, and some potential traps we need to avoid for accuracy and better query performance. I hope to expand on this list in some more posts as well. Happy SQL-ing!
References
- Data Modeling and Why It’s Important - February 24, 2023
- SQL do’s and don’ts - November 3, 2022
- Data modeling pitfalls and where to find them - September 23, 2022