SQL do’s and don’ts
Originally posted here SQL do’s and don’ts
- Why bother?
- Patterns to watch out for
- SELECT *
- The use of DISTINCT
- UNION vs. UNION ALL
- Abuse of CASE WHEN
When I first learned SQL, my mindset was mostly "Meh, if it runs, it's done". Over time, and with some reality checks, I have moved to the opposite end of the spectrum and learned to, for lack of a better word, "scrutinize" SQL queries, for 2 main reasons:
- Accuracy (wrong joins, wrong aggregation, etc.)
- Performance (imagine a production run that takes hours to complete)
There are financial costs involved, and there are also hidden costs in both scenarios. Each minute spent debugging a problematic query is a minute you cannot take back, and it keeps your team from doing other meaningful work. Not to mention that impatient data consumers are no fun, and each time this happens, the organization's trust in the data team dies a little.
So let’s dive in with some typical SQL patterns to watch out for.
Patterns to watch out for
- SELECT *
- DISTINCT, frequent misuses, and solutions
- UNION vs UNION ALL
- CASE WHEN, when to use and not to use it
There are many more, but let's focus on these first in this post.
SELECT *
❌ Simply put, for performance's sake, just no. For more on the "but why", give this read¹ a go.
✅ SELECT <only the columns you need> FROM table
The use of DISTINCT
DISTINCT is often misused
Here are some common cases that result in expensive queries, wrong results, or both.
❌ Using incorrect joins that produce duplicated rows, then applying DISTINCT to hide the problem. This looks like:
SELECT DISTINCT * FROM table
-- or
SELECT DISTINCT [a bunch of columns] FROM table
❌ Mistaking an SCD type 2 table's history for "duplicates" and applying DISTINCT to extract "unique" records. This results in data loss and misleading analysis.
❌ Combining SELECT DISTINCT with GROUP BY and aggregation… just because you can. The DISTINCT clause in this case is redundant and should be removed.
DISTINCT deduplicates records, but it is costly². It should be used sparingly and with valid reasons (otherwise you may end up with wrong results). The cases above are not valid uses of DISTINCT. If you find yourself writing any of them, take a second look at your table's primary key, and review your use of joins (often the root cause is a misunderstanding of the primary keys and of the relationships among the tables in your dataset).
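As an illustration of the first case, the durable fix for join fan-out is to aggregate the many-side down to the grain you need before joining, rather than papering over the duplicates with DISTINCT. A sketch, assuming hypothetical orders and payments tables where one order can have several payment attempts:

```sql
-- ❌ The join fans out: each order row repeats once per payment attempt,
--    and DISTINCT is then used to hide the duplication.
SELECT DISTINCT o.order_id, o.order_value
FROM orders o
JOIN payments p ON p.order_id = o.order_id;

-- ✅ Collapse the many-side to one row per order first, so the join
--    never produces duplicates in the first place.
WITH latest_payment AS (
    SELECT order_id, MAX(paid_at) AS last_paid_at
    FROM payments
    GROUP BY order_id
)
SELECT o.order_id, o.order_value, lp.last_paid_at
FROM orders o
LEFT JOIN latest_payment lp ON lp.order_id = o.order_id;
```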
❌ A note on hidden "distinct" execution: SELECT a FROM table GROUP BY a. This query returns the same output as SELECT DISTINCT a FROM table, but at the cost of clarity of intent. A reader might easily overlook the trailing GROUP BY clause and treat the query as just another regular SELECT. Unless you need a HAVING clause at the end, or there is some significant performance improvement, there is no valid reason to write it this way.
Valid cases for DISTINCT
There are very few valid use cases for distinct.
✅ You want to find the distinct values of a column
-- this works great
SELECT DISTINCT status FROM sale_orders

-- An alternative, NOT faster, but it adds insight to your EDA
SELECT status, COUNT(*) FROM sale_orders GROUP BY 1
✅ You are deduplicating because there are actual duplicates, like identical user records with the same created timestamp and everything else. This is rare, and it is worth investigating the root cause to prevent more records like these.
✅ You are counting the unique values of a column with COUNT(DISTINCT column).
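For the counting case, a minimal sketch (reusing the hypothetical sale_orders table from above):

```sql
-- One number per query; deduplication happens only inside the aggregate.
SELECT COUNT(DISTINCT status) AS distinct_statuses
FROM sale_orders;
```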
Beware of using DISTINCT on SCD type 2 tables
In the case of SCD type 2³, it's important to look at the specific question you are trying to answer, rather than blanket-applying DISTINCT and getting a wrong result in your analysis (even though, performance-wise, DISTINCT will beat some of the queries below). Performance should not come at the expense of accuracy.
Suppose we have a user table that stores all historical records for each user.
-- This query returns all emails a user has ever used, but it distracts the
-- analysis from the fact that a user has changed their active email over time:
SELECT DISTINCT user_id, user_email FROM user

-- If you only care about the latest active email of each user, use ROW_NUMBER()
-- and QUALIFY instead. This returns only a SUBSET of your historical records
-- (the active email only):
SELECT user_id, user_email,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM user
QUALIFY rn = 1

-- Depending on your business use case, sometimes it's best to keep all
-- historical records. This returns the FULL records, with an identifying
-- column for the latest one:
SELECT user_id, user_email,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) = 1 AS is_latest_record
FROM user
UNION vs. UNION ALL
Using UNION is similar to using DISTINCT in the process: it scans all rows in your CTEs, stacks them together, then performs a de-duplication before returning the final output, whereas UNION ALL skips the de-duplication and simply returns all rows. Whenever such a distinct sort⁴ is involved, make sure the use case is valid and that there is no better way of achieving it.
✅ Instead of UNION, make sure the deduplication logic is done upstream and is valid for the business use case, then use UNION ALL.
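A small sketch of the difference, assuming two hypothetical yearly order tables that are already deduplicated at the order grain:

```sql
-- ❌ UNION adds a hidden deduplication pass across the stacked result,
--    which costs a sort/hash step and can silently drop legitimate rows.
SELECT order_id, order_value FROM orders_2023
UNION
SELECT order_id, order_value FROM orders_2024;

-- ✅ With deduplication already guaranteed upstream, UNION ALL simply
--    appends the rows, with no extra pass over the data.
SELECT order_id, order_value FROM orders_2023
UNION ALL
SELECT order_id, order_value FROM orders_2024;
```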
Abuse of CASE WHEN
I have seen SQL scripts with 1,000+ lines of just CASE WHEN... THEN... AND... OR... at worst, though to be fair some of that was due to linting. Digging a bit deeper, folks often use it for:
❌ Mapping endless values
(Figure: my re-enactment of an exemplary real-world abuse of CASE WHEN. The lack of linting is just 10% of the pain when you read these.)
❌ Aggregating at different granularities: CASE WHEN is used to avoid wrong aggregation in downstream tables or by unaware end users.
❌ Other uses that unnecessarily stress your warehouse. They look harmless on small tables, but small wins here and there go a long way as your business (and data) grows. A typical case: combining an aggregation with CASE WHEN and applying it as a filter condition. Instead, define the aggregation in a CTE and filter with WHERE.
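A sketch of that refactor against a hypothetical orders table: compute the aggregate once in a CTE, then filter it with a plain WHERE instead of carrying a CASE WHEN flag around.

```sql
-- ❌ The aggregate is wrapped in CASE WHEN to produce a flag,
--    which downstream queries then have to filter on.
SELECT customer_id,
       CASE WHEN SUM(order_value) > 1000 THEN 1 ELSE 0 END AS is_big_spender
FROM orders
GROUP BY customer_id;

-- ✅ Name the aggregation in a CTE and filter directly.
WITH customer_totals AS (
    SELECT customer_id, SUM(order_value) AS total_value
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total_value
FROM customer_totals
WHERE total_value > 1000;
```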
CASE WHEN is a handy construct for branching logic: if this, then that. But imagine applying it to a table with 1,000 rows and just 10 WHEN branches; that is already 10,000 evaluations for your data warehouse to handle. In the real world, our tables are rarely just 1,000 records; they grow with time to millions, at which point it becomes 10,000,000 evaluations. Not to mention that when input data quality is not controlled well enough, you will keep adding WHEN branches to account for new creative inputs. Behold: the query that runs forever.
This means CASE WHEN should be applied with care, unless you want to sit through hours of table transformation, hear end users complain about stale dashboards, or go back and forth resizing a warehouse on infrastructure that doesn't auto-scale.
For a deeper dive into the execution process behind CASE WHEN, check out this site⁵.
Some tips for CASE WHEN:
✅ Only apply it when you have a limited, controlled list of branches. For instance, you know there will only ever be 3 distinct order statuses: ordered, unpaid, and paid. But for some reason your raw data stores them as the numbers 1, 2, 3, and to make them human-friendly you want to map them to the 3 readable statuses. Then CASE WHEN it away.
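That valid mapping might look like this (the column names, codes, and labels here are hypothetical):

```sql
SELECT order_id,
       CASE order_status_code
            WHEN 1 THEN 'ordered'
            WHEN 2 THEN 'unpaid'
            WHEN 3 THEN 'paid'
       END AS order_status
FROM sale_orders;
```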
✅ When the branching logic is likely to grow, for instance you have a list of business_partner_email values that you want to group into business_partner_domain or some other categories for reporting purposes, there are a few options:
- use regex functions if there are exploitable patterns;
- or leverage the business owners' insights: empower them to maintain the mapping of categories in a Google Sheet, include it in your daily transformation job, and integrate the logic via a JOIN instead of an endless CASE WHEN.
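The second option can be sketched as a join against a small mapping table (the table and column names here are hypothetical; SPLIT_PART is available in, e.g., Postgres and Snowflake):

```sql
-- partner_domain_mapping is synced daily from the business owners' sheet:
-- one row per email_domain, with the category they maintain.
SELECT p.business_partner_email,
       m.category AS business_partner_domain
FROM business_partners p
LEFT JOIN partner_domain_mapping m
       ON m.email_domain = SPLIT_PART(p.business_partner_email, '@', 2);
```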
✅ When it's a matter of aggregating at different granularities, ask yourself "Why?" first instead of marching forward. Most of the time it's due to a poorly thought-out model. Consider this example:
- You have order_id as a primary key (PK).
- You also have order_line_item_id as a PK, representing each purchased item in an order basket. The lowest granularity in this case is order_line_item_id.
- You want a fact_orders table that stores both the line-item information and the order information. This suggests pulling from these 2 tables, with order_line_item_id as the PK. Then somehow you also want to include a customer_lifetime_order_value column, which can only be calculated using SUM(order_value) OVER (PARTITION BY customer_id). But then you realize it would duplicate this value on every order-line row. Without thinking it through, and because you want to avoid wrong aggregation downstream, you might end up with something like this:
CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY some_timestamp) = 1
     THEN SUM(order_value) OVER (PARTITION BY customer_id)
     ELSE 0
END AS customer_lifetime_order_value
While this kind of logic isn’t wrong in this table, there are some inherent risks:
- An unaware developer of a downstream table might simply average this column to get avg_lifetime_order_value; the result will include all the zeros and be wrong. Even aware of all those zeros, they would end up working around the problem with another clunky CASE WHEN. Imagine every downstream layer having to apply this to get an accurate result.
- If you are thinking "then I would not apply such CASE WHEN logic and keep the value duplicated on every row" to avoid the wrong average downstream, you face another potential scenario: innocent-looking divisions, like so:

SELECT customer_id,
       SUM(customer_lifetime_order_value) / COUNT(DISTINCT order_id) AS avg_order_value
FROM fact_orders
GROUP BY 1

In this case, you end up with a very wrong result, because the SUM has already multiplied the lifetime order value several times over.
- Unaware report users will do the same thing in a visualization tool, even with the best of intentions and documentation.
- Ultimately, why add such unnecessary complication (plus query time)?
An alternative: isolate customer-related attributes into a fact_customers table. This is a better organization of data for different analysis use cases, with a clearly defined primary key for each table and aggregations at that PK level. Not only is it neater, it also helps you avoid performance issues for the warehouse and for BI queries (say, when your report users add a few filters on customer attributes joined in from another dim_customers table). Imagine writing the code above for a retail business that generates 300 million rows of order line items each month; the cost would add up quickly.
Above are some selected patterns that are pretty common, along with some potential traps to avoid for accuracy and better query performance. I hope to expand on this list in future posts. Happy SQL-ing!