SQL do’s and don’ts

Originally posted here: SQL do’s and don’ts

Why bother?

When I first learned SQL, my mindset was mostly “Meh, if it runs, it’s done”. Over time and with some reality checks, I have moved to the opposite end of the spectrum and learned to, for lack of a better word, ‘scrutinize’ SQL queries, for 2 main reasons:

  1. Accuracy (wrong joins, wrong aggregation, etc.)
  2. Performance (imagine a production run that takes hours to complete)

There are financial costs involved, and there are also hidden costs in both scenarios. Each minute spent debugging a problematic query is a minute you cannot take back, and it keeps your team from other meaningful work. Not to mention, impatient data consumers are no fun, and each time this happens, the organization’s trust in the data team dies a little.

So let’s dive in with some typical SQL patterns to watch out for.

Patterns to watch out for

  • SELECT *
  • DISTINCT, frequent misuses, and solutions
  • UNION vs UNION ALL
  • CASE WHEN, when to use and not to use it

There are many more, let’s focus on these first in this post.

SELECT *

❌ Simply put, for performance’s sake, just no. For more on the ‘but why’, give this read¹ a go.

SELECT <only the columns you need> FROM table

The use of DISTINCT

DISTINCT is often misused

Here are some common cases that result in expensive queries, wrong result, or both.

❌ Using incorrect joins that produce duplicated rows, then using DISTINCT to paper over the problem. This looks like: SELECT DISTINCT * FROM table or SELECT DISTINCT [a bunch of columns] FROM table

❌ Mistaking an SCD type 2 table for having “duplicates” and applying DISTINCT to extract “unique” records. This results in data loss and misleading analysis.

❌ Combining SELECT DISTINCT with GROUP BY and aggregation… just because you can. The DISTINCT in this case is redundant and should be removed.

DISTINCT deduplicates records, but it is costly². So it should be used sparingly and with valid reasons (else you may end up with a wrong result). The cases above are not valid uses of DISTINCT. If you find yourself doing any of them, take a second look at your table’s primary key, or review your use of joins (the root cause is often a misunderstanding of the primary keys and the relationships among the tables in your dataset).

❌ A note on hidden “distinct” execution: SELECT a FROM table GROUP BY a. This query returns the same output as SELECT DISTINCT a FROM table, but at the cost of clarity of intent. A reader might easily overlook the trailing GROUP BY clause and treat the query as just another regular SELECT. Unless you need a HAVING clause at the end, or there is some significant performance improvement, there is no valid reason to write it this way.

Valid cases for DISTINCT

There are very few valid use cases for distinct.

✅ You want to find the distinct values of a column

-- this works great
SELECT DISTINCT status FROM sale_orders

-- an alternative: NOT faster, but adds insight to your EDA
SELECT
  status,
  count(*)
FROM sale_orders
GROUP BY 1

✅ You are deduplicating because there are actual duplicates, like identical user records with the same created timestamp and everything. This is rare, and worth investigating the root cause to prevent more records like that.

✅ You are counting the unique values of a column with COUNT(DISTINCT something)
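A quick sketch, reusing the sale_orders table from above with a hypothetical customer_id column:

-- count how many unique customers have placed orders
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM sale_orders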

Beware of using DISTINCT on SCD type 2 tables

In the case of SCD type 2³, it’s important to look at the specific question you are trying to answer, rather than applying a blanket DISTINCT and getting a wrong answer for your analysis (even though, performance-wise, DISTINCT will beat some of the queries below). Performance should not come at the expense of accuracy.

Suppose we have a user table that stores all historical records for each user.

-- This query returns all emails a user has ever used, but it distracts the analysis
-- from the fact that a user has changed their active email address over time:
SELECT DISTINCT user_id, user_email FROM user

-- If you only care about the latest active email of each user, use ROW_NUMBER()
-- and QUALIFY instead. This returns only a SUBSET of your historical records
-- (active email only):
SELECT
  user_id,
  user_email,
  ROW_NUMBER() OVER (
         PARTITION BY user_id 
         ORDER BY created_at DESC) as rn
FROM user
QUALIFY rn = 1

-- Depending on your business use case, sometimes it's best to keep all historical
-- records. This returns the FULL set of records, with an identifying column for
-- the latest record:
SELECT
  user_id,
  user_email,
  ROW_NUMBER() OVER (
         PARTITION BY user_id 
         ORDER BY created_at DESC) = 1 as is_latest_record
FROM user

UNION vs. UNION ALL

Using UNION is similar to using DISTINCT under the hood: it scans all rows in your CTEs, stacks them together, then performs deduplication before returning the final output, whereas UNION ALL skips the deduplication and simply returns all rows. Whenever such distinct sorting⁴ is involved, you want to make sure the use case is valid and there is no better way of doing it.

✅ Instead of UNION, make sure deduplication logic is handled upstream and is valid for the business use case, then use UNION ALL, as in the sketch below.
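A minimal sketch of the difference, assuming hypothetical online_orders and store_orders sources:

-- UNION scans both inputs, stacks them, then deduplicates the combined result (costly)
SELECT order_id, order_value FROM online_orders
UNION
SELECT order_id, order_value FROM store_orders

-- UNION ALL simply stacks the rows; any deduplication is handled upstream, where the logic is explicit
SELECT order_id, order_value FROM online_orders
UNION ALL
SELECT order_id, order_value FROM store_orders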

Abuse of CASE WHEN

I have seen SQL scripts with 1000+ lines of just CASE WHEN... THEN... AND... OR... at worst, though to be fair some of that was due to linting. Digging a bit deeper, folks often use it for:

❌ Mapping endless values

My re-enactment of an exemplary real-world abuse of CASE WHEN is sketched below; the lack of linting is just 10% of the pain when you read these.
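A minimal re-enactment under made-up names (the orders table, raw_channel column, and values are hypothetical; the real thing ran for hundreds of branches):

-- endless value mapping: every new raw value means yet another WHEN branch
SELECT
  order_id,
  CASE
    WHEN raw_channel = 'fb' THEN 'Facebook'
    WHEN raw_channel = 'FB Ads' THEN 'Facebook'
    WHEN raw_channel = 'facebook_ads' THEN 'Facebook'
    WHEN raw_channel = 'goog' THEN 'Google'
    WHEN raw_channel = 'google ads' THEN 'Google'
    -- ...and hundreds more branches...
    ELSE 'Other'
  END AS marketing_channel
FROM orders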

❌ Aggregating at different granularities, where CASE WHEN is used to avoid wrong aggregation in downstream tables or by unaware end-users.

❌ Some other uses that unnecessarily stress your warehouse. They look harmless on small tables, but little wins here and there go a long way as your business (and data) grows.

One example: combining aggregation with CASE WHEN and applying it as a filter condition. Instead, define the aggregation in a CTE and filter with WHERE, as sketched below.
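Roughly what that looks like, with a hypothetical orders table, an order_value column, and an arbitrary threshold; the anti-pattern first, then the CTE + WHERE version:

-- anti-pattern: the aggregate is wrapped in CASE WHEN and repeated as a filter
SELECT
  customer_id,
  SUM(CASE WHEN order_value > 0 THEN order_value ELSE 0 END) AS total_value
FROM orders
GROUP BY 1
HAVING SUM(CASE WHEN order_value > 0 THEN order_value ELSE 0 END) > 1000

-- instead: define the aggregation in a CTE, then filter with WHERE
WITH customer_totals AS (
  SELECT
    customer_id,
    SUM(order_value) AS total_value
  FROM orders
  WHERE order_value > 0
  GROUP BY 1
)
SELECT
  customer_id,
  total_value
FROM customer_totals
WHERE total_value > 1000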

CASE WHEN is a handy construct for branching logic: if this, then that. But imagine applying it to a table with 1,000 rows with just 10 WHEN branches; that is already up to 10,000 condition evaluations for your data warehouse to handle.

In the real world, our tables are rarely just 1,000 records; they grow with time, to millions, at which point that becomes 10,000,000 evaluations. Not to mention that when input data quality isn’t well controlled, you will keep adding more WHEN branches to your query to account for new creative inputs. Behold: the query that runs forever.

This means CASE WHEN should be applied carefully, unless you want to sit through several hours of table transformation and hear end-user complaints about stale dashboards, or go back and forth with warehouse re-sizing on infrastructure that doesn’t auto-scale.

For a deeper dive into the execution process behind CASE WHEN, check out this site⁵.

Some tips for CASE WHEN:

✅ Only apply it when you have a limited, controlled list of branching logic. For instance, you know there will only ever be 3 distinct values of order_status: drafted, ordered & unpaid, and paid. But for some reason your raw data stores the numeric codes 1, 2, 3 for each order status, and to make it human-friendly you want to map them to those 3 readable statuses. Then CASE WHEN it away, like so:
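A sketch, assuming a hypothetical raw_orders table with those numeric status codes:

-- a small, closed set of branches is a fine use of CASE WHEN
SELECT
  order_id,
  CASE order_status
    WHEN 1 THEN 'drafted'
    WHEN 2 THEN 'ordered & unpaid'
    WHEN 3 THEN 'paid'
  END AS order_status_label
FROM raw_orders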

✅ When it’s likely the branching logic will grow. For instance, you have a list of business_partner_email values that you want to group into business_partner_domain or some other categories for reporting purposes. There are a few options:

  • use regex functions if there are exploitable patterns.
  • or leverage business owners’ insights: empower them with a Google Sheet to maintain the category mapping, include it in your daily transformation job, and integrate the logic via JOIN instead of endless CASE WHEN (see the sketch below)
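A sketch of both options; the partners table, the mapping table, and the regex call are illustrative (the exact regex function and syntax vary by warehouse):

-- option 1: exploit the pattern with a regex
SELECT
  business_partner_email,
  REGEXP_EXTRACT(business_partner_email, '@(.+)$') AS business_partner_domain
FROM partners

-- option 2: join to a category mapping maintained by the business owners
--           (e.g. loaded daily from their Google Sheet)
SELECT
  p.business_partner_email,
  m.partner_category
FROM partners AS p
LEFT JOIN partner_category_mapping AS m
  ON p.business_partner_email = m.business_partner_email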

✅ When it’s a matter of aggregating at different granularities, ask yourself “Why?” first instead of marching forward. Most of the time it’s due to a poorly thought-out model.

An example:

  • You have an order table, with order_id as its primary key (PK).
  • You also have an order_line_item table, with order_line_item_id as its PK, representing each purchased item in an order basket. The lowest granularity in this case is order_line_item_id.
  • You want a fact_orders table that stores both the purchased item information and the order information. This suggests pulling from these 2 tables, with order_line_item_id as the PK. Then you also want to include a customer_lifetime_order_value column, which can only be calculated using SUM(order_value) OVER (PARTITION BY customer_id). But then you realize it would duplicate this value on every order line item row. Without thinking it through, and because you want to avoid wrong aggregation downstream, you might end up with something like this:
CASE WHEN ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY some_timestamp) = 1 
   THEN SUM(order_value) OVER(PARTITION BY customer_id)
   ELSE 0
END as customer_lifetime_order_value

While this kind of logic isn’t wrong in this table, there are some inherent risks:

  • an unaware developer of downstream tables might simply average this column to get avg_lifetime_order_value; it will include all the zeros and yield a wrong result. Even aware of those zeros, they would end up working around the problem with another clunky CASE WHEN. Imagine every downstream layer having to apply this to get an accurate result.
  • if you are thinking “Then I would not apply such CASE WHEN logic and would keep the value duplicated on every row” to avoid a wrong average downstream, you face another potential scenario: innocent divisions, like so
SELECT
  customer_id,
  sum(customer_lifetime_order_value) / count(distinct order_id) 
     as avg_order_value
FROM fact_orders
GROUP BY 1

In this case, you end up with a very wrong result, as the sum has already multiplied the lifetime order value several times over.

  • an unaware report user doing the same thing in a visualization tool, even with the best of intentions and documentation.
  • ultimately, why add such unnecessary complications (plus query time)?

An alternative: isolate customer-related attributes into a fact_customers table. This is a better organization of data for different analysis use cases, with clearly defined primary keys for each table and aggregations at that PK level. Not only is it neater, it also helps you avoid performance issues for the warehouse and the BI queries (say, when your report users add a few filters on customer attributes joined from another dim_customers table). Imagine writing the above code for a retail business with 300 million rows of order line items generated each month; the cost would add up quickly.
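A minimal sketch of that alternative, assuming the order table (called orders here to avoid the reserved word) carries customer_id and order_value: aggregate once at the customer grain instead of smuggling the value into the line-item table.

-- fact_customers: one row per customer_id, aggregated once at that grain
SELECT
  customer_id,
  SUM(order_value)         AS customer_lifetime_order_value,
  COUNT(DISTINCT order_id) AS lifetime_order_count
FROM orders
GROUP BY 1

Downstream, avg_order_value becomes a simple division of two columns at the right grain, with no zeros or duplicated values to work around.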

Above are some common patterns and some potential traps we need to avoid for accuracy and better query performance. I hope to expand on this list in future posts. Happy SQL-ing!

References

  1. Redirecting you – Medium
  2. Redirecting you – Medium
  3. https://towardsdatascience.com/data-analysts-primer-to-slowly-changing-dimensions-d087c8327e08
  4. T-SQL SET Operators Part 1: UNION and UNION ALL – {coding}Sight
  5. How CASE WHEN Works
Hana Le