Incremental Strategies & How to Use Them
In Part 1, I covered the topic of choosing the appropriate incremental strategy, which significantly impacts the cost and time of each table.
In this blog, I’ll discuss how I implement and test incremental models in the J4J project.
1. Implementing
I will take the fct__jira__issue
model as an example.
1.1 Config Incremental Strategy
The _fivetran_synced
timestamp was skewed because, at that time, we had just switched from Stitch to Fivetran to ingest Jira data. So, in this case, we chose the merge
with clustered strategy.
insert_overwrite
strategy was possible, the config block could be as below:1.2 Choose Timestamp Column and Creat Time Range Filter
-
It is recommended to choose the synced timestamp of the ingestion tool (e.g.,
_fivetran_synced
) for all incremental models. -
Find out the last modified timestamp using
GREATEST()
function:
The final fact model is built from several staging models. If we simply choose the timestamp from one staging model, the fields from other staging models might not be updated. Therefore, we must find the most updated timestamp for each row. In this case, I used GREATEST()
to pick the last modified _fivetran_synced
timestamp.
Read more about GREATEST()
function HERE.
Set up a time range filter to sort out data that need to update:
1.3 (optional) Add build_source
Column
Later on, the build_source
field enables us to assess whether the incremental models have been configured properly. However, it is not a compulsory step.
1.4 Putting It All Together!
2. Testing
In order to debug faster, it is recommended to audit tables from downstream to upstream models (base -> stg -> int -> fct)
To check whether there is a discrepancy between the full-refresh model and incremental model, we should take the following steps:
2.1 Create Full-refresh and Incremental Model
Since the incremental model only cares about recent data, so in J4J, we only compare data in the last 3 days. Therefore, we need to 2 types of model:
(1) completely full-refresh model
(2) model that is built full refresh before N-3 (N = current date) and incremental in the last 3 days and store in custom schema, e.g. dbt_na_incremental_test_mart
To create the latter type:
- Create a custom filter date range macro:
-
Add custom filter in base models, e.g.
base__jira__worklog
model
2.2 Create Audit Tests
Using audit_helper
package to create tests that compare between the full-refresh model and incremental model.
Start with base model, then continue with upstream models
References
Audit_helper in dbt: Bringing data auditing to a higher level | dbt Developer Blog
Understanding dbt Incremental Strategies part 1/2
Understanding dbt Incremental Strategies part 2/2
Two completely different types of dbt incremental models in Bigquery
- Building a dbt project from scratch - January 15, 2024
- dbt Incremental part 2: Implementing & Testing - January 5, 2024
- Inventory Management & Demand Forecasting for E-commerce Sellers + Tutorial - December 18, 2023