Data Pipelines and Transformation

dbt Incremental part 2: Implementing & Testing

Incremental Strategies & How to Use Them

In Part 1, I covered the topic of choosing the appropriate incremental strategy, which significantly impacts the cost and time of each table.

In this blog, I’ll discuss how I implement and test incremental models in the J4J project.

1. Implementing

I will take the fct__jira__issue model as an example.

1.1 Config Incremental Strategy

The _fivetran_synced timestamp was skewed because, at that time, we had just switched from Stitch to Fivetran to ingest Jira data. So, in this case, we chose the merge with clustered strategy.

But if the insert_overwrite strategy was possible, the config block could be as below:

1.2 Choose Timestamp Column and Creat Time Range Filter

It is recommended to choose the synced timestamp of the ingestion tool (e.g., _fivetran_synced) for all incremental models.
Find out the last modified timestamp using GREATEST() function:

The final fact model is built from several staging models. If we simply choose the timestamp from one staging model, the fields from other staging models might not be updated. Therefore, we must find the most updated timestamp for each row. In this case, I used GREATEST() to pick the last modified _fivetran_synced timestamp.

Read more about GREATEST() function HERE.

Set up a time range filter to sort out data that need to update:

1.3 (optional) Add `build_source` Column

Later on, the build_source field enables us to assess whether the incremental models have been configured properly. However, it is not a compulsory step.

1.4 Putting It All Together!

2. Testing

In order to debug faster, it is recommended to audit tables from downstream to upstream models (base -> stg -> int -> fct)

To check whether there is a discrepancy between the full-refresh model and incremental model, we should take the following steps:

2.1 Create Full-refresh and Incremental Model

Since the incremental model only cares about recent data, so in J4J, we only compare data in the last 3 days. Therefore, we need to 2 types of model:

(1) completely full-refresh model

(2) model that is built full refresh before N-3 (N = current date) and incremental in the last 3 days and store in custom schema, e.g. dbt_na_incremental_test_mart

To create the latter type: