Common pitfalls in ML projects and how to avoid them.

The start of the story…

Your organization has reached maturity in data analytics. A robust and scalable data pipeline was built. A series of dashboards were put to good use by all departments. Self-service analytics was implemented. The CTO decided the time had come to expand data use cases into machine learning and artificial intelligence.

The newly joined data scientist (DS) was excited to get to work. Within just a few weeks, a POC of the first predictive lead scoring model was delivered in the form of a single notebook. Validation results were promising. Stakeholders were eager to launch a new campaign using your team’s bleeding-edge ML model. Things went well only up to this point, before your data team found itself facing a myriad of tough challenges:

  • Actual campaign results were a disappointment, and everyone was frantically looking for the causes:

    • You investigated the work that your beloved DS had done :dancer: solo the whole time, only to realize it was a gigantic mess. Transformations were not aligned with business logic. The train dataset was at the wrong granularity level. The train and validation datasets had overlapping records. Some features had become 90% null since the last time the notebook was run. Some features made no sense, and you couldn’t find any descriptions for them.

    • Stakeholders wanted weekly updates about your team’s black-box ML models: EDA insights, hypotheses, model results. You found yourself repeatedly spending hours copy-pasting charts from Python notebooks into Google Slides.

  • Even if the campaign trials proved your model was accurate enough at segmenting the customer base, you struggled to productionize the ML model. How do you set it to run on a regular basis with the fewest changes to the existing architecture? How do you monitor its results frequently and effortlessly? How do you safeguard it against upstream changes?

  • You realized that data scientists (DSs) and analytics engineers (AEs) were working in silos. DSs were unaware of a bunch of helpful metrics that AEs had curated for the latest KPI report. AEs were repeating, with a few variations, the same transformation code that DSs had written long ago in their own notebooks.

Does this sound familiar to you?

We’ve been in this situation with countless projects in the past, and it took the whole team several brainstorming sessions to figure out the root causes and a systematic solution.

The cause of all the pain…

The diagram below illustrates the legacy workflow that our data team used to follow:

  • For reporting requests from stakeholders, our analytics engineers took care of data ingestion, cleanup, and transformation using the modern data stack and dbt. Data analysts used off-the-shelf metrics from mart models to build BI dashboards or ad-hoc analyses that helped business teams make data-driven decisions.

  • For ML requests, our data scientists swam through a sea of dbt models to find what they needed. They would join a mart model with other intermediate or staging models to build the desired train dataset. The entire ML development was done within a single Python notebook file, and no version control could be applied to such a large .ipynb file.

Feature Engineering

The DS conducted feature engineering in a notebook file to build the data inputs for the predictive model. This data was not physically stored in the data warehouse but rather kept as in-memory intermediate pandas dataframes. This practice has four major implications:

  • :carrot: Lack of tests: Intermediate dataframes do not go through the rigorous testing mechanisms that normal dbt models do (primary key tests, not-null tests, value validation, etc.). This poses a huge risk to data quality and prediction accuracy (see the sketch after this list).

  • :umbrella: WET code: Transformations used in the feature engineering step cannot be reused for other ML models or BI requests. This goes against the DRY principle, which aims to reduce code repetition.

  • :orange_book: Lack of documentation: No data catalog or descriptions of the features used in the model are maintained. This hampers collaboration within the team and the hand-off process when the model owner changes positions.

  • :tram: Lack of guardrails against impactful upstream changes: Since no DAG is available, AEs cannot foresee the impact that the changes they are about to make to upstream tables will have on the ML models. In the same way, DSs are not informed in a timely manner of changes in the inputs (aka data drift) that can degrade the predictive models.
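For illustration, here is a minimal sketch, in plain pandas, of the kind of checks dbt runs for free on materialized models and that our in-memory feature dataframes never went through. The column names customer_id and lead_score_bucket are hypothetical, used only to mirror the primary key, not-null, and accepted-values tests mentioned above:

```python
import pandas as pd

def validate_features(df: pd.DataFrame) -> None:
    """Basic data-quality checks mirroring dbt's built-in tests.
    customer_id and lead_score_bucket are hypothetical column names."""
    # Primary key test: exactly one row per customer
    assert df["customer_id"].is_unique, "duplicate customer_id values"

    # Not-null test on the key
    assert df["customer_id"].notna().all(), "null customer_id values"

    # Accepted-values test: the feature must fall within a known set
    allowed = {"low", "medium", "high"}
    assert set(df["lead_score_bucket"].dropna().unique()) <= allowed, \
        "unexpected lead_score_bucket values"

    # Guardrail against features that silently became mostly null
    null_ratio = df.isna().mean()
    too_sparse = null_ratio[null_ratio > 0.9]
    assert too_sparse.empty, f"features more than 90% null: {list(too_sparse.index)}"
```

In a dbt model, these checks live in the project as declarative tests and run on every build, instead of depending on someone remembering to execute a notebook cell.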

Model evaluation

The team lacked continuous monitoring of model evaluation metrics like accuracy, F1-score, recall, and precision. The model was evaluated only once, before it was officially deployed to production.

Over time, many of the model’s assumptions no longer hold true due to changes in the macro/micro environment or in user behavior. This causes a well-known issue called ‘model drift’. A system that alerts on model drift and triggers corrective actions, such as re-training the model, is essential to prevent erroneous predictions from cascading into erroneous business actions.
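As a rough illustration (the function name and threshold are ours, not a prescribed implementation), a scheduled evaluation job can be as simple as recomputing the core metrics on the latest labeled data and flagging a re-train whenever they fall below an agreed floor:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical floor agreed with stakeholders; tune per use case.
F1_FLOOR = 0.70

def evaluate_and_alert(y_true, y_pred) -> dict:
    """Recompute core classification metrics on fresh labeled data and
    signal whether the model has drifted below the agreed floor."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    metrics["needs_retraining"] = metrics["f1"] < F1_FLOOR
    return metrics

# Run this on every scheduled scoring cycle and alert the team
# (Slack, email, etc.) whenever needs_retraining is True.
```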

Orchestration

We used dbt Cloud to orchestrate dbt models and build production tables in the data warehouse. However, dbt Cloud cannot orchestrate the .ipynb files where we kept the ML models. This means we would need to either switch to a different orchestration tool that accommodates both file types (.sql and .ipynb), or separate the ML pipeline from the dbt pipeline. Both approaches are cumbersome, costly, and demand significant engineering resources.

Visualization

No BI tool was used to communicate ML outputs to stakeholders. Our main medium for sharing information with non-technical audiences was slides, which was clearly very time-consuming. Whenever the model was re-run or the data was updated, we had to manually copy-paste the output into the slides. We did try notebooks with descriptions in text cells as an alternative, but it didn’t work out well since stakeholders were not fond of the unhideable code cells.

The end: happily ever after!

Once the issues were identified, it was much easier to design a solution that enables all data roles to collaborate efficiently, improves time to market, and boosts our models’ accuracy.

Our solution is centered around four important changes:

  • :snake: Utilize dbt Python models: This enhances data governance for ML models by leveraging powerful dbt features like data tests, data documentation, and the DAG. dbt Python models also allow us to use dbt Cloud to orchestrate both the data transformation and ML workflows.

  • :bricks: Modularize ML models: Thanks to dbt Python models, we can break a single have-it-all Python notebook into separate steps, enabling code reuse and easier model productionization (see the sketch after this list). We can also put the dbt Python files under version control, which is otherwise impractical for a notebook file bloated with cell outputs.

  • :house: Do feature engineering in SQL data models: Both AEs and DSs use the same language and the same place to curate and store metrics for different use cases. This creates a single source of truth for both reporting and ML models, and eliminates duplicate effort in our data team.

  • :bar_chart: Adopt HEX as our BI platform for ML projects: HEX solves our pain points in the last mile of ML development thanks to the following features:

    • no-code/low-code data visualization that saves engineering effort

    • customizable data visualization with Python/R

    • a user-friendly UI that hides sophisticated code behind the scenes
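To make the first two points concrete, here is a minimal sketch of a dbt Python model, assuming a warehouse adapter that supports them (e.g. Snowflake with Snowpark). The model name lead_scoring, the upstream feature model customer_features, and the column names are all hypothetical. The model pulls a tested SQL feature model via dbt.ref(), trains a simple classifier, and returns the scored customers as a regular dbt model that dbt Cloud can schedule, test, and document like any other:

```python
# models/ml/lead_scoring.py -- hypothetical dbt Python model
import pandas as pd
from sklearn.linear_model import LogisticRegression

def model(dbt, session):
    # Materialize as a table and declare the packages the warehouse needs
    dbt.config(materialized="table", packages=["pandas", "scikit-learn"])

    # Upstream SQL feature model: tested, documented, and part of the DAG
    features = dbt.ref("customer_features").to_pandas()

    # Hypothetical columns: CONVERTED is the label, CUSTOMER_ID the key
    label_col = "CONVERTED"
    feature_cols = [c for c in features.columns if c not in ("CUSTOMER_ID", label_col)]

    # Train on labeled rows only
    train = features[features[label_col].notna()]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train[feature_cols], train[label_col].astype(int))

    # Score every customer; the returned dataframe becomes the model's table
    scored = features[["CUSTOMER_ID"]].copy()
    scored["LEAD_SCORE"] = clf.predict_proba(features[feature_cols])[:, 1]
    return scored
```

Because the scored output is just another dbt model, the same not-null and accepted-values tests used for the feature layer can be attached to it, and dbt Cloud runs the whole DAG, SQL and Python alike, on one schedule.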

 

This exercise is a perfect demonstration of the workflow I proposed above.

Na Nguyen Thi