Data modeling pitfalls and where to find them
Written by Ha Le
There are many works related to the data modeling task – technical side and business side of things, below are some important lessons I have learned along the way implementing data projects, as DA and AE. This post aims to draw some conclusions on the common pitfalls in data modeling works, why we (me ) may fall into them, consequences, and takeaways. Hope to gather more ideas from you all too.
So, you are in a data team as a Data Analyst (DA) or an Analytics Engineer (AE). Cool! You would work with business owners on data platform projects. The end goal of such project is often a reliable data pipeline, a data warehouse that meets business needs, and meaningful dashboards that supposedly answer every question and deliver all insights possible about the business… yeah.
To get from the outdated manual & labor-intensive reports to beautiful & wise dashboards (that actually help business make data-driven decisions), there must first be a reliable data pipeline in place, a data warehouse to store and analyze data, well thought-out models, and of course – somebody to build the dashboards. All these works obviously require expertise of a whole data team.
Thus, enter data team, studying business requirements and writing user stories, building data infrastructures and setting up platforms, enthused about developing brand-new, consistent models and dashboards to replace the outdated, time-consuming, labor-intensive reports that business owners have had before to run their business.
All is nice and neat until one day, lo and behold, business owners start to give feedbacks like “hey, the number is wrong” , or “this chart is difficult to understand” , or worst, “this dashboard is unusable” (let’s hope we don’t get there)…
What happened, and why?
There are several things that possibly went wrong, but here are a few that could have gone wrong in the data modeling process (stuff that as DA, AE – we can control), which contributed to those scenarios:
You may have forgotten to reference the old ‘source of truth’.
You may have relied solely on the business owners’ inputs on business logics, without cross-checking the data to actually confirm the logic.
You may have failed to follow the business’s naming conventions / jargons.
You may have failed to cross check your modeling result with source database, and/or business front-end applications that the data originated from.
You may have forgotten to include all relevant stakeholders in the process. Thus one day someone end up using the report and give not-so-constructive feedbacks.
Yeah, weirdly, the list is not technical at all. Apparently a lot of things can go wrong if we fail to acknowledge the business side of data modeling.
Pitfall #1: Not referencing old “source of truth”
What is that??
Before a data team showing up at the door and building the platform and fancy dashboards, there will always be some forms of “reports” (most likely in holy Excel files) that the business owners hold dear to their hearts, not because of its particular reliability or accuracy (no, there are always things they wish to be improved upon, hence the data team existence), but because it has been the ONLY source of “truth”they have had for a long long time.
As a result, bear in mind that business owners will most likely rely on old reports to cross check against the figures on new fancy dashboards you have built, because it’s the only tool they have.
Thus, if you fail to refer to those oldies, chances are there will be many things you have missed in data modeling process, things like:
hidden, undocumented business logics (that nobody talk about, but expect that you automatically ‘get it’), often in the form of some random stored procedures/ SQL scripts of previous report creators.
weird filters that hint at potential data quality problems.
manual adjustments or mapping that business owners added in their reports (instead of using configurations in system that generated such data).
the things that business owners deem important (usually shown in the reports, surely, even if they fail to actually discuss it with you directly in those one-on-one meetings).
There is no way around it, just:
Dive in and study those old reports as much as you can.
Share your findings with the whole data team. It’s important that everyone is on the same page about the state of data quality, what is lacking in terms of data sources, and rules specific to that business.
Actively seek insights from business owners (especially the ones that created those oldie reports – they know best) on whatever weird discoveries you have had (things that are contradictory to their inputs, for instance)
With a bit of detective work, patience, and active back-and-forth discussions with key stakeholders in this discovery stage, you could save a lot of hours (and frustrations) fixing model logics down the line, not to mention creating trust with business owners who will eventually be data consumers of those dashboards you built. This process also helps you with:
gathering great insights on business logics (or illogic, sometimes), business workflows that require improvements, and to recommend better business practices.
understanding which type of dashboards/ charts would be easily relatable for the users, and which (if required) would need some explaining work to bring them on board.
Pitfall #2: Relying solely on the business owners’ inputs on business logics…
…without cross-checking the data to actually confirm the logic.
What is that??
“What do you mean? We can’t trust the business owners’ inputs??”
It’s not about trust-worthiness, it’s mostly a problem of “a person cannot know everything”, even in the case of very key stakeholders. Trust data, use business owners’ input more as a general guide instead of unshakable truths.
Lots of time I’d truly rather just go with the first inputs business owners gave me (to save EDA time), and then later found out something weird in the data that contradicts their inputs. The key takeaway is having all data in your hand to explore would actually give a much clearer, more comprehensive and reliable picture on the state of the business, for instance: whether the business have any data governance policy in place, whether the business is properly utilizing their systems, whether users are adequately trained to use such systems, and strictly related to data modeling work: whether the “confirmed” business logic stands.
Failing to cross check business owners’ inputs against data in EDA process would lead to following:
You may miss important clues in data that hint at an outdated / disputable logic that business owners gave you.
Hence, you would waste a lot of time modeling without actually understanding data as it is (i.e: future frustrations and model fixing hinted).
EDA, EDA, EDA.
Have fun EDA-ing, and in that process: always cross-referencing the inputs you received from business owners. By pitching one against another, you would be closer to unlocking business insights.
Pitfall #3: Ignoring the business’s naming conventions / jargons…
What is that??
In a perfect world there is no complex jargons in each business domains and we DA can live happily ever after.
Yeah, that’s not the case. Each business domain will have their own jargons, particular to their company culture and/or domain, or in many cases, those naming conventions originated from the systems that business use. In the process of data modeling and naming columns, we may run into the temptation of renaming stuff as per our understanding, or just stick with system generated column names. But that’s a mistake.
Well, there are two biggest setbacks with that approach:
You just added extra obstacles to change management process in implementing a data platform project. Business owners would simply not adapt to new way of naming things (it’s already a lot of work adjusting to new technologies / reports)
You make it extra hard for yourself, the data team, as well as business users to reference back to the applications that generated such data (which is also something we have to do quite often in data modeling process).
Always utilize the business ways of naming things to your advantage. Once you can communicate with business owners in their language, things get easier: getting insights (people are more willing to share what they know about the business workflows and practices), increased likelihood to adapt to new dashboards, and easier reference between data models and systems.
Perhaps utilize data dictionary tools to map naming in system and naming in business convention.
Stick to business naming in column names. It’s especially helpful in the case of self-service approach, where non-tech business users want to explore datasets on their own.
Pitfall #4: Not cross-checking model results with data sources
In data modeling, aside from running tests to ensure data quality, one important thing to ensure is the final aggregated results match with data sources.
Not cross-checking model results with data sources is typically the easiest way to warrant the scenario of “hey, the number is wrong” Failing to do this prior to deliverable handover would erode the trust of business owners in your reports (and all the hard work behind it of whole data team).
Below are several lessons I’ve learned along the way:
At each data modeling layer (raw >> base/ staging): always take 1 or 2 records and cross check with the source database, and also the UI. This helps with:
Understand how data is populated and used on the front-end, and also how users typically interact with data, which fields are most important to replicate in your models, which fields can be ignored
Detect any data pipeline problems (null records generated by pipeline tool, not appropriate method of pulling / syncing data, etc.)
At final aggregated modeling layers (usually data marts) and dashboards: do not skip cross-checking with the old reports, and make sure to be able to account for any discrepancies.
Pitfall #5: Not accounting for all the stakeholders
There are several ways that lead to “not-so-constructive feedback” scenarios:
Organizations that have complex structure with overlapping business functions. Thus in project engagement, it’s often the C-level and VP level that are tasked with ‘being on board’ for the data project. But often, the key stakeholders (the ones who actually will use the reports in their daily work) may have very different ideas from those senior management level on certain aspects, like: metric definitions, what is possible, what is important Hence, if failing to include those key users from the start, you may very well end up with “this dashboard is UNUSABLE”.
Each department has their own way of defining a metrics, and the business has never had a therapy session to settle the differences. Hence, data team often becomes the bridge of absorbing and communicating such differences to each side. If failing to get everyone on board on certain basic metric definitions, you may end up with “this is wrong!”.
So, as you can see, the common theme here is “inclusivity”. Failing to include the right business stakeholders in the project, not only will our data modeling be affected, the whole project might be at risk.
While there are certain business aspects that we cannot change, there are always some things we can do to increase the likelihood of data project being a change that is welcome and adopted by the business.
Identify the key stakeholders early on: they can be the future users of the report, the person who created the old reports, team leaders, managers (having their endorsements often help with introducing the change to end users). Don’t just stick with senior management’s ideas about the deliverable expectations. As we gather opinions from all, it’s often the case that separate user stories can be developed for each target user group and we can help bring to their attentions the discrepancy in focus (if any), to close the gap and have more tangible ideas about dashboard contents.
Engage them in discussions/ meetings/ emails: Being actively included in progress reports and development of the dashboards… would help ease the adoption of new technologies for the users. It also helps DA in the training sessions tremendously because main challenges in adopting new BI tools, etc. have already been identified and addressed in those engagements. On the other hand, if anything is going off track (like an outdated budget file, or business policy changes that affect data model logic), would be promptly addressed thanks to including key stakeholders in the loop.
Sometimes, catering for the differences is a good idea. In the cases where metrics definitions just cannot be settled among different departments, it’s probably for a good reason. Though it’s ideal to develop a universal set of metrics that is agreed upon by everybody, it can also be the case that such agreement cannot be reached. In those cases, make sure to be inclusive in the metrics library, and have a clear naming convention for such metrics that are defined differently by each department. That way, (plus good documentation), the business users will not have to figure out how a metric is calculated so much (or give up and claiming it to be wrong), you can just direct them to the documentation link.
Well that is it. Too long.