BI tools choice and its impact on model planning
Sometimes well into the process of creating a dashboard, you may realize “uh oh, expected outcome isn’t compatible with current data model, due to some BI tool limitations”.
This article aims to explore some limits of different BI tools and what they mean for the data model planning stage (that should happen before creating the dashboard, but if not enough thoughts are put into this process, one may end up redoing a model several times to fit a desired outcome). Though data modeling is often an iterative process (as business use case may expands over time, more data is added to the warehouse, etc.) we do want to minimize reworking that result from poor planning.
While each BI tool has its unique strengths and limitations (because their use cases are different), this article is not about pitting one BI tool against another or to say ‘hey you should use this and ditch that’. There should be obvious reasons why your data solution has opted in a certain BI tool and not another, and it often comes down to data culture, technological readiness, and budget limit of the company adopting it. So the goal here is to know how to adapt to the limitations from data model planning step.
BI tools discussed
I have tried developing dashboards on PowerBI, Tableau, Metabase, Holistics, Looker, Google Data Studio. However my experience with Tableau was not as intensive as with the others to have meaningful insights in this matter, so we will discuss about the remaining tools in this article. Would be great to hear more ideas from expert Tableau users related to this topic.
Some typical problems I have run into in the past:
- The model is not denormalized enough to accommodate limited relationship options that a BI tool can support.
- The final visual outcome expected on the dashboard is more complicated than initially planned (business users change their request once in a while), and the data model created is not compatible with that expected outcome due to some limited visualization options of the BI tool.
- Some business KPIs that you thought could be created as measures on the BI tool for a specific dashboard, turn out to be commonly referred to in other use cases/ dashboards as well (assuming this wasn’t known to you at the initial planning stage). And as far as I know, only BI tools such as Looker/ Holistics offer the feature of creating & maintaining business metrics as code (LookML with view, explores and Git integration, Holistics with promised future release of AML and Git integration). So unless you use such BI tools, you end up recreating such metrics sparingly across different reports.
As those problems present itself, I gradually learned that there are some features of BI tool at use that should be taken into account at data model planning stage:
- Ability to create many-to-many relationship.
- Ability to “scale” your joins: not sure if this is a thing. But say your current model relationship is working fine, but one day a new dataset is introduced, a new relationship needs to be created, and when you add it, it creates conflict with current ongoing relationships. Both cannot be maintained active at the same time, sacrifice must be made, and data model tweaking required.
- Feature of building and maintaining centralized metrics as code.
- Availability of certain visualization chart/ transformation feature in BI tool.
Example use case
To easily discuss various scenarios, let’s say you need to create a dashboard that analyzes sales and inventory. One key focus the business has is:
- Where are the sales hot-spots (geographically)?
- Is the current logistics setup (distances between warehouses and sale hot-spots) currently matching up with such sale distribution?
And the data tables you have, with different granularities:
- Orders at level of OrderID, CustomerID (customer attributes include Country, State)
- Inventory at level of Warehouse, Product.
- Warehouse (Warehouse Address, State)
This mean somehow in your analysis, you should be able to form relationship between sales (Orders) and Inventory and Warehouses (via State dimension) to have an overview of the hotspots and dive deeper.
Let’s discuss in details.
For scenario of limited relationship in BI tool
Scenario 1 – unsupported many-to-many relationship
Suppose you are using a BI tool that doesn’t support many-to-many relationship just yet, like Holistics. In the use case above, you simply cannot create relationship directly between Inventory and Orders via State (as there are multiple warehouses in a State, and similar for Orders).
Solution: You have to create a
Location dimension table with State as primary key in either:
- your data mart beforehand, and use that table to form direct 1:n relationship with
- OR, use transform table feature in Holistics to create such derived table from
Warehouses. Luckily, this feature helps you using a derived table as an original independent table just fine without resulting in circular dependency error in further joining/ measure works (if you were to use similar in approach in PowerBI for instance, for some other reasons).
- Whichever source used to create such Location table, must generate all States available in both Inventory and Orders tables for the joins to be effective. The first approach is safe regardless of BI tool used, so you are future-proofing for possibility of switching BI tool as well. Also, in the spirit of encouraging centralized version control, as of now the first approach would be preferable too. Holistics does promise Analytic-as-code and Git integration feature in near future, so by then both approaches are probably equally practical.
Scenario 2 – unsupported cross-drill between Native query and Imported tables
Another scenario you may run into is when you try to create a question in Metabase using Native Query, which does not support forming relationship with imported tables. For Metabase, it’s probably the best route to:
- Really think through on the relationship you want to create among different tables, the kind of relationship required to answer which specific questions, and plan your model around it. This means there isn’t much room for flexibility in your dashboard creation process or spinoff in-depth analysis. But if you know your data well and have strong grip on the business user stories, this makes sense.
- Stick to consistent primary key, secondary key naming across tables (not a Metabase specific thing, but it’s best for Metabase use case because then it can then automatically pick up on such relationship for you).
- Create your tables in data warehouse instead of relying on Native Query as part of a dashboard that pulls data from multiple sources. Native query is best for single ad-hoc, standalone question use case.
For scenario of expanding your joins
If you expect business use case expansion (and you should), there should be some thoughts onto how best to organize the relationship you create among different tables for dashboard work.
But even despite our best effort, sometimes new data is introduced and it’s just inevitable to upgrade the whole relationship model. So it can be disheartening to see error or warning message about conflicting relationship due to such new introduction.
When new model must be introduced, say you were having the Orders and Customers table working in harmony before in your dashboard model, but then you introduce Inventory. Orders have ProductID and WarehouseID, and so does Inventory. It might be tempting to join these tables together on ProductID and WarehouseID for some hot-spot analysis mentioned above. But it might be the case that if you do so, it causes conflict to all other relationship already established. You might run into this problem using Metabase, PowerBI, Holistics, etc. This is not a BI tool specific problem.
Some general rule of thumbs I have learned along the way:
- Always prefer one-to-many relationship. It’s much easier to ensure correct aggregation than many-to-many relationship. Also it helps you avoid implicit relationship among the models that can be automatically detected and created (such as in PowerBI), and somehow result in unexpected result in your report.
- This means utilize dimension tables and use them as filters and dimensions for drill-through across your reports.
- Create surrogate key for joining. Like in our use case: concat(WarehouseID, ProductID), create 1:n relationship with Warehouse and Product tables respectively. This can be done by utilizing ‘creating custom dimension’ or ‘creating column’ feature on your BI tool, or can be done at data modeling layer.
Beware of null values in your concatenate fields. It can happen, and you should resolve such case in concatenated key for correct aggregation results on dashboard.
For centralized business metrics
More BI tools these days are developed with this intention in mind, like Looker’s LookML and Holistics’ AML, which is awesome and allow maintaining single-source-of-truth possible.
But it might be the case your BI tool isn’t one of those, and you still want to at least incorporate version control and single source of truth for your business metrics. It would then make sense to create extra columns in your mart tables for such purpose.
- Stock balance: Sometimes the data source you have may not maintain the stock balance as table and instead just have stock movement in it, so you either need to rely on its ‘views’ to pull stock balance from (if any), or recalculate it yourself in data warehouse. This is a must-maintain-item in data mart, instead of relying on BI tool calculated measure. It’s not best practice because you can’t bring that logic and apply it elsewhere (in another dashboard).
- Running average: Sometimes there is shared needs among different department users to compare their KPI against a common calculated metrics such as running average (of revenue, of order count, etc.). So instead of recalculating it everywhere every time a new report is required, you could just embed it in the data modeling layer with an extra column in mart table, empower business users to drag and drop into their dashboards.
For limited transformation or visualization features
Lacking table transformation
Compared with other BI tools, Google Data Studio lacks the transformation feature that often come in handy in case of transposing rows-columns, or creating derived table, without touching SQL.
In this case, you can utilize Custom Query when integrating Data Studio with BigQuery, and do transformation in your SQL script. A bit sad for version controlling and centralized repo for your codes, but it works.
In this sense, Google Data Studio is a bit similar to Metabase in terms of model planning: you need to know business use case well and plan in advance to avoid abuse of custom queries, losing sight of ‘single source of truth’.
Lacking visualization option
In some BI tools, the option for cohort retention analysis chart or funnel analysis chart is limited, requiring some creative work with data modeling, like adding another table that best work with the visualization option you have (pre-made cohort retention data table then apply pivot table feature, for instance).
So depends on your business needs, you may have to adapt to some limitations of chosen BI tool by planning your data modeling accordingly. Some rules of thumbs that would be safe, regardless of BI tool used:
- Always have dimension tables for all drilling attributes you need from data mart.
- Adopt one-to-many relationship between dimension and fact tables, avoid many-to-many relationship as a future-proof approach (new tables added with new relationship required that can break current model, or switching BI tool that may not support such feature).
- Centralize business metrics as much as possible with codes (either via keeping common measures as column in data mart tables, or using tools with Git integration). Avoid reliance on BI tool to create such measures.
- Establish clear business requirements in advance, having deep insights in the direction business is heading also works in your favor at model planning stage.