At the core of every well-functioning Business Intelligence system is a well-functioning Data Warehouse that combines, aggregates and models the data consumed by the business almost instantaneously. A well-designed Data Warehouse (DW) should scale up as needed, be resilient to failures and accommodate any and all data demands (usually from ETL/ELT) that will be thrown at it.
There are typically three kinds of situations (processing spikes) that can put pressure on the Data Warehouse:
- Ingesting data volume spikes
- Internal processing data volume spikes
- Reporting data requests spikes
On top of tolerating varying data volumes, a well-designed DW should hold data that satisfies the six dimensions of data quality. Let’s look at these in greater detail:
|Dimension|Question to ask|
|---|---|
|Accuracy|How closely does the granular data resemble reality?|
|Consistency|Do various pieces of information about the same subject matter match each other?|
|Completeness|Does the data contain all of reality of a given subject matter?|
|Uniqueness|Are there any duplications of the data?|
|Validity|How useful and usable is the information contained in the data for the business?|
|Timeliness|Is the data fresh, and does it cover the reasonably expected recent period of business operations?|
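These dimensions can be checked mechanically during loads. Below is a minimal sketch, in plain Python, of automated checks for three of them (uniqueness, completeness and timeliness); the row structure and field names are illustrative, not part of any specific schema.

```python
from datetime import date, timedelta

# Hypothetical order rows; field names are illustrative only.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0, "order_date": date.today()},
    {"order_id": 2, "customer": "b", "amount": None, "order_date": date.today() - timedelta(days=1)},
    {"order_id": 2, "customer": "b", "amount": 5.0,  "order_date": date.today() - timedelta(days=1)},
]

def quality_report(rows, key="order_id", max_age_days=7):
    ids = [r[key] for r in rows]
    return {
        # Uniqueness: no duplicated business keys.
        "unique": len(ids) == len(set(ids)),
        # Completeness: no missing values in any field.
        "complete": all(v is not None for r in rows for v in r.values()),
        # Timeliness: the newest record is reasonably fresh.
        "timely": max(r["order_date"] for r in rows)
                  >= date.today() - timedelta(days=max_age_days),
    }

report = quality_report(rows)
# This sample fails uniqueness (order_id 2 appears twice) and
# completeness (one amount is None), but passes timeliness.
```

Accuracy, consistency and validity usually need reference data or business rules to verify, so they are harder to reduce to a one-line check.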
The six dimensions of data quality displayed above are critical when designing a DW. This is why our team has a tried-and-tested Data Warehouse design that we can deploy within days. It is flexible enough to accommodate any industry or business, as we have already tested it in:
- Consumer goods
The SageData methodology is based on segregating the detailed data from the Compute Data and Business Data stores. What we have, in a nutshell, is a four-stage data storage design that fits most of the batch data processing done by businesses today.
Stage – this is the first level, where incremental data is loaded. Normally, a business would load the data for the last X days, or all fresh data generated since the last data load.
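An incremental Stage load like this is often driven by a watermark: only records newer than the previous load are pulled. Here is a minimal sketch in plain Python; the `updated_at` column and the row shape are assumptions for illustration.

```python
def load_stage(source_rows, last_loaded_at):
    """Pull only records created or updated since the previous load
    (a watermark-based incremental extract; column names are illustrative)."""
    return [r for r in source_rows if r["updated_at"] > last_loaded_at]

source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
]
# Only the record updated after the watermark lands in Stage.
stage = load_stage(source, last_loaded_at="2024-01-03")
```

In a real pipeline the watermark would be persisted after each successful load, so the next run picks up exactly where the previous one stopped.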
DDS (Detail Data Store) – this level contains all of the historical data for the business. The data in the DDS is expected to be de-duplicated, clean and complete. It is the lowest level of granularity available to the data analysts when they are analysing the business.
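Keeping the DDS de-duplicated typically means merging Stage rows into the history by business key, with the newest version of each record winning. A minimal sketch, assuming a single `id` key and whole-row updates:

```python
def merge_into_dds(dds, stage, key="id"):
    """Upsert Stage rows into the DDS so each business key appears once,
    with the newest version winning (a simple de-duplicating merge)."""
    merged = {r[key]: r for r in dds}   # existing history, keyed by id
    for r in stage:                     # newer rows overwrite older versions
        merged[r[key]] = r
    return sorted(merged.values(), key=lambda r: r[key])

dds = [{"id": 1, "status": "open"}]
stage = [{"id": 1, "status": "closed"}, {"id": 2, "status": "open"}]
dds = merge_into_dds(dds, stage)
# id 1 is updated in place, id 2 is appended; no duplicates remain.
```

In a real warehouse this would be a `MERGE`/upsert statement, but the logic is the same: one row per business key, latest version kept.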
CDS (Compute Data Store) – at the Compute Data Store level, the data from the DDS is logically organised and aggregated across the dimensions that support business analysis. For example, a table containing all users or customers of a business may have CLV/LTV information computed and re-computed daily for each user, so that data analysts can access accurate information quickly. There can be hundreds of metrics computed for a particular dimension and stored at the dimensional level.
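A CDS metric like per-user LTV is just an aggregation of detail-level facts up to a dimension. Below is a minimal sketch; the `user_id` and `net_amount_eur` field names are illustrative, and a real CDS job would read from DDS tables rather than an in-memory list.

```python
from collections import defaultdict

def compute_user_ltv(orders):
    """Aggregate order facts up to the user dimension; the kind of
    pre-computed metric a CDS table would hold for fast lookup."""
    ltv = defaultdict(float)
    for o in orders:
        ltv[o["user_id"]] += o["net_amount_eur"]
    return dict(ltv)

orders = [
    {"user_id": "u1", "net_amount_eur": 40.0},
    {"user_id": "u1", "net_amount_eur": 60.0},
    {"user_id": "u2", "net_amount_eur": 15.0},
]
ltv = compute_user_ltv(orders)  # {"u1": 100.0, "u2": 15.0}
```

Recomputing such metrics daily and storing them against the dimension means analysts never have to re-derive them from raw detail.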
BDS (Business Data Store) – the final layer of this high-level Data Warehouse architecture that we at SageData like to use. This layer contains the data for business users who like to interact with the data themselves, and many reports are driven by it. The Business Data Store does not actually have to reside in the DW: it can also live in the data extracts, OLAP cubes and caches used to visualise the data.
The structure of the data in BDS would be:
- Clean (no test data)
- Combined (with other meaningful data)
- Aggregated (on a day level)
- Truncated (last X months/years)
- Calculated (reflecting the formulas from the KPI Library)
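The five properties above can be sketched as a single build step. Below is a minimal illustration in plain Python, assuming hypothetical `is_test`, `day` and `net_amount_eur` fields and an average-order-value KPI; a production build would of course run as SQL against the CDS.

```python
from collections import defaultdict
from datetime import date

def build_bds(cds_rows, keep_from):
    """Sketch of a BDS build: drop test data, truncate old history,
    aggregate to the day level, and apply a KPI formula."""
    daily = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
    for r in cds_rows:
        if r.get("is_test") or r["day"] < keep_from:   # clean + truncate
            continue
        d = daily[r["day"]]
        d["revenue"] += r["net_amount_eur"]            # aggregate on a day level
        d["orders"] += 1
    # Calculated field, as a KPI Library formula might define it.
    return {day: {**m, "avg_order_value": m["revenue"] / m["orders"]}
            for day, m in daily.items()}

cds_rows = [
    {"day": date(2024, 1, 2), "net_amount_eur": 30.0, "is_test": False},
    {"day": date(2024, 1, 2), "net_amount_eur": 10.0, "is_test": False},
    {"day": date(2024, 1, 2), "net_amount_eur": 99.0, "is_test": True},   # dropped
    {"day": date(2023, 1, 1), "net_amount_eur": 50.0, "is_test": False},  # too old
]
bds = build_bds(cds_rows, keep_from=date(2024, 1, 1))
```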
Below is an example of a simple diagram of a few entities that could live in a BDS. Note that we have CLV, count_good_billings and sales_net_amount_eur pre-computed for the dim_user dimension, while the fact_order table has the number of billings per order pre-computed.
So, why do we like this approach and design?
The DDS, CDS and BDS design separates the functional layers of data processing and allows for maximum resilience to complete or partial operational failures during data processing.
Let’s look at some benefits of this Data Warehouse design:
Fault Tolerance – the separation of the data between CDS and BDS means that when computational errors occur (and they do occur, no matter how thoroughly the code is tested), the business reports remain accurate, because the BDS tables will not be reloaded until the CDS tables on which they depend have been properly computed.
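The fault-tolerance guarantee above boils down to a simple ordering rule in the orchestration: only swap in fresh BDS tables once the CDS step has completed cleanly. A minimal sketch (a real pipeline would use a scheduler such as Airflow; the function names here are illustrative):

```python
def refresh_pipeline(compute_cds, reload_bds):
    """Reload BDS only after CDS computed successfully; on failure the
    previous BDS stays in place, so business reports remain accurate."""
    try:
        cds = compute_cds()
    except Exception as err:
        # CDS failed: do NOT touch BDS, just report the failure.
        return {"status": "cds_failed", "error": str(err), "bds_reloaded": False}
    reload_bds(cds)
    return {"status": "ok", "bds_reloaded": True}

def failing_cds():
    raise ValueError("bad aggregate")

# A failing CDS computation leaves BDS untouched.
result = refresh_pipeline(failing_cds, lambda cds: None)
```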
Faster Reports – BDS data is often aggregated, condensed and cleaned, which makes the reports that run on this data render much faster.
Better Accuracy – this Data Warehouse design also has another benefit: the pre-calculated formula fields reflect the formulas from the KPI Library and therefore ensure consistent accuracy across reports.
Greater Granularity – the DDS schema contains the detailed data about which, let’s be honest, most business stakeholders do not care much. This ocean of data is hidden from the power users with access to the BDS schema, yet available to data analysts for research and analysis.
Better Traceability – any inconsistencies or unusual occurrences in the data can be traced back from BDS, through CDS, to DDS and examined in detail in a structured and orderly fashion by the data analysts. In addition, the Stage data store holds the data from the last data load for easy verification of the latest data.
The SageData team has used this design for years to build some of the most scalable and resilient Data Warehouses, operating under some of the most demanding data loads. Whether you are interested in improving your Attribution Modeling for Marketing, taking the first steps towards becoming a data-driven enterprise, or improving your Business Intelligence infrastructure, our team can help you with any and all data questions.
Thanks for reading and may the data be with you!