Watch the Video
Reference Data
Welcome to another episode of Data Vault Friday! I’m Michael Olschimke, CEO of Scalefree. Today, I’m joining you from Munich, fresh off the TDWI conference. Despite the change in location, our mission remains the same: to answer your data-related questions every Friday at 11 o’clock Central European Summer Time. Whether your queries are about Data Vault, data mining, cloud computing, MPP database computing, or any data-driven topic, we’re here to help.
Today’s question comes from our online form, and it’s about deriving dimensions from reference data in the raw Data Vault. Specifically, the questioner has several lookup reference tables that they add as Hub or reference tables while creating the raw Data Vault. For example, they have a region table that includes a region code, description, language code, and valid from/to dates.
In this article:
Understanding Degenerate Dimensions
The query mentions a “degenerated dimension.” To clarify, a degenerate dimension is a dimension attribute, such as the region code, included in a fact table without any additional descriptions. This attribute exists within the fact table itself and doesn’t have a separate dimension table.
Building the Model
To illustrate this, let’s start with a basic structure. Imagine you have a non-historized link containing transaction data, and it references Hubs such as Hub Customer and Hub Product. Additionally, you have reference tables, like a region reference table with a region code and associated descriptions. Here’s a simplified model:
- Non-historized Link: Contains transaction data.
- Hub Customer and Hub Product: Reference customer and product data.
- Reference Hub for Region: Contains the region code.
- Reference Satellite for Region: Contains the descriptions, language codes, and valid dates.
This setup allows for capturing changes in reference data, making the model auditable and maintaining historical accuracy.
Creating a Degenerate Dimension
To create a degenerate dimension from the reference data, follow these steps:
- Include the Code in the Fact Table: Add the region code directly into your transaction data (the non-historized link).
- Determine the Required Attributes: Decide if you only need the region code or additional attributes like the region name.
- Create a Fact View: If you only need the region code, simply create a fact view that includes this code.
- Prejoin Additional Attributes: If you need additional attributes, prejoin the reference Hub and Satellite to get the region name or other details based on the timeline of your facts.
Handling Time-Based Data
When dealing with time-based data, it’s essential to identify the correct version of your reference data. If you want the latest description of the region (a Type 1 dimension), you can join the latest entry. For a Type 2 dimension (tracking changes over time), join based on the fact timestamp to match the correct version of the region name.
Performance Considerations
Reference tables typically contain a relatively small amount of data, which allows most joins to be efficient. However, if performance becomes an issue, you can consider creating a Point-in-Time (PIT) table in the business vault. This table can precompute the current description for each region on a daily basis, making joins faster and more efficient.
Conformed Dimensions
If you prefer to use a conformed dimension, convert your reference table into a dimension table. Use the primary key of the reference table (e.g., region code) as the dimension identifier. This approach involves joining the reference Hub and Satellite to create a dimension view that can be used in your fact tables.
Implementation Steps
- Turn Reference Table into Dimension: Join the reference Hub and Satellite to create a dimension view.
- Use Reference Code as Dimension Key: The region code becomes the dimension key.
- Create Fact View: Include the dimension key in your fact view and join the necessary attributes from the dimension view.
- Configure in Dashboard: Set up relationships between your facts and dimensions in your dashboard application for seamless data visualization.
Conclusion
In summary, deriving dimensions from reference data in a Data Vault involves understanding your needs for degenerate or conformed dimensions, handling time-based data appropriately, and ensuring efficient joins. By following these steps, you can create a robust and scalable data model that meets your analytical needs.
Thank you for joining us for this Data Vault Friday session. If you have more questions, submit them at sfr.ee/dvfriday. For additional learning, check out our webinars at Scalefree.to/webinars. If you need answers before next Friday, visit the Data Vault Innovators community we set up with Ignition Data.
Until next time, keep those data questions coming, and we’ll see you next Friday!
Meet the Speaker
Michael Olschimke
Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
Get Updates and Support
Please send inquiries and feature requests to [email protected].
For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.
To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.