Identifying Non-Historized Links in Data Vault
In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.
Background: Understanding the GL Table
Imagine we’re working with a General Ledger (GL) table for storing financial transactions, which includes over a billion rows. Each record in this table represents a transaction and includes attributes such as:
- TransactionID: A unique identifier for each transaction.
- AccountSegment: A business key identifying the account involved.
- AccountingDocument: Another business key associated with the document related to the transaction.
- AccountingPeriod: A reference key indicating the period of the transaction.
- DebitAmount and CreditAmount: The monetary amounts for each transaction.
- TransactionTimestamp: The exact date and time of the transaction.
While the TransactionID alone uniquely identifies each transaction, the combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises the question of how best to organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.
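To make the setup concrete, here is a minimal example record in Python. The values are invented for this sketch and are not taken from the question:

```python
# A single illustrative GL record (values invented for this sketch,
# not taken from any real source system).
sample_transaction = {
    "TransactionID": "TX-000000001",        # unique on its own
    "AccountSegment": "4000-SALES",         # business key -> hub reference
    "AccountingDocument": "DOC-2024-0815",  # business key -> hub reference
    "AccountingPeriod": "2024-08",          # reference key
    "DebitAmount": 0.00,
    "CreditAmount": 1250.00,
    "TransactionTimestamp": "2024-08-15T09:42:17Z",
}

# Two candidate keys identify this row:
#   1. TransactionID alone
#   2. (AccountSegment, AccountingDocument,
#      AccountingPeriod, TransactionTimestamp)
```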
Defining the Unit of Work in Data Vault
The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we're focusing on how these business keys (AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp) should be structured and managed within a non-historized link.
According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.
Testing the Structure of the Unit of Work
One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.
Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.
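This check can be expressed directly as a uniqueness test. The sketch below assumes the record layout from the example above; on a billion-row table you would run the equivalent COUNT(DISTINCT ...) query in the database rather than in Python, but the logic is the same:

```python
# Minimal sketch of a Unit of Work sanity check: the four-part composite key
# must identify every row as precisely as TransactionID does. Field names
# follow the illustrative record above; the loading code is assumed.

def unit_of_work_preserved(rows):
    """Return True if the composite key is as selective as TransactionID."""
    composite_keys = {
        (r["AccountSegment"], r["AccountingDocument"],
         r["AccountingPeriod"], r["TransactionTimestamp"])
        for r in rows
    }
    transaction_ids = {r["TransactionID"] for r in rows}
    # Both key candidates must yield exactly one entry per source row.
    return len(composite_keys) == len(rows) == len(transaction_ids)
```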
Which Values Should Be Hashed in the Link Hash?
The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of the referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or whether we should also include the TransactionID.
Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it is often better to include all relevant attributes in the hash. That means hashing the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod (see the sketch after the list below).
Here's why:
- Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
- Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
- Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.
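As a rough illustration of what "including all four attributes in the hash" means in practice, here is a minimal Python sketch. The delimiter, normalization rules, NULL handling, and choice of MD5 are assumptions based on common Data Vault 2.0 conventions; your standards document or automation tool may prescribe something different:

```python
import hashlib

# Minimal sketch of a link hash key computed over all four attributes.
# Delimiter, casing, NULL handling, and MD5 are assumed conventions.

def link_hash_key(transaction_id, account_segment,
                  accounting_document, accounting_period):
    parts = [transaction_id, account_segment,
             accounting_document, accounting_period]
    # Normalize each key: trim whitespace, upper-case, empty string for NULL.
    normalized = [("" if p is None else str(p).strip().upper()) for p in parts]
    # Concatenate with a delimiter so ("AB", "C") and ("A", "BC") differ.
    return hashlib.md5(";".join(normalized).encode("utf-8")).hexdigest().upper()

print(link_hash_key("TX-000000001", "4000-SALES", "DOC-2024-0815", "2024-08"))
```

Adding TransactionID lengthens the input string slightly, which is exactly the negligible performance cost mentioned above.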
In Practice: Setting Up Non-Historized Links
In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.
For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.
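To show what such a link record might look like, here is one possible shape as a Python dataclass. The column names, the modeling of AccountingPeriod as a hub reference (it could equally be a reference table), and the decision to keep the measures inline rather than in a Satellite are illustrative assumptions, not prescriptions from the video:

```python
from dataclasses import dataclass
from datetime import datetime

# One possible shape of the non-historized link record (assumed layout).

@dataclass(frozen=True)  # non-historized: rows are inserted once, never updated
class GlTransactionLink:
    link_hash_key: str           # hash over all four attributes (see above)
    account_segment_hk: str      # hash key referencing the AccountSegment hub
    accounting_document_hk: str  # hash key referencing the AccountingDocument hub
    accounting_period_hk: str    # hub reference here; could be a reference table
    transaction_id: str          # dependent identifier included in the hash
    debit_amount: float
    credit_amount: float
    transaction_timestamp: datetime
    load_date: datetime          # standard Data Vault metadata
    record_source: str
```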
How to Submit Your Questions
If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.
For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.
Final Thoughts
Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.
With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.
Meet the Speaker
Michael Olschimke
Michael has more than 15 years of experience in information technology. Over the last eight years, he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!