In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.



Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table for storing financial transactions, which includes over a billion rows. Each record in this table represents a transaction and includes attributes such as:

  • TransactionID: A unique identifier for each transaction.
  • AccountSegment: A business key identifying the account involved.
  • AccountingDocument: Another business key associated with the document related to the transaction.
  • AccountingPeriod: A reference key indicating the period of the transaction.
  • DebitAmount and CreditAmount: The monetary amounts for each transaction.
  • TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, the combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also identifies each entry uniquely. This setup raises questions about how best to organize, hash, and manage these records within a Data Vault model, specifically in terms of defining the Unit of Work and deciding which values should go into the link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys must be grouped together in a single link so that the relationship is preserved as it exists in the source. In this context, we’re focusing on how these keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.
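
To make that structure concrete, here is a minimal sketch in Python of what a single row in such a non-historized link could look like, keeping all four attributes at the source’s original granularity. The field names, the load metadata columns, and the choice to carry the amounts directly on the link are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal

@dataclass(frozen=True)
class GlTransactionLink:
    """One row of a non-historized link for a GL transaction,
    kept at the source's original granularity (illustrative names)."""
    link_hash_key: str            # hash over all referenced keys (see hash sketch below)
    account_segment_hk: str       # hash key referencing the AccountSegment hub
    accounting_document_hk: str   # hash key referencing the AccountingDocument hub
    accounting_period_hk: str     # hash key referencing the AccountingPeriod hub
    transaction_id: str           # source transaction identifier
    transaction_timestamp: datetime
    debit_amount: Decimal         # descriptive payload stored directly on the link
    credit_amount: Decimal        # (a common pattern for non-historized links)
    load_date: datetime           # standard Data Vault load metadata
    record_source: str
```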

Testing the Structure of the Unit of Work

One way to verify whether we’re applying the Unit of Work correctly is to check whether the data model lets us reconstruct the original data source without losing records or attributes. If we can reconstruct the data accurately using the non-historized link—and potentially a Satellite table later—then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.
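
As a quick illustration of this test, one could check that the chosen key combination identifies every source row exactly once, so that no records would collapse or duplicate when loading the link. A minimal sketch, assuming the source rows are available as Python dictionaries keyed by the column names from the example:

```python
from collections import Counter

# Illustrative key combination from the GL example above
KEY_COLUMNS = ("AccountSegment", "AccountingDocument",
               "AccountingPeriod", "TransactionTimestamp")

def unit_of_work_preserved(source_rows):
    """Return True if the key combination identifies every source row
    exactly once, i.e. the link keeps the original granularity."""
    key_counts = Counter(
        tuple(row[col] for col in KEY_COLUMNS) for row in source_rows
    )
    duplicates = {key: n for key, n in key_counts.items() if n > 1}
    if duplicates:
        print(f"{len(duplicates)} key combinations occur more than once "
              "- granularity would be lost")
        return False
    return True
```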

Choosing What to Include in the Link Hash

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of the referenced hubs. In this case, the question is whether to hash only the AccountSegment, AccountingDocument, and AccountingPeriod, or whether to also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

  1. Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
  2. Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
  3. Negligible Performance Impact: While adding an extra attribute to the hash computation slightly increases the length of the string being hashed, the performance impact is minimal. In large-scale data processing, this small change typically does not cause significant slowdowns.
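
To illustrate the recommendation, the link hash can be built by concatenating all referenced keys, including the TransactionID, with a delimiter and hashing the result. The sketch below uses MD5, a semicolon delimiter, and simple trim/upper-case normalization purely as assumptions; the actual hash function, delimiter, and normalization rules should follow your hashing standard or automation tool.

```python
import hashlib

DELIMITER = ";"  # assumed delimiter; use whatever your hashing standard prescribes

def link_hash_key(account_segment, accounting_document,
                  accounting_period, transaction_id):
    """Compute the link hash over all referenced keys, including the
    TransactionID, as recommended above (illustrative normalization)."""
    parts = [account_segment, accounting_document,
             accounting_period, transaction_id]
    # Typical normalization: cast to string, trim, upper-case, treat NULL as empty
    normalized = ["" if p is None else str(p).strip().upper() for p in parts]
    return hashlib.md5(DELIMITER.join(normalized).encode("utf-8")).hexdigest()

# Example usage with made-up values
print(link_hash_key("SEG-100", "DOC-2024-0001", "2024-03", "TX-000000001"))
```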

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Final Thoughts

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
