Maintaining the Hash Diff (Data Vault Friday)

The Problem

Our question comes from a project where the source system occasionally delivers new columns for existing tables. When these columns are added to a satellite, the hash difference (hashdiff) calculation changes. As a result, new deltas are generated for all business keys during the next load—even if the actual data hasn’t changed. The manual recalculation of hashdiffs for historical records is time-consuming and prone to errors. Can this be avoided?



Understanding Hashdiff Changes

The hashdiff is a critical component in a Data Vault model, used to detect changes in descriptive attributes. Adding a new column changes the hashdiff logic, potentially creating unnecessary deltas, which consume additional storage and complicate data integrity checks. Let’s break this down:

  • When a new column is introduced, historical records often have NULL values for that column.
  • The updated hashdiff logic incorporates the new column, even if its value doesn’t contribute to meaningful changes.
  • This can result in false positives—new records that aren’t genuinely different, as the sketch below illustrates.
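To make the effect concrete, here is a minimal Python sketch of a typical hashdiff calculation (delimiter-separated, trimmed and upper-cased attributes hashed with MD5; the exact conventions vary by project, and the attribute values are invented for illustration). Appending a new, still-empty column to the naive calculation changes the hash for every historical record:

```python
import hashlib

DELIMITER = ";"

def hashdiff(values):
    # Render NULLs as empty strings, trim and upper-case, then concatenate and hash.
    normalized = [(v or "").strip().upper() for v in values]
    return hashlib.md5(DELIMITER.join(normalized).encode("utf-8")).hexdigest()

# Hashdiff before the schema change: two descriptive attributes.
before = hashdiff(["Example Corp", "1 Main Street"])

# Hashdiff after the change: the new column is NULL for historical rows,
# but the naive calculation still appends a delimiter for it.
after = hashdiff(["Example Corp", "1 Main Street", None])

print(before == after)  # False: every business key would produce a false delta
```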

Potential Solutions

There are several strategies to handle this scenario, each with varying levels of manual effort and maintenance:

1. Recalculating Hashdiffs Manually

One approach is to manually recalculate the hashdiff for all existing records. While effective, this method requires significant effort and is not scalable for large datasets. Additionally, updating historical records can disrupt the auditability of your Data Vault.

2. Minimizing Updates with Targeted Recalculation

A more focused strategy is to update only the current records in the satellite (those with an open-ended load date). These records are actively used for comparisons and would benefit most from updated hashdiffs. While this reduces the number of updates, it still involves manual intervention.
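As a sketch of what that targeted update could look like, the following Python snippet recomputes the hashdiff only for rows whose load end date is still open. It assumes an end-dated satellite held in memory; in a real warehouse this would be a set-based UPDATE, and all column names are placeholders:

```python
import hashlib

DELIMITER = ";"

def hashdiff(row, descriptive_columns):
    # Rebuild the hashdiff from the row's descriptive attributes (NULL -> empty string).
    parts = [(row.get(col) or "").strip().upper() for col in descriptive_columns]
    return hashlib.md5(DELIMITER.join(parts).encode("utf-8")).hexdigest()

def recalculate_current_records(satellite_rows, descriptive_columns):
    # Only current records (open-ended load date) are compared against the next
    # delta load, so only they need the new hashdiff; history stays untouched.
    for row in satellite_rows:
        if row.get("load_end_date") is None:
            row["hashdiff"] = hashdiff(row, descriptive_columns)
    return satellite_rows
```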

3. Ensuring Hashdiff Consistency Automatically

The most efficient solution is to design the hashdiff calculation to remain consistent, even when structural changes occur:

  • **Add Columns Only at the End:** Ensure new columns are appended to the end of the table structure.
  • **Ignore Trailing Nulls:** Use a function like RTRIM to remove trailing delimiters caused by NULL values. This keeps the hashdiff consistent when new columns are empty for historical records.

This approach eliminates the need for manual updates, provided that all structural changes adhere to these guidelines.
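A minimal sketch of that idea, assuming the same delimiter-based hashdiff as above: trailing delimiters produced by NULL values are stripped before hashing (Python's rstrip stands in for the SQL RTRIM mentioned in the guideline), so columns appended at the end leave the hash unchanged as long as they are empty:

```python
import hashlib

DELIMITER = ";"

def hashdiff_trim_trailing(values):
    # Render NULLs as empty strings, concatenate, then strip the trailing
    # delimiters that empty new columns leave behind before hashing.
    normalized = [(v or "").strip().upper() for v in values]
    payload = DELIMITER.join(normalized).rstrip(DELIMITER)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Old structure vs. new structure with a NULL value in the appended column:
# the hashdiff stays identical, so no false delta is generated.
assert hashdiff_trim_trailing(["Example Corp", "1 Main Street"]) == \
       hashdiff_trim_trailing(["Example Corp", "1 Main Street", None])
```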

Practical Example

Consider a satellite attached to a company hub that stores the company’s name and address as descriptive attributes. Initially, the hashdiff calculation includes only these two columns. When a new column, postal code, is added:

  • Historical records will have NULL values for postal code.
  • Using the RTRIM function ensures that the new column does not affect the hashdiff for these records.

This prevents unnecessary deltas, saving storage space and reducing maintenance overhead.
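Putting it together, a sketch of the delta check for this example (the staging row delivers no postal code yet, the hashdiff uses the trailing-delimiter trim shown above, and all names and values are illustrative):

```python
import hashlib

DELIMITER = ";"

def hashdiff(values):
    # Same convention as above: NULL -> empty string, trim trailing delimiters, hash.
    normalized = [(v or "").strip().upper() for v in values]
    return hashlib.md5(DELIMITER.join(normalized).rstrip(DELIMITER).encode("utf-8")).hexdigest()

# Current satellite record, loaded before the postal code column existed.
satellite_hashdiff = hashdiff(["Example Corp", "1 Main Street"])

# Incoming staging record after the schema change; the postal code is still NULL.
stage_hashdiff = hashdiff(["Example Corp", "1 Main Street", None])

# Standard satellite load: insert a new row only when the hashdiffs differ.
if stage_hashdiff != satellite_hashdiff:
    print("change detected: load a new delta")
else:
    print("no change detected: no unnecessary delta")  # this branch runs
```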

Handling Duplicates

Another question we received involved handling hard duplicates—records that are identical in every aspect, including hashdiff values. The recommended approach is to:

  • Move such duplicates into an Error Mart for auditability.
  • Fix pipeline issues if duplicates are caused by ingestion errors.
  • For soft duplicates (e.g., intraday changes), make the records unique by adding microseconds derived from a sequence ID to the load timestamp, as sketched below.
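For the soft-duplicate case, here is a small Python sketch of that timestamp manipulation; the sequence_id column and the in-memory rows are assumptions for illustration, and a real loader would apply the same offset logic set-based during staging:

```python
from datetime import datetime, timedelta

def spread_intraday_duplicates(rows):
    # Add the per-batch sequence id as microseconds to the load timestamp so
    # each intraday change gets its own unique point in time in the satellite.
    for row in rows:
        row["load_date"] += timedelta(microseconds=row["sequence_id"])
    return rows

batch = [
    {"sequence_id": 1, "load_date": datetime(2024, 5, 1, 8, 0, 0), "company_name": "Example Corp"},
    {"sequence_id": 2, "load_date": datetime(2024, 5, 1, 8, 0, 0), "company_name": "Example Corporation"},
]

for row in spread_intraday_duplicates(batch):
    print(row["load_date"])  # 08:00:00.000001 and 08:00:00.000002
```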

Conclusion

By adopting a thoughtful approach to hashdiff calculation, you can minimize manual maintenance, ensure data consistency, and optimize storage in your Data Vault model. Whether you choose to recalculate selectively or implement hashdiff logic that handles changes automatically, the goal is the same: maintain the integrity of your data warehouse while reducing unnecessary effort.


Meet the Speaker

Julian Brunner
Senior Consultant

Julian Brunner works as a Senior Consultant at Scalefree and studied Business Informatics and Business Administration. His main focus is on Business Intelligence, Data Warehousing, and Data Vault 2.0. As a certified Data Vault 2.0 Practitioner, he has over five years of experience in developing data platforms, especially with the Data Vault 2.0 methodology, and has successfully consulted for customers in sectors such as banking and manufacturing.
