The Problem
Our question comes from a project where the source system occasionally delivers new columns for existing tables. When these columns are added to a satellite, the hash difference (hashdiff) calculation changes. As a result, new deltas are generated for all business keys during the next load—even if the actual data hasn’t changed. The manual recalculation of hashdiffs for historical records is time-consuming and prone to errors. Can this be avoided?
Understanding Hashdiff Changes
The hashdiff is a critical component in a Data Vault model, used to detect changes in descriptive attributes. Adding a new column changes the hashdiff logic, potentially creating unnecessary deltas, which consume additional storage and complicate data integrity checks. Let’s break this down:
- When a new column is introduced, historical records often have `NULL` values for that column.
- The updated hashdiff logic incorporates the new column, even if its value doesn’t contribute to meaningful changes.
- This can result in false positives: new records that aren’t genuinely different.
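To make the effect concrete, here is a minimal sketch of the two hashdiff expressions side by side. It assumes a PostgreSQL- or Snowflake-style dialect and purely illustrative names (`stg_example`, `attribute_1`, `attribute_2`, `new_attribute`); your actual hashdiff standard may differ in delimiter, casing, or hash function.

```sql
-- Illustrative only: why an unchanged row hashes differently after a column is added
SELECT
    -- hashdiff as calculated before the new column arrived
    MD5(UPPER(
        COALESCE(TRIM(attribute_1), '') || '|' ||
        COALESCE(TRIM(attribute_2), '')
    )) AS hashdiff_old,
    -- hashdiff after the new column is appended; NULL becomes an empty string
    -- and leaves a trailing delimiter, so the hash changes although the data did not
    MD5(UPPER(
        COALESCE(TRIM(attribute_1), '') || '|' ||
        COALESCE(TRIM(attribute_2), '') || '|' ||
        COALESCE(TRIM(new_attribute), '')
    )) AS hashdiff_new
FROM stg_example;
```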
Potential Solutions
There are several strategies to handle this scenario, each with varying levels of manual effort and maintenance:
1. Recalculating Hashdiffs Manually
One approach is to manually recalculate the hashdiff for all existing records. While effective, this method requires significant effort and is not scalable for large datasets. Additionally, updating historical records can disrupt the auditability of your Data Vault.
2. Minimizing Updates with Targeted Recalculation
A more focused strategy is to update only the current records in the satellite (those with an open-ended load date). These records are actively used for comparisons and would benefit most from updated hashdiffs. While this reduces the number of updates, it still involves manual intervention.
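As a sketch of what such a targeted update could look like (the table and column names `sat_example`, `hashdiff`, and `load_end_date` are assumptions, and the dialect is PostgreSQL-style): the statement below recalculates the hashdiff with the new column included, but only for open-ended records. Dropping the `WHERE` clause would turn it into the full recalculation from option 1.

```sql
-- Recalculate the hashdiff only for the current (open-ended) satellite records
UPDATE sat_example
SET hashdiff = MD5(UPPER(
        COALESCE(TRIM(attribute_1), '')   || '|' ||
        COALESCE(TRIM(attribute_2), '')   || '|' ||
        COALESCE(TRIM(new_attribute), '')
    ))
WHERE load_end_date IS NULL;  -- historical records keep their original hashdiff
```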
3. Ensuring Hashdiff Consistency Automatically
The most efficient solution is to design the hashdiff calculation to remain consistent, even when structural changes occur:
- **Add Columns Only at the End:** Ensure new columns are appended to the end of the table structure.
- **Ignore Trailing Nulls:** Use a function like `RTRIM` to remove trailing delimiters caused by `NULL` values. This keeps the hashdiff consistent when new columns are empty for historical records.
This approach eliminates the need for manual updates, provided that all structural changes adhere to these guidelines.
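As an illustration of this pattern rather than a prescribed implementation, the expression below appends the new column last, replaces `NULL` with an empty string, and trims the trailing delimiters before hashing. It assumes a dialect where `RTRIM` accepts a set of trim characters (for example PostgreSQL or Snowflake); table and column names are again hypothetical.

```sql
-- Structure-tolerant hashdiff: trailing delimiters caused by NULLs are trimmed,
-- so historical rows keep producing the same hash as before the column was added
SELECT
    MD5(
        RTRIM(
            UPPER(
                COALESCE(TRIM(attribute_1), '')   || '|' ||
                COALESCE(TRIM(attribute_2), '')   || '|' ||
                COALESCE(TRIM(new_attribute), '')  -- new column, appended last
            ),
            '|'  -- remove the delimiters left behind by trailing NULL columns
        )
    ) AS hashdiff
FROM stg_example;
```

Because only trailing delimiters are removed, records where the new column actually carries a value still produce a different hash and are detected as genuine deltas.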
Practical Example
Consider a satellite linked to a company hub, containing records for a company’s name and address. Initially, the hashdiff calculation includes only the company name and address. When a new column, postal code, is added:
- Historical records will have `NULL` values for postal code.
- Using the `RTRIM` function ensures that the new column does not affect the hashdiff for these records.
This prevents unnecessary deltas, saving storage space and reducing maintenance overhead.
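A quick way to verify this with made-up values (PostgreSQL-style syntax): the hash of the original two-attribute string is identical to the hash of the extended string once the trailing delimiter produced by the `NULL` postal code has been trimmed.

```sql
-- Both expressions return the same hash, so no false delta is generated
SELECT
    MD5(UPPER('Example Corp' || '|' || '1 Main Street'))
        AS hashdiff_before_new_column,
    MD5(RTRIM(UPPER('Example Corp' || '|' || '1 Main Street' || '|' || ''), '|'))
        AS hashdiff_with_null_postal_code;
```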
Handling Duplicates
Another question we received involved handling hard duplicates—records that are identical in every aspect, including hashdiff values. The recommended approach is to:
- Move such duplicates into an Error Mart for auditability.
- Fix pipeline issues if duplicates are caused by ingestion errors.
- For soft duplicates (e.g., intraday changes), manipulate the load timestamp by adding microseconds based on sequence IDs to ensure unique records.
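For the intraday case, the timestamp adjustment could be sketched as follows, assuming a PostgreSQL-style dialect, a hypothetical staging table `stg_example`, and a `sequence_id` column that reflects arrival order (none of these names come from the original question):

```sql
-- Make intraday (soft) duplicates unique by offsetting the load timestamp
-- with microseconds derived from the arrival order within each key and timestamp
SELECT
    hub_hashkey,
    load_date
      + (ROW_NUMBER() OVER (
             PARTITION BY hub_hashkey, load_date
             ORDER BY sequence_id
         ) - 1) * INTERVAL '1 microsecond' AS load_date_adjusted,
    hashdiff,
    attribute_1,
    attribute_2
FROM stg_example;
```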
Conclusion
By adopting a thoughtful approach to hashdiff calculation, you can minimize manual maintenance, ensure data consistency, and optimize storage in your Data Vault model. Whether you choose to recalculate selectively or implement hashdiff logic that handles changes automatically, the goal is the same: maintain the integrity of your data warehouse while reducing unnecessary effort.
Watch the Video
Meet the Speaker
Julian Brunner
Senior Consultant
Julian Brunner works as a Senior Consultant at Scalefree and studied Business Informatics and Business Administration. His focus areas are Business Intelligence, Data Warehousing, and Data Vault 2.0. As a certified Data Vault 2.0 Practitioner, he has more than five years of experience in building data platforms, particularly with the Data Vault 2.0 methodology. He has successfully advised customers from various sectors such as banking and manufacturing.