PII Business Keys
In modern data architecture, handling Personally Identifiable Information (PII) is a crucial aspect of maintaining data privacy and integrity. One common challenge in Data Vault modeling is determining how to properly load artificial hubs when using PII fields as business keys. In this article, we explore different approaches and best practices to handle this scenario effectively.
Understanding the Loading Process for Artificial Hubs
When an artificial hub is created using a PII-based business key, a key question arises: should we load one UUID per person, or should we generate multiple UUIDs for each version or change? Additionally, how do we manage this in the ETL process?
Solution 1: Using the Technical ID as the Business Key
One approach is to integrate data using the technical ID (e.g., an employee ID from a CRM system) instead of the PII-based business key. With this method, the actual PII stays in a satellite rather than in the hub, so it can be deleted without touching the hub key that links and other satellites depend on.
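As a minimal sketch of this split, assuming a source record with the hypothetical fields `employee_id`, `full_name`, and `email`, the hub row can be built from the technical ID alone while the PII attributes go to a satellite row. The column names and the use of MD5 for the hash key are illustrative assumptions, not a prescribed standard.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(business_key: str) -> str:
    """Hash a normalized business key (MD5 chosen only for illustration)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def split_record(record: dict, record_source: str) -> tuple[dict, dict]:
    """Return (hub_row, satellite_row) for one source record.

    The hub carries only the technical ID; all PII lands in the satellite.
    Field and column names here are hypothetical.
    """
    load_date = datetime.now(timezone.utc)
    hk = hash_key(record["employee_id"])
    hub_row = {
        "hub_employee_hk": hk,
        "employee_id": record["employee_id"],
        "load_date": load_date,
        "record_source": record_source,
    }
    sat_row = {
        "hub_employee_hk": hk,
        "load_date": load_date,
        "record_source": record_source,
        "full_name": record["full_name"],
        "email": record["email"],
    }
    return hub_row, sat_row


hub_row, sat_row = split_record(
    {"employee_id": "E-1001", "full_name": "Jane Doe", "email": "jane@example.com"},
    record_source="crm",
)
```

Because the hub row contains no name or email, removing the satellite row later would leave the integration key intact.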
Solution 2: Storing Both Technical and Business Keys in the Hub
Another option is to load both the technical ID and the PII-based business key into the same hub. A same-as link can then map one key to the other, so records can be identified by either key while the satellites consistently reference the technical ID.
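The mapping could look roughly like the sketch below. It assumes one hub row per key and a same-as link row that treats the technical ID as the master and the PII-based key as the duplicate. The names `load_keys`, `hub_person_hk`, `master_hk`, and `duplicate_hk` are hypothetical.

```python
import hashlib


def hash_key(*parts: str) -> str:
    """Hash one or more normalized key parts (MD5 chosen only for illustration)."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_keys(technical_id: str, pii_key: str) -> tuple[list[dict], dict]:
    """Return the two hub rows and the same-as link row that connects them."""
    master_hk = hash_key(technical_id)
    duplicate_hk = hash_key(pii_key)
    hub_rows = [
        {"hub_person_hk": master_hk, "person_bk": technical_id},
        {"hub_person_hk": duplicate_hk, "person_bk": pii_key},
    ]
    sal_row = {
        "sal_person_hk": hash_key(technical_id, pii_key),
        "master_hk": master_hk,        # satellites would hang off this key
        "duplicate_hk": duplicate_hk,  # the PII-based key resolves to the master
    }
    return hub_rows, sal_row


hub_rows, sal_row = load_keys("E-1001", "jane.doe@example.com")
```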
Solution 3: Separating Technical and Business Keys into Two Hubs
For greater flexibility, technical IDs and business keys can be stored in separate hubs with a linking mechanism. While this approach introduces additional complexity, it provides a structured way to manage mappings between different keys while keeping PII data separate.
Managing UUIDs in the ETL Process
If the source system does not provide a technical ID, an artificial UUID must be generated. This requires maintaining a lookup table in the staging layer to map each business key to a UUID. This mapping must be handled efficiently in the ETL process to ensure consistency across data loads.
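A minimal sketch of such a lookup, assuming an in-memory dictionary stands in for the staging-layer table: each normalized business key receives exactly one UUID, and repeated loads return the same value. The class name `KeyLookup` and its method are hypothetical.

```python
import uuid


class KeyLookup:
    """Maps each business key to a single, stable artificial UUID."""

    def __init__(self) -> None:
        # In a real pipeline this would be a persisted staging table.
        self._mapping: dict[str, str] = {}

    def get_or_create(self, business_key: str) -> str:
        """Return the existing UUID for the key, or generate and store a new one."""
        normalized = business_key.strip().lower()
        if normalized not in self._mapping:
            self._mapping[normalized] = str(uuid.uuid4())
        return self._mapping[normalized]


lookup = KeyLookup()
first_load = lookup.get_or_create("jane.doe@example.com")
second_load = lookup.get_or_create("Jane.Doe@example.com ")  # same person, later load
other_person = lookup.get_or_create("john.roe@example.com")
```

Note that a lookup like this stores the PII-based key itself, so under this assumption it would also need to be covered by any deletion process.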
Handling PII Deletions
When a delete request is received, the PII must be removed while the relationships in the data model are preserved. If the hub is keyed on the technical ID, the direct identifiers can be deleted from the PII satellite while the hub, its links, and the non-identifying descriptive data remain. In a data lake environment, table formats such as Delta Lake or Apache Iceberg support row-level deletes and can help manage these deletions.
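The effect of such a deletion can be sketched with plain Python lists standing in for tables. The table layout, the column `hub_person_hk`, and the function `erase_pii` are illustrative assumptions; a real implementation would issue a delete against the PII satellite in the target platform.

```python
def erase_pii(person_hk: str, pii_satellite: list[dict]) -> list[dict]:
    """Return the PII satellite without any rows for the given hub hash key."""
    return [row for row in pii_satellite if row["hub_person_hk"] != person_hk]


# Hub keyed on the technical ID: contains no PII and is never touched.
hub = [
    {"hub_person_hk": "a1", "technical_id": "E-1001"},
    {"hub_person_hk": "b2", "technical_id": "E-1002"},
]
# Satellite holding the direct identifiers.
sat_pii = [
    {"hub_person_hk": "a1", "full_name": "Jane Doe", "email": "jane@example.com"},
    {"hub_person_hk": "b2", "full_name": "John Roe", "email": "john@example.com"},
]
# Satellite holding non-identifying descriptive data.
sat_contract = [
    {"hub_person_hk": "a1", "department": "Finance"},
    {"hub_person_hk": "b2", "department": "Logistics"},
]

# Delete request for the person behind hash key "a1".
sat_pii = erase_pii("a1", sat_pii)
```

After the call, the hub row and the contract satellite row for `a1` still exist, so relationships and descriptive history survive while the name and email are gone.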
Conclusion
Choosing the right approach for handling PII-based business keys depends on the specific use case and integration requirements. Using a technical ID simplifies integration but may not always be feasible. The same-as link approach provides a balanced solution, while separate hubs offer greater flexibility at the cost of added complexity. Ultimately, a well-structured ETL process is key to ensuring data integrity and compliance with privacy regulations.
Meet the Speaker

Marc Winkelmann
Managing Consultant
Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.