Loading SAP CDC Data into a GDPR-Compliant Data Vault
When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.
This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.
CDC Data Modeling in the Data Vault
The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:
1. Load CDC Data in Satellites with Adjusted Patterns
In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:
- Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
- Change Impact Across Satellites: Because the delta check is bypassed, a change to a single attribute writes a new row to every satellite fed by that source record, including satellites whose own attributes did not change. This creates non-delta (redundant) records, but the impact is typically minimal, and the redundant data can be compressed for storage efficiency.
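The difference between the two loading patterns can be sketched as follows. This is a minimal illustration in Python, not the loading code of any particular tool: the names `SatelliteRow`, `hash_diff`, `load_with_delta_check`, and `load_cdc` are hypothetical, and the satellite is modeled as a plain in-memory list.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SatelliteRow:
    hash_key: str        # hash of the business key (hypothetical column name)
    load_date: datetime
    record_source: str
    change_type: str     # "create", "update", or "delete" as reported by CDC
    hash_diff: str       # hash over the descriptive attributes
    attributes: dict


def hash_diff(attributes: dict) -> str:
    """Hash the descriptive attributes in a stable column order."""
    payload = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_with_delta_check(satellite: list, row: SatelliteRow) -> bool:
    """Standard pattern: insert only if the attributes differ from the latest row."""
    latest = None
    for existing in satellite:
        if existing.hash_key == row.hash_key:
            if latest is None or existing.load_date > latest.load_date:
                latest = existing
    if latest is not None and latest.hash_diff == row.hash_diff:
        return False  # unchanged, so nothing is inserted
    satellite.append(row)
    return True


def load_cdc(satellite: list, row: SatelliteRow) -> bool:
    """CDC pattern: the source already reported a change, so insert directly."""
    satellite.append(row)
    return True
```

With `load_cdc`, two consecutive CDC records carrying identical attributes for this satellite both end up as rows. That is the non-delta record described above, accepted in exchange for a simpler load.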
Handling GDPR-Relevant Personal Data
CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:
2. Splitting Satellites Based on Privacy Classifications
For GDPR compliance, split CDC data into multiple satellites:
- Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
- Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
- Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.
Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.
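As a rough illustration of the split, the sketch below routes the attributes of one CDC record into per-class groups, each of which would feed its own satellite. The column names, the `PRIVACY_CLASS` mapping, and the function `split_by_privacy_class` are all hypothetical; a real project would likely keep the classification in its metadata rather than in code.

```python
from collections import defaultdict

# Hypothetical classification of source columns by privacy class.
PRIVACY_CLASS = {
    "first_name": "personal",
    "last_name": "personal",
    "national_id": "personal",
    "customer_group": "non_personal",
    "payment_terms": "non_personal",
}


def split_by_privacy_class(record: dict, default_class: str = "personal") -> dict:
    """Group the attributes of one CDC record by privacy class.

    Columns without a classification fall back to `default_class`. Defaulting
    to the stricter class is an assumption of this sketch, chosen so that an
    unclassified column is never exposed by accident.
    """
    satellites = defaultdict(dict)
    for column, value in record.items():
        target = PRIVACY_CLASS.get(column, default_class)
        satellites[target][column] = value
    return dict(satellites)
```

Each key of the returned dictionary corresponds to one satellite payload, so access rights can be granted per satellite instead of per column.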
Managing Empty Columns in the Data Vault
It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:
- Include Empty Columns for Auditing: To retain full traceability and auditability, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
- Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.
This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.
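One way to decide which columns belong in an "unused data" satellite is to profile a sample of source rows. The following is a minimal sketch under stated assumptions: rows are plain dictionaries, "empty" means `None` or a blank string, and the helper names `is_empty`, `find_always_empty_columns`, and `split_unused` are made up for this example.

```python
def is_empty(value) -> bool:
    """Treat None and blank strings as empty (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def find_always_empty_columns(rows: list) -> set:
    """Return the columns that are empty in every row of the profiled sample."""
    if not rows:
        return set()
    columns = set().union(*(row.keys() for row in rows))
    return {c for c in columns if all(is_empty(row.get(c)) for row in rows)}


def split_unused(record: dict, unused_columns: set) -> tuple:
    """Split one record into (used attributes, unused attributes)."""
    used = {k: v for k, v in record.items() if k not in unused_columns}
    unused = {k: v for k, v in record.items() if k in unused_columns}
    return used, unused
```

Because the unused columns are still loaded, only into a different satellite, a value that appears in one of them later is captured rather than dropped.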
Example Satellite Structure
With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:
Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access
Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access
Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing
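The change type kept as metadata is what lets a downstream query tell a deleted record from a live one. Here is a minimal sketch of reading the current state from such a satellite, assuming each row is a dictionary with the hypothetical keys `hash_key`, `load_date`, and `change_type`. How deletes are interpreted is a project decision; treating the latest `delete` as "no longer current" is an assumption made for this example.

```python
from datetime import datetime


def current_rows(satellite_rows: list) -> dict:
    """Return the latest row per hash key, skipping keys whose latest change is a delete."""
    latest = {}
    for row in satellite_rows:
        key = row["hash_key"]
        if key not in latest or row["load_date"] > latest[key]["load_date"]:
            latest[key] = row
    # The delete rows stay in the satellite for auditing; they are only
    # filtered out of this "current state" view.
    return {k: r for k, r in latest.items() if r["change_type"] != "delete"}
```

The satellite itself remains insert-only: a delete arrives as one more row with its own load date, and the history before it stays queryable.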
Best Practices for Satellite Splitting in the Data Vault
When splitting data into satellites, follow these best practices:
- Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
- Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
- Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.
These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.
Conclusion
Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.
Meet the Speaker
Michael Olschimke
Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining.