Understanding Status Tracking Satellites in Data Vault
The integration of Change Data Capture (CDC) data into multi-active satellites and status tracking satellites is a nuanced topic. In a previous session, the focus was primarily on multi-active satellites, leaving status tracking satellites underexplored. This article will dive deeper into their utility, especially in the context of CDC data.
A status tracking satellite in Data Vault serves a specific purpose: it tracks the appearance, updates, and disappearance of business objects in the source system. However, if CDC data is available, this tracking becomes inherently simpler because CDC already provides explicit information about creates (C), updates (U), and deletes (D). Thus, creating a separate status tracking satellite may not be necessary.
In contrast, when dealing with full extracts (non-CDC data), a status tracking satellite can be invaluable. It enables you to derive creates, updates, and deletes by comparing consecutive extracts, identifying the first appearance (create), differences between records (update), and removal of records (delete). This can be achieved by maintaining a delta check mechanism and creating a robust satellite to store these events.
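The delta check described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each full extract fits in memory as a dictionary keyed by business key, and the function name `derive_status_events` and the event tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: one full extract is a mapping of business key -> descriptive attributes.
Extract = Dict[str, dict]
# Assumption: a status event is (business key, status code, load date).
StatusEvent = Tuple[str, str, str]


def derive_status_events(previous: Extract, current: Extract, load_date: str) -> List[StatusEvent]:
    """Compare two consecutive full extracts and derive C/U/D events."""
    events: List[StatusEvent] = []
    for key, attributes in current.items():
        if key not in previous:
            # First appearance of the business key: create.
            events.append((key, "C", load_date))
        elif attributes != previous[key]:
            # Key exists in both extracts but the attributes differ: update.
            events.append((key, "U", load_date))
    for key in previous:
        if key not in current:
            # Key was present before and is now missing: delete.
            events.append((key, "D", load_date))
    return sorted(events)


yesterday = {"A": {"name": "Alice"}, "B": {"name": "Bob"}, "C": {"name": "Carol"}}
today = {"A": {"name": "Alice"}, "B": {"name": "Robert"}, "D": {"name": "Dave"}}
events = derive_status_events(yesterday, today, "2024-01-02")
```

In a real load, the attribute comparison would more likely run on a hash of the descriptive columns inside the database than on in-memory dictionaries; the comparison logic stays the same.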
Handling Multi-Active Data in Status Tracking Satellites
Multi-active data arises when the same business key appears multiple times in the source system, with the records distinguished by another attribute (e.g., a technical ID). In these cases, status tracking satellites must include that additional attribute, so that a business object is not incorrectly marked as deleted when only one of its multi-active records changes or disappears.
For example, consider a scenario where a customer appears twice in the source system with different technical IDs but the same business key. A delete operation on one ID should not mark the customer as deleted entirely, because the record with the other ID still exists. To address this, a status tracking satellite should track status on a composite key combining the business key and the multi-active attribute.
This approach ensures that changes are tracked at the appropriate granularity, preserving the integrity of multi-active records. Additionally, adding the CDC information (CUD columns) directly to the main satellite can simplify tracking without requiring a separate status tracking satellite.
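To make the granularity point concrete, here is a small Python sketch under simplifying assumptions: each extract is reduced to a set of `(business key, technical ID)` pairs, and the helper names `deleted_records` and `fully_deleted_business_keys` are hypothetical.

```python
from typing import List, Set, Tuple

# Assumption: the composite key is (business key, multi-active attribute).
CompositeKey = Tuple[str, str]


def deleted_records(previous: Set[CompositeKey], current: Set[CompositeKey]) -> List[CompositeKey]:
    """Composite keys that were present in the previous extract but are gone now."""
    return sorted(previous - current)


def fully_deleted_business_keys(previous: Set[CompositeKey], current: Set[CompositeKey]) -> Set[str]:
    """Business keys for which no multi-active record remains at all."""
    remaining = {business_key for business_key, _ in current}
    return {business_key for business_key, _ in previous} - remaining


previous = {("CUST-1", "101"), ("CUST-1", "102"), ("CUST-2", "201")}
current = {("CUST-1", "101")}

gone = deleted_records(previous, current)
gone_entirely = fully_deleted_business_keys(previous, current)
```

Here the record `("CUST-1", "102")` is reported as deleted, but `CUST-1` itself is not, because the record with technical ID `101` still exists. Only `CUST-2` has disappeared entirely.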
Data Vault and Delta Lake: Complementary Approaches
A second question is whether Data Vault adds value when Delta Lake is already in use. To answer it, it's essential to understand the distinction between the two: Delta Lake is a technology, whereas Data Vault is a methodology. Delta Lake provides a robust framework for storing data that arrives in varied source formats (e.g., JSON, XML) and for managing deltas, but it does not prescribe how to model or process that data for business purposes.
Data Vault, on the other hand, excels in its structured, agile methodology for modeling data. It provides a clear architecture, including hubs, links, and satellites, which organize data effectively for analytics and reporting. This is where Data Vault complements Delta Lake by applying a methodical approach to the data stored in the lake.
In practice, Delta Lake can serve as the persistent staging area (landing zone) in a Data Vault architecture. The metadata and delta tracking capabilities of Delta Lake make loading and processing more efficient, while Data Vault ensures that the data is modeled and structured to meet business requirements. Because one is a technology and the other a methodology, the two complement each other rather than compete, and organizations can draw on the strengths of both.
Combining CDC Data with Data Vault and Delta Lake
By integrating CDC data, Delta Lake, and Data Vault, organizations can achieve an optimized data architecture. CDC data feeds directly into the layers commonly built on Delta Lake (bronze, silver, gold), which in turn populate the Data Vault's hubs, links, and satellites. This integration streamlines data ingestion, transformation, and querying while maintaining flexibility and scalability.
For instance, CDC data can directly populate status tracking satellites or be included in a main satellite for simplicity. Meanwhile, Delta Lake’s metadata features support efficient querying and analysis, enabling the Data Vault layer to focus on applying business logic and producing meaningful insights.
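As a rough illustration of carrying the CDC operation into a satellite, the following Python sketch maps CDC events to satellite rows. The classes `CdcEvent` and `SatelliteRow`, the column names `cdc_operation` and `is_deleted`, and the use of ISO date strings as load dates are all assumptions made for this example, not a prescribed Data Vault layout.

```python
from dataclasses import dataclass, field
from typing import List

VALID_OPERATIONS = {"C", "U", "D"}


@dataclass(frozen=True)
class CdcEvent:
    business_key: str
    operation: str   # "C" (create), "U" (update), or "D" (delete)
    changed_at: str  # assumption: ISO 8601 string, so string order is time order
    payload: dict = field(default_factory=dict)


@dataclass(frozen=True)
class SatelliteRow:
    business_key: str
    load_date: str
    cdc_operation: str  # hypothetical column holding the CDC code as delivered
    is_deleted: bool    # hypothetical convenience flag derived from the code
    attributes: dict


def to_satellite_rows(events: List[CdcEvent]) -> List[SatelliteRow]:
    """Turn CDC events into insert-only satellite rows, ordered per business key."""
    rows: List[SatelliteRow] = []
    for event in sorted(events, key=lambda e: (e.business_key, e.changed_at)):
        if event.operation not in VALID_OPERATIONS:
            raise ValueError(f"unknown CDC operation: {event.operation!r}")
        rows.append(
            SatelliteRow(
                business_key=event.business_key,
                load_date=event.changed_at,
                cdc_operation=event.operation,
                is_deleted=event.operation == "D",
                attributes=event.payload,
            )
        )
    return rows


rows = to_satellite_rows([
    CdcEvent("CUST-1", "D", "2024-01-03"),
    CdcEvent("CUST-1", "C", "2024-01-01", {"name": "Alice"}),
    CdcEvent("CUST-1", "U", "2024-01-02", {"name": "Alicia"}),
])
```

Because the operation code travels with each row, the latest row per business key already tells you whether the object still exists in the source, which is the information a separate status tracking satellite would otherwise provide.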
By combining these tools and methodologies, data teams can build robust, agile data platforms that support modern analytics and decision-making needs.
Meet the Speaker
Marc Winkelmann
Managing Consultant
Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been consulting on and implementing Data Vault 2.0 solutions for industry leaders in manufacturing, energy supply, and facility management. In 2020, he was appointed a Data Vault 2.0 instructor for Scalefree.