Single Point of Facts in Data Lakehouse Architecture
Welcome to another edition of DataVault Friday! Today, we’re diving into a frequently debated topic in data management: determining the “source of truth” in a data lakehouse architecture. Specifically, we’ll answer whether that source of truth resides in data lake files or in the tables of the Raw Vault and Business Vault. Additionally, we’ll address what to do if a bug in the ingestion framework requires a re-ingestion of data sources. Let’s explore these questions to better understand data lineage, data governance, and strategies for a reliable and flexible data ecosystem.
Understanding the “Single Point of Facts”
In traditional data warehousing, the idea of a “single version of the truth” is well-known. This concept implies that there is one version of the data that acts as the definitive source across an enterprise. For instance, a single “customer” or “product” definition applies universally within the organization.
However, in Data Vault architecture, we move from a “single version of the truth” to a “single point of facts.” The focus shifts from universal definitions to an unaltered, auditable record of events. Data Vault is designed to capture historical data accurately and reliably. It provides multiple perspectives on the data (versions of the truth) by isolating raw data from any business logic or transformations. This flexibility allows organizations to apply different business rules depending on context, while maintaining a consistent underlying dataset.
In this context, the Raw Vault is considered the foundational layer, capturing facts as they are, directly from the source systems. The Business Vault, on the other hand, introduces additional business rules, metrics, and aggregated data for reporting purposes. But in essence, the “single point of facts” remains within the Raw Vault because it represents an unaltered and auditable record.
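To make the "single point of facts" concrete, here is a minimal Python sketch of how a source record could land in a Raw Vault hub: the business key is hashed into a deterministic key, and load metadata is attached while the payload stays untouched. The table and column names are illustrative, and MD5-based hashing is one common Data Vault 2.0 convention, not a requirement.

```python
import hashlib
from datetime import datetime, timezone

def hub_hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys,
    following the common Data Vault 2.0 convention of hashing the
    upper-cased, delimiter-joined key parts (MD5 here; SHA-256 also works)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A raw record arriving from a source system (illustrative) ...
source_record = {"customer_id": "C-1001", "name": "Acme Corp"}

# ... becomes an unaltered Raw Vault hub entry: the hash key identifies the
# row, while load metadata preserves auditability.
hub_customer_row = {
    "customer_hash_key": hub_hash_key(source_record["customer_id"]),
    "customer_id": source_record["customer_id"],          # business key, as-is
    "load_date": datetime.now(timezone.utc).isoformat(),  # when we saw it
    "record_source": "crm.customers",                     # where it came from
}
print(hub_customer_row)
```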
Data Lakehouse Architecture and Points of Truth
In data lakehouse architecture, data is stored both in a data lake and within Data Vault tables. This raises the question: Which source is the ultimate truth? The data lake, with its raw files, or the Raw Vault tables?
The answer depends on the architectural requirements and the level of traceability and auditability needed. Ideally, both systems should mirror each other and serve as points of fact:
- Data Lake: The data lake serves as a repository for raw data files, often storing snapshots or full loads of data from source systems. This makes it easier to preserve the original data as-is without alteration.
- Raw Vault: In the Raw Vault, data is loaded into a structured schema, capturing the same original details but in a way that can be more systematically queried and analyzed. Like the data lake, the Raw Vault stores unmodified facts, but it also preserves lineage information, making it possible to reproduce deliveries and trace data transformations.
Since both layers should hold the same underlying data, they collectively represent the point of fact. Either the data lake or Raw Vault can serve as the truth source, depending on the scenario. This dual system ensures a resilient architecture, as data can be cross-validated across layers.
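As an illustration of how the two layers can hold the same underlying data, the following hypothetical Python sketch loads a raw snapshot file (simulated in memory here) into satellite-style rows: the payload is copied verbatim, and only lineage columns such as the load date and record source are added. The file path and column names are invented for the example.

```python
import csv
import hashlib
import io
from datetime import datetime, timezone

# Stand-in for a raw snapshot file in the data lake (normally read from
# object storage; the path below is purely illustrative).
raw_snapshot = io.StringIO(
    "customer_id,name,country\n"
    "C-1001,Acme Corp,DE\n"
    "C-1002,Globex,US\n"
)

load_date = datetime.now(timezone.utc).isoformat()
satellite_rows = []
for record in csv.DictReader(raw_snapshot):
    satellite_rows.append({
        **record,  # the payload is copied verbatim -- no business rules applied
        "hash_key": hashlib.md5(record["customer_id"].upper().encode()).hexdigest(),
        "load_date": load_date,                                 # lineage: when loaded
        "record_source": "lake/crm/customers/2024-01-15.csv",   # lineage: origin
    })

print(satellite_rows[0])
```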
What If There’s a Bug in the Ingestion Framework?
One key question that arises is: What happens if there’s a bug in the ingestion framework? Bugs such as incorrect population of business keys or other erroneous transformations might require a complete re-ingestion of data sources.
When dealing with bugs in data ingestion, having both a data lake and a Raw Vault allows flexibility and safeguards. Here’s how to address these issues:
- Identify and Isolate the Problem: Pinpoint where the issue occurred in the ingestion process and document the scope of the bug, especially if it affects business keys or other critical aspects of data integrity.
- Rely on the Data Lake for Original Files: Since the data lake contains the original, unaltered data files, you can reload the affected data from here into the Raw Vault. This ensures that any corrupted or inaccurately transformed data can be replaced without loss.
- Reprocess the Raw Vault: With the correct data now available from the data lake, reload the Raw Vault. Ensure that new ingestion processes are thoroughly tested to avoid repeating the error.
- Automate Audits and Reconciliation: Implement automated reconciliation checks between the data lake and the Raw Vault. Automated scripts can flag discrepancies, giving early warning of issues before they reach production or reporting layers (a minimal reconciliation sketch follows this list).
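A minimal reconciliation sketch, assuming both layers expose the same payload columns (Raw Vault metadata columns such as the load date would be excluded before comparison); the row data here is invented for illustration:

```python
import hashlib

def content_checksum(rows):
    """Order-independent checksum over a set of rows: hash each row's
    canonical string form, then XOR the digests so row order doesn't matter."""
    acc = 0
    for row in rows:
        digest = hashlib.md5(
            "||".join(f"{key}={row[key]}" for key in sorted(row)).encode()
        ).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc

lake_rows = [{"customer_id": "C-1001", "name": "Acme Corp"},
             {"customer_id": "C-1002", "name": "Globex"}]
vault_rows = [{"customer_id": "C-1002", "name": "Globex"},
              {"customer_id": "C-1001", "name": "Acme Corp"}]

# Two cheap checks that catch most ingestion bugs early:
assert len(lake_rows) == len(vault_rows), "row counts diverge"
assert content_checksum(lake_rows) == content_checksum(vault_rows), "content diverges"
print("lake and Raw Vault reconcile")
```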
By leveraging both the data lake and Raw Vault as points of fact, the architecture remains robust and auditable. This redundancy allows for re-ingestion without significant downtime and ensures that data lineage remains traceable throughout the lifecycle.
Data Lake vs. Raw Vault: Which Is Easier for Reconstruction?
When it comes to reconstructing deliveries, the data lake often offers simplicity. Since the data lake can hold raw files with minimal transformation, data reconstruction is a straightforward matter of accessing the original files. In contrast, reconstructing from the Raw Vault requires additional effort, as data must be accurately joined across hubs, links, and satellites, while preserving the original state.
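The extra effort on the Raw Vault side comes from the temporal join logic. The following self-contained Python sketch mimics that reconstruction with in-memory hub and satellite rows: for a given as-of date, it picks the latest satellite version per hub entry. Table and column names are illustrative, not a prescribed model.

```python
from datetime import date

# Illustrative Raw Vault tables: the hub holds the business key, the
# satellite holds descriptive attributes with one row per observed change.
hub_customer = [
    {"hk": "a1", "customer_id": "C-1001"},
]
sat_customer = [
    {"hk": "a1", "load_date": date(2024, 1, 1), "name": "Acme"},
    {"hk": "a1", "load_date": date(2024, 3, 1), "name": "Acme Corp"},
]

def reconstruct(as_of: date):
    """Rebuild the delivery as it looked on `as_of`: for each hub row,
    take the latest satellite row loaded on or before that date."""
    result = []
    for hub in hub_customer:
        versions = [s for s in sat_customer
                    if s["hk"] == hub["hk"] and s["load_date"] <= as_of]
        if versions:
            latest = max(versions, key=lambda s: s["load_date"])
            result.append({"customer_id": hub["customer_id"], "name": latest["name"]})
    return result

print(reconstruct(date(2024, 2, 1)))  # [{'customer_id': 'C-1001', 'name': 'Acme'}]
```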
That said, both layers should be auditable, with logging mechanisms that allow for a traceable history of changes. Having a clear data lineage in place allows organizations to meet compliance and audit requirements while supporting accurate reporting.
Best Practices for Managing Points of Fact
While it’s tempting to designate a single point of fact, the dual-layered approach with a data lake and Raw Vault provides a more resilient framework. Here are some best practices for managing points of fact in a data lakehouse architecture:
- Maintain Consistency Between Layers: Ensure that data lake files and Raw Vault tables match exactly. Automate reconciliation checks between these layers to verify data integrity.
- Implement Auditable Ingestion Processes: Document all transformations from the data lake to the Raw Vault, with logging and error-checking mechanisms. This allows for easier tracing of issues if they arise.
- Retain Original Data in the Data Lake: Always keep a copy of original files in the data lake. These files provide a reliable source of truth that can be referenced or reloaded into the Raw Vault if issues occur.
- Leverage Metadata for Automation: Metadata can streamline both ingestion and reconciliation. Use metadata to define business keys, relationships, and descriptive data in the Raw Vault, while automating verification processes (see the sketch after this list).
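One way to express such metadata is a catalog that a loader consults for hashing, deduplication, and reconciliation, so that every pipeline step uses the same key definitions. This is a hypothetical Python sketch; the source and table names are invented for the example.

```python
# Illustrative metadata catalog: one entry per source table, declaring the
# business key columns and the target Raw Vault objects. A generic loader
# can derive hash keys and reconciliation checks from this instead of
# hand-written, per-table code.
INGESTION_METADATA = {
    "crm.customers": {
        "business_keys": ["customer_id"],
        "hub": "hub_customer",
        "satellite": "sat_customer_details",
    },
    "erp.orders": {
        "business_keys": ["order_no", "company_code"],  # composite key
        "hub": "hub_order",
        "satellite": "sat_order_details",
    },
}

def business_key_columns(source_table: str) -> list[str]:
    """Look up the declared business keys so hashing, deduplication, and
    reconciliation all share one definition."""
    return INGESTION_METADATA[source_table]["business_keys"]

print(business_key_columns("erp.orders"))  # ['order_no', 'company_code']
```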
With these practices, data lakehouse architecture can be made robust, auditable, and resilient to changes or errors. By treating both the data lake and Raw Vault as points of fact, you ensure that your data ecosystem remains flexible, trustworthy, and ready to meet evolving business requirements.
Conclusion
The question of “single point of facts” in data lakehouse architecture doesn’t have a straightforward answer. Both the data lake and the Raw Vault act as points of fact, each offering unique benefits in terms of auditability and reconstruction. By utilizing both, you create a highly resilient system capable of withstanding data issues while providing a comprehensive, consistent view of your data.
In summary, while the Raw Vault may traditionally serve as the “single point of facts,” using both the data lake and the Raw Vault as truth sources creates a flexible architecture that can accommodate re-ingestion, mitigate risks, and support accurate reporting. With this dual approach, your data lakehouse architecture becomes a reliable foundation for modern data needs.
Meet the Speaker
Michael Olschimke
Michael has more than 15 years of experience in information technology. For the last eight years, he has specialized in business intelligence topics such as OLAP, dimensional modeling, and data mining. Challenge him with your questions!