
“Big Data”, “Data Lake”, “Data Swamp”, “Hybrid Architecture”, “NoSQL”, “Hadoop” … these are terms you encounter very often these days when dealing with data. The question also comes up whether you really need a data warehouse when you deal with a high variety and volume of data. We want to talk about what a data lake is, whether we need a data warehouse when using NoSQL platforms like Hadoop, and how this combines with Data Vault.

WHAT IS A DATA LAKE?

There is a proper definition from Tamara Dull (SAS): “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” 1

The last sentence, that the data structure and requirements are not defined until the data is needed, means that the structure is applied when the data is queried from the data lake. This is known as “schema on read”. The difference from a traditional RDBMS (schema on write) is that you don’t pre-define the structure of the data (files) when loading it into the NoSQL database. This does not mean that structure is unnecessary. On the contrary, structure is very important and required to make use of a data lake. It is just not defined in the data lake (NoSQL database) itself; instead, a schema is attached to the location of the files and applied when the data is read. If no structure is defined and you use a NoSQL platform like Hadoop, which is essentially just file storage, as a “landing zone”, it will become a data dump. Your data scientists or power users then have to spend a lot of effort digging through the dump to find anything of value.

For example, tools like Hive let you declare a schema for your queries and run them with MPP (massively parallel processing) on top of HDFS.
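
To make “schema on read” a bit more tangible, here is a minimal sketch in plain Python. It is not Hive and does not touch HDFS; it only illustrates the idea that the raw file is stored without any declared structure and that the reader attaches a schema when querying. The file content, the column names and the helper `read_with_schema` are assumptions made up for this illustration.

```python
import csv
import io

# Raw file content as it was landed in the lake: plain text, no declared types.
# (Illustrative data, not taken from any real system.)
RAW_FILE = "1001,Alice,2019-03-01,250.50\n1002,Bob,2019-03-02,99.90\n"

# The schema lives with the reader, not with the stored file.
# Column names and types are hypothetical.
ORDER_SCHEMA = [
    ("customer_id", int),
    ("customer_name", str),
    ("order_date", str),
    ("amount", float),
]


def read_with_schema(raw_text, schema):
    """Apply a schema to raw delimited text at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
        # Cast each raw string to the type the reader declared for its column.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


orders = read_with_schema(RAW_FILE, ORDER_SCHEMA)
total_amount = sum(row["amount"] for row in orders)
```

A different consumer could read the very same file with a different schema (for example, treating `customer_id` as a string), which is exactly why a lake without any agreed structure quickly turns into a dump.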

DO WE STILL NEED A DATA WAREHOUSE?

When you have a data lake with all your data, you might ask yourself whether you still need a data warehouse, or whether having a data lake already gives you one.

First, we have to compare the terms data lake and data warehouse. A data warehouse stores its data subject-oriented, time-variant and integrated by business key. A data lake is not subject-oriented at ingestion, is not integrated at all, and cannot handle CDC (Change Data Capture) or deltas, because you cannot update the content of files (you can only add a file or replace an existing one). From a technology and capability perspective, there are also differences between a traditional RDBMS (for data warehouses) and NewSQL / NoSQL platforms (for data lakes).
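
The consequence of immutable files can be sketched as follows. Because a landed file is never updated in place, a change to a record arrives as an additional file, and it is up to the reader to work out the current state per business key. This is only an illustration under assumed names: the records, the `customer_no` business key and the helper `current_state` are hypothetical and do not describe a specific platform.

```python
from datetime import date

# Each landed "file" is immutable once written; a change arrives as a new file.
# (Illustrative records with a hypothetical business key "customer_no".)
landed_files = [
    (date(2019, 3, 1), [{"customer_no": "C-1", "city": "Hannover"},
                        {"customer_no": "C-2", "city": "Berlin"}]),
    (date(2019, 3, 2), [{"customer_no": "C-1", "city": "Hamburg"}]),
]


def current_state(files, business_key):
    """Derive the latest record per business key at read time."""
    latest = {}
    # Process files in load order so later deliveries win over earlier ones.
    for load_date, records in sorted(files, key=lambda f: f[0]):
        for record in records:
            latest[record[business_key]] = {**record, "load_date": load_date}
    return latest


state = current_state(landed_files, "customer_no")
```

Nothing in `landed_files` was modified to reflect the move of customer `C-1`; the integration by business key happens entirely on the reading side, which is work a data warehouse does once, up front, for all consumers.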

Back to the question of whether we need a data warehouse. The short answer is: it depends. If you don’t have an existing data warehouse and all you do is data science work or exploring your data, you probably don’t want or need a data warehouse. But if you need structure, analysis, and multiple systems integrated by consistent business keys, and you want to tie your data back to the business processes by business key, then you definitely need a data warehouse. In summary: when you want to extract value from the data in the data lake, a data warehouse is a mature concept for doing so.

When a data warehouse already exists, a best practice is to augment the existing data warehouse (RDBMS) by integrating it with the data lake, or to use a hybrid architecture that replaces the relational staging area with an HDFS-based staging area which captures all unstructured and structured data.

1 https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
adopted from https://www.smartdatacollective.com/big-data-cheat-sheet-what-executives-want-know/

