The General Data Protection Regulation (GDPR) is a European Union (EU) law that became effective on May 25, 2018. It is designed to give the personal data of EU citizens a high level of protection. Companies around the world therefore have to be transparent about the data they hold, give individuals ownership of their data, and obtain a clear declaration of consent before storing and processing personal data. Although laws in countries outside the EU (especially the USA) tend to favor businesses over consumers, GDPR affects every company worldwide that holds personal data of EU citizens in its databases.
WHAT IS NEW WITH GDPR?
Being careful with personal data is nothing new, especially not in the EU. The key change in collecting and processing personal data is that the data is now completely under the control of its owner, who can require companies to delete or anonymize it, or request a copy of all of their personal data stored in the system. Personal data, or Personally Identifiable Information (PII), is data by which an individual can be identified, e.g. name, phone number, or email address.
Furthermore, it is no longer enough to announce the intention to store and process data in the company's general terms and conditions (which have to be accepted when, for example, signing up to purchase products). Consent to the collection and processing of personal data has to be requested separately and given explicitly. The penalties for breaching the regulation are severe: fines can reach up to 20 million euros or 4 % of annual revenue, whichever is higher.
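The "whichever is higher" rule for the upper limit of a fine can be illustrated with a few lines of Python. This is only a sketch of the arithmetic stated above; the function name and the use of whole euros are assumptions made for the example, and the actual fine in a given case is set by the supervisory authority.

```python
FIXED_CAP_EUR = 20_000_000   # fixed upper bound named in the text
REVENUE_SHARE_PERCENT = 4    # share of annual revenue named in the text


def max_gdpr_fine_eur(annual_revenue_eur: int) -> int:
    """Return the upper limit of a fine: the higher of the two caps.

    Hypothetical helper for illustration; amounts are whole euros.
    """
    revenue_cap = annual_revenue_eur * REVENUE_SHARE_PERCENT // 100
    return max(FIXED_CAP_EUR, revenue_cap)


# 4 % of 100 million is 4 million, so the fixed 20 million cap applies.
small_company_cap = max_gdpr_fine_eur(100_000_000)
# 4 % of 2 billion is 80 million, which exceeds the fixed cap.
large_company_cap = max_gdpr_fine_eur(2_000_000_000)
print(small_company_cap, large_company_cap)
```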
DATA VAULT 2.0 AS A SOLUTION
The easiest way would be to delete and block all people from the EU. That could be a solution for companies that have no customer base in the EU, but not for most companies in the world. Another approach is to mask the data, which means anonymizing the PII. Deleting or masking data after the fact does not just affect the Data Vault, but every layer of the data warehouse: the data/information marts (if not virtualized), the staging area, the source systems, and all other data stores, such as user spaces (due to managed Self-Service BI), NoSQL/Hadoop stores, data exports, data backups(!), and even files without a data model, e.g. documents, PDFs, or spreadsheets, to name just a few. This means that data lineage becomes absolutely vital.
Data Vault 2.0, as a fully auditable solution, can help you reduce the cost of deletion and masking processes. Some of its modeling concepts make it much easier to meet the requirements of GDPR. One approach is to split personal and non-personal data into different Satellites (Satellite Split), as shown in the figure below.
The unique Business Key (and its hashed value) is stored in the Hub, which is referenced by two Satellites via the Hash Key. One Satellite contains non-personal data, the other personal data. If a customer wants their personal data deleted, only the complete row(s) in the Satellite with personal data are affected; no column-wise deletion within a single table is needed. This approach only works when the Business Key itself does not contain PII and can be kept in the Hub to tie the descriptive data back to a business object.
If the Business Key contains PII (e.g. the email address), another approach has to be used. The modeling solution in the image below has a central Hub with the unique Business Key of the customer (and its hashed value), which is connected to additional Hubs for the individual businesses, each via a Link entity. The Link contains the Business Key from the main customer Hub and an artificial key that is used as the primary key in the additional Hub.
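The Satellite Split (one Hub, one Satellite with personal data, one with non-personal data) can be sketched with an in-memory SQLite database. All table and column names, the sample customer, the MD5-based `hash_key` helper, and the `erase_personal_data` function are assumptions made for this illustration, not part of any standard. The Business Key here is a customer number without PII, which is the precondition for this approach.

```python
import hashlib
import sqlite3


def hash_key(business_key: str) -> str:
    """Hash of the normalized Business Key (MD5 chosen for illustration)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY,
    customer_bk TEXT NOT NULL,
    load_date   TEXT NOT NULL
);
CREATE TABLE sat_customer_nonpii (
    customer_hk TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date   TEXT NOT NULL,
    segment     TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
CREATE TABLE sat_customer_pii (
    customer_hk TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date   TEXT NOT NULL,
    name        TEXT,
    email       TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")

# Sample data: one customer with one row in each Satellite.
hk = hash_key("C-1001")
conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?)", (hk, "C-1001", "2018-05-25"))
conn.execute("INSERT INTO sat_customer_nonpii VALUES (?, ?, ?)", (hk, "2018-05-25", "premium"))
conn.execute(
    "INSERT INTO sat_customer_pii VALUES (?, ?, ?, ?)",
    (hk, "2018-05-25", "Jane Doe", "jane.doe@example.com"),
)
conn.commit()


def erase_personal_data(connection: sqlite3.Connection, customer_bk: str) -> int:
    """Delete the customer's rows from the PII Satellite only; return the row count."""
    cur = connection.execute(
        "DELETE FROM sat_customer_pii WHERE customer_hk = ?", (hash_key(customer_bk),)
    )
    connection.commit()
    return cur.rowcount


deleted = erase_personal_data(conn, "C-1001")
pii_rows = conn.execute("SELECT COUNT(*) FROM sat_customer_pii").fetchone()[0]
nonpii_rows = conn.execute("SELECT COUNT(*) FROM sat_customer_nonpii").fetchone()[0]
hub_rows = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
print(deleted, pii_rows, nonpii_rows, hub_rows)
```

After the erase, the Hub row and the non-personal Satellite are untouched, so the remaining descriptive data can still be tied back to the business object.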
When customer data must be deleted for one business only and PII is used as the Business Key, just the Link entry and the descriptive attributes in the corresponding Satellite have to be deleted. The activity history remains available, can be used for analytical purposes, and is no longer traceable to the customer. This “business split” has the additional advantage that a deletion can be limited to one business, e.g. when each business belongs to a different subsidiary and only the car insurance data must be deleted. Keep in mind that deleting only the Business Key while keeping the Hash Key does not result in GDPR compliance (and does not meet the Data Vault 2.0 standard anyway, as the Business Key is used in Link tables). The Hash Key in Data Vault 2.0 is not used to encrypt data, but for performance reasons, so it can be recomputed from a known Business Key. The artificial key used in the Links and the business-driven Hubs discussed here, by contrast, cannot be calculated back, as it is a pure surrogate key. As soon as the customer wants to be deleted completely because he or she is no longer a customer of any of your businesses, you delete the record from the main Hub as well.
Otherwise, if there is no additional artificial key for the customer, you cannot tie your data back to an object (an anchor point) after deleting the PII data, which makes the data useless in many cases.
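Why a retained Hash Key does not anonymize anything can be shown in a few lines. The sketch below assumes an MD5 hash over the normalized email address; the `hash_key` helper and the sample addresses are hypothetical. Anyone who holds a list of candidate Business Keys can simply recompute the hashes and compare.

```python
import hashlib


def hash_key(business_key: str) -> str:
    """Hash of the normalized Business Key (MD5 chosen for illustration)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


# A Hash Key that stayed behind after only the Business Key was deleted.
remaining_hash_key = hash_key("jane.doe@example.com")

# Recomputing the hash for known candidate addresses re-identifies the person.
candidates = ["john.smith@example.com", "jane.doe@example.com", "max@example.com"]
matches = [c for c in candidates if hash_key(c) == remaining_hash_key]
print(matches)  # ['jane.doe@example.com']
```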
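The business split can be sketched in the same way. The example below assumes a main customer Hub keyed by an email address, a car insurance Hub keyed by a random surrogate, and a Link between them; all table names, the use of `uuid4` for the artificial key, and the `erase_from_car_business` helper are assumptions for illustration. The Link references the main Hub by its Hash Key here, which is also an assumption of this sketch.

```python
import hashlib
import sqlite3
import uuid


def hash_key(business_key: str) -> str:
    """Hash of the normalized Business Key (MD5 chosen for illustration)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_bk TEXT NOT NULL);
CREATE TABLE hub_car_customer (car_customer_sk TEXT PRIMARY KEY);
CREATE TABLE link_customer_car (customer_hk TEXT NOT NULL, car_customer_sk TEXT NOT NULL);
CREATE TABLE sat_car_customer_pii (car_customer_sk TEXT NOT NULL, name TEXT);
CREATE TABLE sat_car_activity (car_customer_sk TEXT NOT NULL, activity TEXT);
""")

email = "jane.doe@example.com"       # PII used as Business Key
customer_hk = hash_key(email)
car_sk = uuid.uuid4().hex            # random surrogate, not derived from the email

conn.execute("INSERT INTO hub_customer VALUES (?, ?)", (customer_hk, email))
conn.execute("INSERT INTO hub_car_customer VALUES (?)", (car_sk,))
conn.execute("INSERT INTO link_customer_car VALUES (?, ?)", (customer_hk, car_sk))
conn.execute("INSERT INTO sat_car_customer_pii VALUES (?, ?)", (car_sk, "Jane Doe"))
conn.execute("INSERT INTO sat_car_activity VALUES (?, ?)", (car_sk, "claim filed"))
conn.commit()


def erase_from_car_business(connection: sqlite3.Connection, customer_bk: str) -> None:
    """Remove the Link entry and the PII Satellite rows for one business only."""
    hk = hash_key(customer_bk)
    surrogate_keys = [
        row[0]
        for row in connection.execute(
            "SELECT car_customer_sk FROM link_customer_car WHERE customer_hk = ?", (hk,)
        )
    ]
    for sk in surrogate_keys:
        connection.execute("DELETE FROM sat_car_customer_pii WHERE car_customer_sk = ?", (sk,))
    connection.execute("DELETE FROM link_customer_car WHERE customer_hk = ?", (hk,))
    connection.commit()


erase_from_car_business(conn, email)

links_left = conn.execute("SELECT COUNT(*) FROM link_customer_car").fetchone()[0]
pii_left = conn.execute("SELECT COUNT(*) FROM sat_car_customer_pii").fetchone()[0]
activity_left = conn.execute("SELECT COUNT(*) FROM sat_car_activity").fetchone()[0]
main_hub_left = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
print(links_left, pii_left, activity_left, main_hub_left)
```

The activity rows stay attached to the surrogate key, which still serves as an anchor point for analysis, while nothing in the car insurance tables leads back to the email address. The record in the main Hub remains until the customer asks to be deleted completely.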