Multi-Active Satellites in Data Vault 2.0
With multi-active satellites, you’re able to store multiple active records for one business key. Depending on how the data arrives from your source, there are different ways to implement multi-activity in Data Vault 2.0. In this post, we’ll explain your options for modeling.
What are Multi-Active Satellites?
Multi-active satellites are similar to standard satellites and their structure. As said before, they store multiple active records per key at a point in time. This exact structure depends on the use case though.
See the example Data Vault model in Figure 1.
Figure 1: Data Vault model
When to use Multi-Active Satellites?
For this reason, multi-active satellites are suitable for multiple active records. So, we use multi-active satellites when the source system delivers them. Furthermore, it is also appropriate for use when the delivery mechanism doesn’t provide enough metadata, for example by using XML.
An Example
Let’s say you get your customer’s data delivered in semi-structured XML files. Within it are several phone numbers in an array. However, the metadata doesn’t provide more information. So the number, type, and order of the records are unknown. When flattening the XML file, you will receive as many records as the number of tuples in the array (figure 2). If you create a standard satellite out of these, only one row per customer business key can be active at the same time. Otherwise, you would run into a primary key violation having the same hash key and ldts.
Figure 2 – customer.xml
Different Options for Implementation
To avoid these problems, and also be able to query the active phone numbers at any point in time, there are several options. These depend on whether a “multi-active key” is available though, in addition to performance issues and the usability of it when you query the data. In the following section, we’ll briefly describe them. That said, in the next post, we’re going to go more into detail.
Multi-Active Satellites Rows
In case you don’t have a multi-active key and you want to use multi-active satellites, you need to create your own one by adding a sub-sequence number per business key in the staging area. In addition to the hash key and ldts, you need to add this to the primary key (figure 3). Here, though, you have to pay close attention if the data changes in the next load. For example, if the order of the phone numbers changes, you cannot assign a specific phone number to a sub-sequence.
Figure 3 – Multi-active satellite with sub-sequence
Type
If it is possible to define type codes and add them to the source feed, the best approach is to add these to the primary key of the satellite. Here, you rarely need to change types as well. In our example, we would define types like ‘home’, ‘business’, ‘cell’, and ‘fax’ as shown in Figure 4. But remember, this must delivered by the source system. If not, you have to use the sub-sequences.
Figure 4 – Type code solution
Another option to avoid changing types requires pivoting in which you create different columns per type. Here, you receive the same granularity as the hub. But, while also considering the need for reengineering in case of changing types of attributes.
Extra Hub
The last way to solve issues regarding multi-active records is to create an extra, weaker hub for the multi-active attribute, in this case, the phone number, and link them to the main hub. This requires that this attribute qualifies as a business key. Here, the number of hubs may increase heavily depending on how many multi-active data sets per business key you have. Furthermore, the new hub might not present as a real business object. So, while this is not the first way to go, it is a feasible solution in some cases.
Add the Multi-Active Attribute to the Composite Key
It’s not recommended to simply add the multi-active key to the business key. This will change the original grain and the semantic meaning. Which makes the results in that the original intention of the hub disappears.
Keep these rules for a Raw Data Vault hub in mind:
- Don’t super- or subtype in a hub
- Don’t mix different semantic meanings in a hub
- Don’t mix granularities in a hub
Conclusion
Implementing multi-active satellites can be used in many data vault projects, especially when working with semi-structured data. That said, the best practice in the case of multi-active records is to work with type codes that are added to the source feed. However, when they aren’t available, you need to use the sub-sequence. Though, as a last solution you can create an extra hub.
In the next post, we’re going to talk more about the advantages and disadvantages of the options and how exactly to implement them. Topics will be delta-checks, ldts, PITs and we’ll talk about a specific hash function to calculate multi-row hashdiffs.
– by Marvin Geerken (Scalefree)