Modeling Project Tasks and Actions in Data Vault

Modeling Project Tasks and Actions

In the world of Data Vault, creating effective data models to capture project tasks and actions is a crucial part of building scalable, efficient, and auditable data solutions. One of the challenges that data engineers face when building such models is ensuring that the history of project tasks and actions is captured correctly, while also accounting for the complexity of slowly changing data and continuous changes like those seen in Change Data Capture (CDC) environments.

In this article:

Understanding the Data Model
Challenges with the CDC Environment
Solution: Using a Non-Historized Link
- Handling Updates and Deletions
Building the Task Dimension
Optimizing the Approach
Conclusion
Watch the Video
Meet the Speaker

Understanding the Data Model

Let’s start by understanding the data model in question. We have three key tables:

Projects: This table contains details about various projects.
Tasks: This table contains tasks associated with each project. It has a foreign key linking back to the Projects table.
Actions: Each task can have one or more actions associated with it, and the Actions table has a foreign key linking back to the Tasks table.

The objective is to create a Project Dimension and a Task Dimension. The Task Dimension should include an additional attribute that represents the latest action associated with each task. To solve this, there are multiple approaches you could take, but two common ones stand out:

Multi-Active Satellite (MA-SAT) based on sets: The simplest approach would be to use a MA-SAT model based on sets, selecting the latest action for each task within those sets. However, this can be complex in a CDC environment.
Multi-Active Satellite (MA-SAT) based on an MA-Attribute: Instead of using sets, this model would use an action ID as an MA-Attribute, helping to simplify the approach while avoiding the issues created by capturing sets.

Challenges with the CDC Environment

When your source system uses Change Data Capture (CDC), the problem of handling sets becomes more complicated. In CDC, you’re dealing with a stream of changes where new records are added, existing records are updated, and old records may be deleted. This constant flow of data presents two key challenges:

Capturing sets is not straightforward: In a typical MA-SAT approach, a set captures all records related to a specific task or project, but with CDC data, you’re more likely to receive individual changes, rather than a full snapshot of all records at once.
Volume management: In a CDC environment, where actions are continually added, there is a risk that using sets will generate a massive volume of records. For example, if a new action is added to a task every day, this could result in an overwhelming number of records over time.

Given these challenges, using an MA-Attribute approach where the action ID serves as the key for the task’s latest action provides a more scalable solution. Instead of creating an ever-expanding set, you can focus on just the latest action for each task, keeping the volume under control.

Solution: Using a Non-Historized Link

The idea behind using a Non-Historized Link in this case is to capture actions as they come in from the source system, without the need for sets. In this setup, a Non-Historized Link captures the relationship between tasks and actions and includes the action ID. When new actions come in, they are inserted into the link table, and updates or deletions are handled as changes to the data. This provides a streamlined and efficient way to track the most recent actions associated with each task.

Handling Updates and Deletions

In CDC environments, when updates and deletions occur, it’s important to track these events accurately. The solution to this is technical counter transactions. When an update occurs in the source system, it’s treated as a deletion of the old version and an insert of the new version. This allows the system to track changes over time and ensures the history is accurately captured.

For deletions, we simply negate the value of the original record, ensuring the deleted record is not included in any aggregations. This negation technique is useful for tracking the current state of each task, action, or project.

Building the Task Dimension

The task dimension is where we’ll include the latest action for each task. To do this, we can use a technique involving a Point-in-Time (PIT) table, which captures the latest version of data (in this case, the latest action for each task) by linking the task hub to the appropriate satellite data.

Here’s how this works:

Create the PIT table for tasks: This table will store the latest version of each task, including references to the task satellite and the action satellite.
Use window functions: To retrieve the latest action for each task, you can apply window functions in your SQL queries to order by the business date (or load date) and select the latest action based on these values.

Once the PIT table is created, it can be used to build the Task Dimension, which will then reflect the latest action for each task.

Optimizing the Approach

One of the key considerations when implementing this model is performance. Using window functions and PIT tables is an effective way to manage the volume of records, but it’s important to ensure your database is optimized for these types of queries. In high-volume environments, applying business logic during the PIT table loading process can reduce the load time and make data delivery faster.

Conclusion

Modeling project tasks and actions in Data Vault requires careful consideration of the data flow, especially in environments that rely on CDC. By using Non-Historized Links, Multi-Active Satellites based on MA-Attributes, and PIT tables, you can create a scalable, efficient model that captures the latest actions associated with each task, without generating excessive volumes of data.

By applying these techniques, you can streamline the data flow, optimize performance, and ensure the data is always accurate and up-to-date, making it easier for your data users to access the information they need.

Watch the Video

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Modeling Project Tasks and Actions in Data Vault