In Part 1, we explored ETLBox, a .NET library designed for efficient data management. Without the need for a graphical interface, ETLBox simplifies data processes by providing an intuitive API, supporting various transformations and data sources, and ensuring optimal performance. It’s specifically tailored for the .NET framework and emphasizes stable ETL processes. We also discussed how to use ETLBox packages from NuGet and demonstrated how to process a CSV file of supplier data, emphasizing steps like validation and error management.
In this part, we’ll enhance this data flow. We’ll not only boost error logging but also tackle the challenge of managing repeated or varied supplier data loads. Our goal is to create a Slowly Changing Dimension Type 2 (SCD Type 2) to capture changes in the CSV data.
Errors are inevitable in any data processing task. Our solution in this example is to add a custom error message to each faulty record and store it in our pre-existing error file. If you are interested in a more sophisticated way of redirecting errors, you can also have a look at the LinkErrorTo methods provided on each data flow component. To keep this example simple, we will just focus on adding an error message to the data itself.
SCD Type 2 (Slowly Changing Dimension Type 2) is a methodology for tracking historical changes in data, and we want to improve our data flow to support it. With Type 2, the history of the data is preserved, and any change can be traced over time. To achieve this, each record in an SCD Type 2 table carries a validity range, defined by a ValidFrom and a ValidTo date column. When a record is inserted for the first time, its ValidFrom is set to a date in the past, before any of our transactional data starts, and its ValidTo to a date in the far future. When a change produces a new version of a record, the old record is given an “end date”, which is also the start date of the new record.
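As an illustration, suppose a supplier’s name changes on 2023-05-01 (the column names and boundary dates here are just examples):

```
Id | VendorCode | Name       | ValidFrom  | ValidTo
---+------------+------------+------------+-----------
 1 | V100       | Acme Ltd.  | 1900-01-01 | 2023-05-01
 2 | V100       | Acme Corp. | 2023-05-01 | 9999-12-31
```

The old row is closed with an end date of 2023-05-01, which is exactly the ValidFrom of the new row, so at any point in time exactly one version of the record is valid.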
To append an error message when a record becomes invalid, we add an ErrorMessage property to our VendorMaster class. At the same time, we modify the IsValid method to set an appropriate error message for such records.
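A minimal sketch of what this could look like — the property names other than ErrorMessage and the exact validation rule are assumptions, since they depend on the Part 1 code:

```csharp
public class VendorMaster
{
    // Columns read from the CSV file (names assumed)
    public string VendorCode { get; set; }
    public string Name { get; set; }

    // Not part of the CSV - filled when a record fails validation
    public string ErrorMessage { get; set; }

    public bool IsValid()
    {
        if (string.IsNullOrWhiteSpace(VendorCode))
        {
            ErrorMessage = "VendorCode is missing";
            return false;
        }
        return true;
    }
}
```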
Subsequently, we need to refine the links to our error destination.
Previously, when separating valid records from the invalid ones sent to the error destination, our approach was:
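The original linking could have looked roughly like this (the component names source, lookup and errorDestination are assumed from Part 1):

```csharp
// Valid rows go to the lookup, invalid rows to the error destination.
// Note that IsValid() is evaluated twice for every row.
source.LinkTo(lookup, row => row.IsValid());
source.LinkTo(errorDestination, row => !row.IsValid());
```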
Invoking the IsValid method twice may improve readability, but it can also cause unintended side effects. Importantly, when creating links between components using predicates, the order in which these links are established is crucial: predicates are tested in the order they were added, and once a predicate evaluates to true, the remaining ones are not examined. Since the first link already sends all valid rows to the lookup, we can simplify our code while keeping identical behaviour:
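A sketch of the simplified version, using the same assumed component names as above:

```csharp
// The first predicate already routes all valid rows to the lookup;
// every row reaching the second predicate must be invalid,
// so no second IsValid() call is needed.
source.LinkTo(lookup, row => row.IsValid());
source.LinkTo(errorDestination, row => true);
```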
Similarly, while linking the Lookup to the Distinct transformation, we want to embed an error message for records already present in our database. Here, for demonstration purposes, we will include the error message within our predicate method.
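One way to sketch this, assuming the lookup fills the DbId property for records it finds in the database (with 0 meaning “not found”):

```csharp
// New records (no database id yet) continue to the Distinct transformation
lookup.LinkTo(distinct, row => row.DbId == 0);

// Existing records get their error message set inside the predicate itself
lookup.LinkTo(errorDestination, row =>
{
    row.ErrorMessage = $"Record already exists in the database (Id: {row.DbId})";
    return true;
});
```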
Lastly, we want to attach an error message to rows that the Distinct transformation identifies as duplicates. We employ an extra RowTransformation to achieve this:
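A sketch of this step, assuming the Distinct transformation exposes its duplicate rows via a LinkDuplicatesTo method (check your ETLBox version for the exact API):

```csharp
// Tag every duplicate row with an error message ...
var markDuplicates = new RowTransformation<VendorMaster>(row =>
{
    row.ErrorMessage = "Duplicate record in source file";
    return row;
});

// ... and route the tagged rows into the error destination
distinct.LinkDuplicatesTo(markDuplicates);
markDuplicates.LinkTo(errorDestination);
```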
Previously, our aim was to exclude records that were already present in the database. With SCD Type 2, however, instead of filtering these records out, we should insert them with a fresh ValidFrom/ValidTo range and close the validity period of the existing record.
When we identify a pre-existing record, we will set its ValidFrom property to the current timestamp. For this task, we’ll use a row transformation:
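Sketched as a RowTransformation — the boundary dates are examples, and DbId > 0 is assumed to indicate a record the lookup found in the database:

```csharp
var setValidity = new RowTransformation<VendorMaster>(row =>
{
    if (row.DbId > 0)
        // The record already exists: the new version becomes valid now
        row.ValidFrom = DateTime.Now;
    else
        // Brand-new record: valid from before any transactional data starts
        row.ValidFrom = new DateTime(1900, 1, 1);

    // Open-ended until a newer version closes it
    row.ValidTo = new DateTime(9999, 12, 31);
    return row;
});
```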
Let’s incorporate this transformation into our data flow and remove the previous filtering of existing records:
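The rewired flow could then look like this, replacing the earlier predicate that filtered out existing records (component names assumed as before):

```csharp
lookup.LinkTo(setValidity);   // no more filtering on DbId
setValidity.LinkTo(distinct);
distinct.LinkTo(dbDestination);
```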
If we run our script now, the ValidFrom of records that already exist in the database will be set to the current timestamp. However, we still need to update the existing database record so that its ValidTo matches the new record’s ValidFrom. The database destination allows us to run custom logic after each batch of data has been written. This lets us use SQL with window functions to align the ValidFrom/ValidTo columns.
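A sketch of this idea, assuming a SQL Server database, that the database destination exposes an AfterBatchWrite hook, and a SqlTask helper for running plain SQL (check your ETLBox version for the exact signatures):

```csharp
dbDestination.AfterBatchWrite = batch =>
{
    // For every vendor, close each version at the ValidFrom of the next one
    SqlTask.ExecuteNonQuery(connectionManager, @"
        WITH Ordered AS (
            SELECT ValidFrom, ValidTo,
                   LEAD(ValidFrom) OVER (
                       PARTITION BY VendorCode
                       ORDER BY ValidFrom) AS NextValidFrom
            FROM VendorMaster
        )
        UPDATE Ordered
        SET ValidTo = NextValidFrom
        WHERE NextValidFrom IS NOT NULL;");
};
```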
Note that the SQL code above updates all entries in the entire table; for brevity, the script is not restricted to the entries of the current batch. If you want to limit the updates to the current batch, the batch variable gives you access to the objects of that batch, including all incoming data, so you could add the Id column to narrow down your SQL. Currently, the DbId property stores the Id values, but only for entries that matched the lookup. To retrieve all id values, consider reading back the auto-generated ids from your database:
With this setup, the batch variable grants access to all auto-generated Id values.
We’ve taken big steps to improve our data process. First, we added error messages: if something goes wrong with the data, we’ll know about it right away and can fix it. Then we introduced SCD Type 2, which preserves the history of our data, so we can track any change and understand when it happened. These improvements don’t just make our data look good; they make the whole process more reliable and easier to follow. We’re now set up for success!
So far, we’ve enhanced error logging and integrated the SCD Type 2 methodology. In Part 3, we’re taking it up a notch. Let’s challenge the constraints of static types and aim for a dataflow that’s entirely configurable. This means more flexibility, scalability, and adaptability without the rewrites. Stay tuned for a dynamic turn in our data management journey!