The Distinct transformation in ETLBox efficiently filters out duplicate records in your data flow. It operates by generating a unique hash value for each row based on specified properties. These hash values are stored internally; if a record with an identical hash appears, it is considered a duplicate and excluded. By default, the transformation considers all public properties for hash generation. However, you can specify particular properties using the DistinctColumn attribute for targeted filtering.
Distinct is a non-blocking transformation: each row is processed as soon as its hash has been generated, without waiting for the rest of the input. The trade-off is a slightly larger memory footprint, because a hash value is retained for each incoming row.
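The mechanism described above can be modelled in a few lines. The following Python sketch is not ETLBox code (ETLBox is a .NET library); it only illustrates the idea of hashing each row, remembering the hashes, and forwarding unseen rows immediately. The function names and the choice of SHA-256 are assumptions made for the illustration, not details taken from ETLBox.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def row_hash(row: Tuple) -> str:
    """Hash a row's values; SHA-256 is an illustrative choice only."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def distinct_stream(rows: Iterable[Tuple]) -> Iterator[Tuple]:
    """Yield each row the first time its hash is seen and skip later repeats."""
    seen_hashes = set()  # grows as new hashes arrive
    for row in rows:
        h = row_hash(row)
        if h in seen_hashes:
            continue  # hash already stored: treated as a duplicate
        seen_hashes.add(h)
        yield row  # forwarded right away; no rows are buffered


rows = [(1, "A"), (2, "A"), (1, "A"), (3, "B")]
result = list(distinct_stream(rows))
print(result)  # [(1, 'A'), (2, 'A'), (3, 'B')]
```

Because the generator yields inside the loop, a row can move downstream before the next one is read. Only the hashes are kept in memory, not the rows themselves.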
For strongly-typed objects, the DistinctColumn attribute marks the properties that together identify a record as unique. Properties without the attribute are ignored when rows are compared.
Note: The use of DistinctColumn attributes is optional. Without any specified attributes, all public properties are automatically used for distinctness checks.
To illustrate the Distinct transformation, consider a MyRow class with the properties Id, Value, and TestId, where Id and Value are marked as DistinctColumn. Records with identical Id and Value are identified as duplicates, while records that differ in either property are treated as unique. The TestId property is ignored in the comparison.
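The selection rule (marked properties if any exist, otherwise all of them) can be sketched as follows. This is a Python model of the behaviour, not the ETLBox attribute itself: the `distinct_column` helper, the `PlainRow` class, and the `distinct_key` function are hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass, field, fields


def distinct_column():
    """Hypothetical marker standing in for the DistinctColumn attribute."""
    return field(metadata={"distinct": True})


@dataclass
class MyRow:
    id: int = distinct_column()
    value: str = distinct_column()
    test_id: str = ""  # unmarked, so it is left out of the key


@dataclass
class PlainRow:
    # No field is marked, so every field takes part in the comparison.
    id: int
    value: str
    test_id: str


def distinct_key(row):
    """Build the comparison key from marked fields, or from all fields if none are marked."""
    marked = [f.name for f in fields(row) if f.metadata.get("distinct")]
    names = marked or [f.name for f in fields(row)]
    return tuple(getattr(row, name) for name in names)


same_with_marks = distinct_key(MyRow(1, "A", "Test1")) == distinct_key(MyRow(1, "A", "Test4"))
same_without_marks = distinct_key(PlainRow(1, "A", "Test1")) == distinct_key(PlainRow(1, "A", "Test4"))
print(same_with_marks, same_without_marks)  # True False
```

With the markers in place, two rows that differ only in TestId produce the same key. Without any markers, the same two rows count as different.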
Consider the following dataset for transformation:
| Id | Value | TestId |
|----|-------|--------|
| 1  | A     | Test1  |
| 2  | A     | Test2  |
| 2  | B     | Test3  |
| 1  | A     | Test4  |
| 2  | A     | Test5  |
| 3  | B     | Test6  |
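With the Distinct transformation applied, only the first row of each Id and Value combination is expected to reach the destination. Test4 repeats the combination (1, A) and Test5 repeats (2, A), so both are dropped:

| Id | Value | TestId |
|----|-------|--------|
| 1  | A     | Test1  |
| 2  | A     | Test2  |
| 2  | B     | Test3  |
| 3  | B     | Test6  |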
Distinct can also redirect duplicates into a separate data flow by using the LinkDuplicatesTo method. The output of this method can be linked to an entirely new data flow. To keep the previous example simple, the duplicates are routed directly to a MemoryDestination.
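The effect of sending duplicates to a second output can be modelled as below. This Python sketch is not the ETLBox API: `split_duplicates` is a hypothetical helper, and two plain lists stand in for the main destination and the duplicates destination. The rows are the six records of the example dataset, keyed on Id and Value.

```python
def split_duplicates(rows, key):
    """Return (kept, duplicates): first occurrences of each key and the later repeats."""
    kept, duplicates = [], []
    seen = set()
    for row in rows:
        k = key(row)
        if k in seen:
            duplicates.append(row)  # would flow to the duplicates output
        else:
            seen.add(k)
            kept.append(row)        # would flow to the regular output
    return kept, duplicates


rows = [
    (1, "A", "Test1"), (2, "A", "Test2"), (2, "B", "Test3"),
    (1, "A", "Test4"), (2, "A", "Test5"), (3, "B", "Test6"),
]
kept, duplicates = split_duplicates(rows, key=lambda r: (r[0], r[1]))
print([r[2] for r in kept])        # ['Test1', 'Test2', 'Test3', 'Test6']
print([r[2] for r in duplicates])  # ['Test4', 'Test5']
```

No row is lost in this model: every input row ends up in exactly one of the two outputs.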
The LinkDuplicatesTo method works like the other LinkTo methods: its output can be connected to different components, and predicates can determine which duplicates each linked component receives.
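Predicate-based routing of duplicates can be pictured with the following Python model. It is a sketch under stated assumptions, not ETLBox code: the `route` helper is hypothetical, each link is a (predicate, destination list) pair, and the predicates are chosen to be mutually exclusive so the result does not depend on how overlapping predicates would be resolved.

```python
def route(rows, links):
    """Append each row to the first destination whose predicate accepts it."""
    for row in rows:
        for predicate, destination in links:
            if predicate(row):
                destination.append(row)
                break  # assumption of this sketch: one destination per row


# The two duplicates from the example dataset (Id, Value, TestId).
duplicates = [(1, "A", "Test4"), (2, "A", "Test5")]

id_one, other_ids = [], []
route(duplicates, [
    (lambda r: r[0] == 1, id_one),     # duplicates with Id 1
    (lambda r: r[0] != 1, other_ids),  # all remaining duplicates
])
print(id_one)     # [(1, 'A', 'Test4')]
print(other_ids)  # [(2, 'A', 'Test5')]
```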
In summary, the Distinct transformation filters duplicate records out of a data flow by comparing hash values. The hash is computed from all public properties by default, or only from the properties marked with the DistinctColumn attribute. Duplicates are either dropped or redirected with LinkDuplicatesTo, which makes the transformation useful for ensuring data uniqueness in ETL workflows.