Distinct

The Distinct transformation in ETLBox filters out duplicate records in a data flow by evaluating each row for uniqueness. It uses a hash-based mechanism to compare values and retain only the first occurrence of each unique row. All other duplicates are discarded or optionally redirected to a separate destination.

Overview

  • Type: Non-blocking transformation
  • Execution: Row-by-row with in-memory hash tracking
  • Buffers: One input buffer (keeps track of hashes)

By default, Distinct uses all public properties to compute a row’s uniqueness. However, you can narrow this down using the [DistinctColumn] attribute or the DistinctColumns property to focus only on selected fields.

Buffer Mechanics

Distinct processes rows as they arrive and computes a hash to identify duplicates. These hash values are stored in memory, which can grow in size depending on the dataset and distinct criteria.

Using Attributes for Uniqueness

If using POCOs, you can mark specific fields with [DistinctColumn] to define uniqueness.

public class MyRow {
    [DistinctColumn]
    public int Id { get; set; }

    [DistinctColumn]
    public string Value { get; set; }

    public string TestId { get; set; } // Not considered for distinct comparison
}

Example: Basic Deduplication

This example uses the above MyRow class. Rows with the same Id and Value are treated as duplicates.

Input Data:

IdValueTestId
1ATest1
2ATest2
2BTest3
1ATest4
2ATest5
3BTest6

Output:

IdValueTestId
1ATest1
2ATest2
2BTest3
3BTest6
var source = new MemorySource<MyRow>();
source.DataAsList.Add(new MyRow() { Id = 1, Value = "A", TestId = "Test1" });
source.DataAsList.Add(new MyRow() { Id = 2, Value = "A", TestId = "Test2" });
source.DataAsList.Add(new MyRow() { Id = 2, Value = "B", TestId = "Test3" });
source.DataAsList.Add(new MyRow() { Id = 1, Value = "A", TestId = "Test4" });
source.DataAsList.Add(new MyRow() { Id = 2, Value = "A", TestId = "Test5" });
source.DataAsList.Add(new MyRow() { Id = 3, Value = "B", TestId = "Test6" });

var trans = new Distinct<MyRow>();
var dest = new MemoryDestination<MyRow>();

source.LinkTo(trans);
trans.LinkTo(dest);
Network.Execute(source);

foreach (var row in dest.Data)
    Console.WriteLine($"Id:{row.Id} Value:{row.Value} TestId:{row.TestId}");

//Output
/*
Id:1 Value:A TestId:Test1
Id:2 Value:A TestId:Test2
Id:2 Value:B TestId:Test3
Id:3 Value:B TestId:Test6
*/

Redirecting Duplicates

Duplicates can be routed to another data flow using LinkDuplicatesTo. This enables advanced handling such as logging, enrichment, or rerouting for further validation. For example, duplicates could be sent to a transformation chain that logs issues, stores them in a separate error table, or applies a different transformation logic depending on your data requirements.

var duplicateDest = new MemoryDestination<MyRow>();
trans.LinkDuplicatesTo(duplicateDest);

Output:

Duplicate - Id:1 Value:A TestId:Test4
Duplicate - Id:2 Value:A TestId:Test5

You can also apply predicates when redirecting duplicates:

trans.LinkDuplicatesTo(destDuplicates1, row => row.Id == 1);
trans.LinkDuplicatesTo(destDuplicates2, row => row.Id >= 2);

Dynamic Object Support

Distinct works with ExpandoObject. Use DistinctColumns to define which fields determine uniqueness.

var trans = new Distinct();
trans.DistinctColumns = new[] {
    new DistinctColumn() { DistinctPropertyName = "DistinctCol1" },
    new DistinctColumn() { DistinctPropertyName = "DistinctCol2" }
};
dynamic row = new ExpandoObject();
row.DistinctCol1 = 1;
row.DistinctCol2 = "A";
row.OtherValue = "Example";

You can still use LinkDuplicatesTo() with dynamic objects, including filtering by ExpandoObject properties.

Custom Uniqueness Logic

You can define a custom function to control how uniqueness is evaluated. This overrides attribute- or column-based logic.

trans.GetUniqueKeyFunc = row => row.Value.Substring(0, 1).ToLower();

This allows advanced matching (e.g., case-insensitive, substring-based, etc.).

Manual DistinctColumn Setup

Instead of using attributes, you can set DistinctColumns programmatically:

trans.DistinctColumns = new[] {
    new DistinctColumn() { DistinctPropertyName = "Col1" },
    new DistinctColumn() { DistinctPropertyName = "Col2" }
};

Setting DistinctColumns programmatically overrides any [DistinctColumn] attributes defined on your class.

Monitoring Metrics

The Distinct transformation provides the following runtime metrics to help track processing results:

  • DistinctCount The number of rows that were identified as distinct and passed to the output.

  • DuplicateCount The number of rows detected as duplicates and either discarded or redirected via LinkDuplicatesTo.

These counters are automatically reset on each execution and are useful for validation, logging, or audit purposes.

Example:

Console.WriteLine($"Distinct Rows: {trans.DistinctCount}");
Console.WriteLine($"Duplicate Rows: {trans.DuplicateCount}");