Distinct

The Distinct transformation in ETLBox removes duplicate records from a data flow. It computes a hash for each row, compares it against the hashes already seen, and passes on only the first occurrence of each unique row. Later duplicates are discarded or, optionally, redirected to a separate destination.

Overview

Type: Non-blocking transformation
Execution: Row-by-row with in-memory hash tracking
Buffers: One input buffer (keeps track of hashes)

By default, Distinct uses all public properties to compute a row’s uniqueness. However, you can narrow this down using the [DistinctColumn] attribute or the DistinctColumns property to focus only on selected fields.

Buffer Mechanics

Distinct processes rows as they arrive and computes a hash for each one to identify duplicates. These hash values are stored in memory, so memory use grows with the number of distinct values in the dataset under the chosen distinct criteria.
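
The tracking described above can be pictured with a plain HashSet. The following is a minimal sketch, not ETLBox's actual implementation: it stores the key values themselves rather than hashes, it uses only standard .NET types, and the names seenKeys, distinctRows, and duplicateRows are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the in-memory tracking: one entry per distinct key.
var seenKeys = new HashSet<(int Id, string Value)>();
var distinctRows = new List<(int Id, string Value, string TestId)>();
var duplicateRows = new List<(int Id, string Value, string TestId)>();

var input = new[] {
    (Id: 1, Value: "A", TestId: "Test1"),
    (Id: 2, Value: "A", TestId: "Test2"),
    (Id: 2, Value: "B", TestId: "Test3"),
    (Id: 1, Value: "A", TestId: "Test4"),
    (Id: 2, Value: "A", TestId: "Test5"),
    (Id: 3, Value: "B", TestId: "Test6")
};

foreach (var row in input)
{
    // HashSet.Add returns false when the key is already present,
    // so only the first occurrence of each (Id, Value) pair is kept.
    if (seenKeys.Add((row.Id, row.Value)))
        distinctRows.Add(row);
    else
        duplicateRows.Add(row);
}

Console.WriteLine($"Distinct: {distinctRows.Count}");   // Distinct: 4
Console.WriteLine($"Duplicates: {duplicateRows.Count}"); // Duplicates: 2
Console.WriteLine($"Keys held: {seenKeys.Count}");       // Keys held: 4
```

In this sketch the set holds one entry per distinct key, so six input rows leave four entries in memory. Each row is handled as it arrives, which matches the non-blocking, row-by-row behavior described in the overview.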

Using Attributes for Uniqueness

If you use POCOs, you can mark specific properties with [DistinctColumn] to define uniqueness. Properties without the attribute are ignored in the comparison.

public class MyRow {
    [DistinctColumn]
    public int Id { get; set; }

    [DistinctColumn]
    public string Value { get; set; }

    public string TestId { get; set; } // Not considered for distinct comparison
}

Example: Basic Deduplication

This example uses the above MyRow class. Rows with the same Id and Value are treated as duplicates.

Input Data:

Id	Value	TestId
1	A	Test1
2	A	Test2
2	B	Test3
1	A	Test4
2	A	Test5
3	B	Test6

Output:

Id	Value	TestId
1	A	Test1
2	A	Test2
2	B	Test3
3	B	Test6

var source = new MemorySource<MyRow>();
source.DataAsList.Add(new MyRow() { Id = 1, Value = "A", TestId = "Test1" });
source.DataAsList.Add(new MyRow() { Id = 2, Value = "A", TestId = "Test2" });
source.DataAsList.Add(new MyRow() { Id = 2, Value = "B", TestId = "Test3" });
source.DataAsList.Add(new MyRow() { Id = 1, Value = "A", TestId = "Test4" });
source.DataAsList.Add(new MyRow() { Id = 2, Value = "A", TestId = "Test5" });
source.DataAsList.Add(new MyRow() { Id = 3, Value = "B", TestId = "Test6" });

var trans = new Distinct<MyRow>();
var dest = new MemoryDestination<MyRow>();

source.LinkTo(trans);
trans.LinkTo(dest);
Network.Execute(source);

foreach (var row in dest.Data)
    Console.WriteLine($"Id:{row.Id} Value:{row.Value} TestId:{row.TestId}");

//Output
/*
Id:1 Value:A TestId:Test1
Id:2 Value:A TestId:Test2
Id:2 Value:B TestId:Test3
Id:3 Value:B TestId:Test6
*/

Redirecting Duplicates

Duplicates can be routed to another data flow using LinkDuplicatesTo. This enables advanced handling such as logging, enrichment, or rerouting for further validation. For example, duplicates could be sent to a transformation chain that logs issues, stores them in a separate error table, or applies a different transformation logic depending on your data requirements.

var duplicateDest = new MemoryDestination<MyRow>();
trans.LinkDuplicatesTo(duplicateDest);

Rows received by duplicateDest (shown here with a "Duplicate - " prefix):

Duplicate - Id:1 Value:A TestId:Test4
Duplicate - Id:2 Value:A TestId:Test5

You can also apply predicates when redirecting duplicates:

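var destDuplicates1 = new MemoryDestination<MyRow>();
var destDuplicates2 = new MemoryDestination<MyRow>();
trans.LinkDuplicatesTo(destDuplicates1, row => row.Id == 1);
trans.LinkDuplicatesTo(destDuplicates2, row => row.Id >= 2);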
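
To see which duplicate would match which predicate, the two conditions can be applied by hand to the duplicates from the basic example. This is a sketch in plain .NET that does not call ETLBox; the lists idIsOne and idAtLeastTwo are hypothetical stand-ins for the two destinations, and the predicates are assumed not to overlap.

```csharp
using System;
using System.Collections.Generic;

// The two rows identified as duplicates in the basic example.
var duplicates = new[] {
    (Id: 1, Value: "A", TestId: "Test4"),
    (Id: 2, Value: "A", TestId: "Test5")
};

// Hypothetical stand-ins for destDuplicates1 and destDuplicates2.
var idIsOne = new List<string>();
var idAtLeastTwo = new List<string>();

foreach (var row in duplicates)
{
    // Same conditions as the predicates passed to LinkDuplicatesTo.
    if (row.Id == 1) idIsOne.Add(row.TestId);
    if (row.Id >= 2) idAtLeastTwo.Add(row.TestId);
}

Console.WriteLine("Id == 1: " + string.Join(", ", idIsOne));      // Id == 1: Test4
Console.WriteLine("Id >= 2: " + string.Join(", ", idAtLeastTwo)); // Id >= 2: Test5
```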

Dynamic Object Support

Distinct works with ExpandoObject. Use DistinctColumns to define which fields determine uniqueness.

var trans = new Distinct();
trans.DistinctColumns = new[] {
    new DistinctColumn() { DistinctPropertyName = "DistinctCol1" },
    new DistinctColumn() { DistinctPropertyName = "DistinctCol2" }
};

dynamic row = new ExpandoObject();
row.DistinctCol1 = 1;
row.DistinctCol2 = "A";
row.OtherValue = "Example";

You can still use LinkDuplicatesTo() with dynamic objects, including filtering by ExpandoObject properties.
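
Filtering on ExpandoObject properties relies on the fact that ExpandoObject implements IDictionary<string, object>, so values can be read by name. The sketch below uses only standard .NET types and does not call ETLBox. The predicate firstColIsOne is a hypothetical example of a filter condition, and the composite key is only an illustration of how named columns can be combined, not ETLBox's internal format.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

dynamic row = new ExpandoObject();
row.DistinctCol1 = 1;
row.DistinctCol2 = "A";
row.OtherValue = "Example";

// ExpandoObject exposes its members through IDictionary<string, object>.
var fields = (IDictionary<string, object>)row;

// Build an illustrative composite key from the selected columns only.
var keyColumns = new[] { "DistinctCol1", "DistinctCol2" };
var keyParts = new List<string>();
foreach (var name in keyColumns)
    keyParts.Add(fields.TryGetValue(name, out var value) ? Convert.ToString(value) : "");
string key = string.Join("|", keyParts);
Console.WriteLine(key); // 1|A

// Hypothetical filter: true when DistinctCol1 holds the integer 1.
Predicate<ExpandoObject> firstColIsOne = r =>
    ((IDictionary<string, object>)r).TryGetValue("DistinctCol1", out var v)
    && v is int n && n == 1;

bool matches = firstColIsOne((ExpandoObject)row);
Console.WriteLine(matches); // True
```

OtherValue does not appear in the key, which mirrors the idea that only the listed columns decide uniqueness.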

Custom Uniqueness Logic

You can define a custom function to control how uniqueness is evaluated. This overrides attribute- or column-based logic.

// Rows whose Value starts with the same letter, ignoring case, count as duplicates
trans.GetUniqueKeyFunc = row => row.Value.Substring(0, 1).ToLower();

This allows advanced matching (e.g., case-insensitive, substring-based, etc.).
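
The effect of such a key function can be checked on a few plain strings. This is a sketch in standard .NET without ETLBox; the sample values and the names getKey, seen, and kept are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Same key logic as above: the first character, lower-cased.
Func<string, string> getKey = value => value.Substring(0, 1).ToLower();

var values = new[] { "Apple", "apricot", "Banana", "blueberry", "Cherry" };
var seen = new HashSet<string>();
var kept = new List<string>();

foreach (var v in values)
{
    // Keep a value only if its key has not been seen before.
    if (seen.Add(getKey(v)))
        kept.Add(v);
}

Console.WriteLine(string.Join(", ", kept)); // Apple, Banana, Cherry
```

Note that Substring(0, 1) throws for an empty string and the call fails on a null Value, so a key function like this needs a guard if such rows can occur.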

Manual DistinctColumn Setup

Instead of using attributes, you can set DistinctColumns programmatically:

trans.DistinctColumns = new[] {
    new DistinctColumn() { DistinctPropertyName = "Col1" },
    new DistinctColumn() { DistinctPropertyName = "Col2" }
};

Setting DistinctColumns programmatically overrides any [DistinctColumn] attributes defined on your class.

Monitoring Metrics

The Distinct transformation provides the following runtime metrics to help track processing results:

DistinctCount: The number of rows that were identified as distinct and passed to the output.
DuplicateCount: The number of rows detected as duplicates and either discarded or redirected via LinkDuplicatesTo.

These counters are automatically reset on each execution and are useful for validation, logging, or audit purposes.

Example:

Console.WriteLine($"Distinct Rows: {trans.DistinctCount}");
Console.WriteLine($"Duplicate Rows: {trans.DuplicateCount}");
