Merge Join
The MergeJoin
transformation in ETLBox allows you to combine rows from two separate input streams into one output. It functions similarly to a merge or join operation in relational databases. Each side of the join — left and right — is linked to a source or upstream transformation. You then define a function that describes how to merge the records from each side.
The MergeJoin
can operate in two modes:
- Always Join: Each left row is joined with the corresponding right row in order.
- Conditional Join with Comparison: Records are joined only if they meet a specified condition.
Buffering Behavior
- The
MergeJoin
has one input buffer for each input (LeftInput
andRightInput
). - It behaves as non-blocking when performing an always join.
- It behaves as partially blocking when using a comparison function, buffering rows until a matching pair is found.
Always Join
This is the default mode when no comparison function is set. The component joins records from the left and right inputs in the order they arrive. This mode is best suited when both inputs have the same number of rows and are naturally aligned.
If one input has more rows than the other, extra rows are joined with null
.
Example: Always Join
public class MyLeftRow {
public string FirstName { get; set; }
}
public class MyRightRow {
public string LastName { get; set; }
}
public class MyOutputRow {
public string FullName { get; set; }
}
public static void Main() {
var source1 = new MemorySource<MyLeftRow>();
source1.DataAsList.Add(new MyLeftRow() { FirstName = "Elvis" });
source1.DataAsList.Add(new MyLeftRow() { FirstName = "Marilyn" });
var source2 = new MemorySource<MyRightRow>();
source2.DataAsList.Add(new MyRightRow() { LastName = "Presley" });
source2.DataAsList.Add(new MyRightRow() { LastName = "Monroe" });
var join = new MergeJoin<MyLeftRow, MyRightRow, MyOutputRow>(
(leftRow, rightRow) => {
return new MyOutputRow() {
FullName = leftRow.FirstName + " " + rightRow.LastName
};
});
var dest = new MemoryDestination<MyOutputRow>();
source1.LinkTo(join.LeftInput);
source2.LinkTo(join.RightInput);
join.LinkTo(dest);
Network.Execute(source1, source2);
foreach (var row in dest.Data)
Console.WriteLine(row.FullName);
//Outputs:
//Elvis Presley
//Marilyn Monroe
}
Join with Comparison
The MergeJoin
transformation allows you to specify a match and comparison function that determines which records should be joined. Only if both records match according to the comparison function will they be joined. Otherwise, one of the records will be passed through with a null
as a placeholder for the unmatched side.
For optimal performance, the MergeJoin
requires sorted input from both sources. The rows should be ordered based on the property used in the comparison. You can pass a ComparisonFunc<TInput1, TInput2, int>
delegate to the MergeJoin
, which returns an integer based on the comparison between the two input records. This function follows the standard comparison delegate pattern.
- If the
ComparisonFunc
returns0
, both records are considered a match and are joined. The joined result is sent to the output. - If the
ComparisonFunc
returns a value less than0
, the record from the left input is considered to come before the right one. The left record is sent to the output along with anull
as a placeholder for the right. - If the
ComparisonFunc
returns a value greater than0
, the record from the right input comes first. The right record is sent to the output with anull
placeholder for the left.
To summarize: when the comparison result is 0
, both rows are joined and passed on. If not, the record from one side is passed along with a null
for the unmatched side.
If either input source has unmatched trailing rows, those records will be sent to the output with a null value for the other side — even when using a comparison function.
Example: Join with Comparison
public class MyRow {
public int Id { get; set; }
public string FirstName { get; set; }
public string LastName { get; set; }
public string FullName { get; set; }
}
public void JoinWithComparisonExample() {
var source1 = new MemorySource<MyRow>();
source1.DataAsList.Add(new MyRow() { Id = 1, FirstName = "Elvis" });
source1.DataAsList.Add(new MyRow() { Id = 2, FirstName = "Psy" });
source1.DataAsList.Add(new MyRow() { Id = 3, FirstName = "Marilyn" });
var source2 = new MemorySource<MyRow>();
source2.DataAsList.Add(new MyRow() { Id = 1, LastName = "Presley" });
source2.DataAsList.Add(new MyRow() { Id = 3, LastName = "Monroe" });
var join = new MergeJoin<MyRow>(
(leftRow, rightRow) => {
if (rightRow == null)
leftRow.FullName = leftRow.FirstName + " " + "Unknown";
else
leftRow.FullName = leftRow.FirstName + " " + rightRow.LastName;
return leftRow;
});
join.ComparisonFunc = (inputRow1, inputRow2) => {
if (inputRow1.Id == inputRow2.Id)
return 0;
else if (inputRow1.Id < inputRow2.Id)
return -1;
else
return 1;
};
var dest = new MemoryDestination<MyRow>();
source1.LinkTo(join.LeftInput);
source2.LinkTo(join.RightInput);
join.LinkTo(dest);
Network.Execute(source1, source2);
foreach (var row in dest.Data)
Console.WriteLine(row.FullName);
//Outputs:
//Elvis Presley
//Psy Unknown
//Marilyn Monroe
}
When handling join logic, always check for null
in either the leftRow
or rightRow
inside the MergeJoinFunc
, as one side may be unmatched (depending on the comparison result or unmatched trailing rows in one input source).
Sorting Requirements
When using a ComparisonFunc
, both inputs should be sorted on the matching key. This is critical to ensure correct and predictable behavior.
You can use the Sort transformation to sort the data explicitly if your source does not provide sorted output.
The MergeJoin
does not validate sort order; incorrect sorting can lead to incorrect joins or missing matches.
Types and Constructors
The MergeJoin
supports:
- Different types for left input (
TInput1
), right input (TInput2
), and output (TOutput
) - A simplified version:
MergeJoin<T>
for same input/output types - A dynamic version:
MergeJoin
(usesExpandoObject
)
You can create it using:
var join = new MergeJoin<TInput1, TInput2, TOutput>();
join.MergeJoinFunc = (left, right) => { ... };
or using the constructor directly:
var join = new MergeJoin<TInput, TInput, TInput>((left, right) => { ... });
Here is the additional section for Dynamic Object Support in MergeJoin
, including a complete example. You can place this section before the Summary section in the documentation.
Dynamic Object Support
The MergeJoin
also works with dynamic objects (ExpandoObject
). This is especially useful when the schema is not known at compile time or varies across datasets. The dynamic version of MergeJoin
allows you to work with property names as strings at runtime.
Example: Dynamic MergeJoin
var source1 = new MemorySource();
dynamic left1 = new ExpandoObject();
left1.FirstName = "Elvis";
dynamic left2 = new ExpandoObject();
left2.FirstName = "Marilyn";
source1.DataAsList.Add(left1);
source1.DataAsList.Add(left2);
var source2 = new MemorySource();
dynamic right1 = new ExpandoObject();
right1.LastName = "Presley";
dynamic right2 = new ExpandoObject();
right2.LastName = "Monroe";
source2.DataAsList.Add(right1);
source2.DataAsList.Add(right2);
var join = new MergeJoin(
(left, right) => {
dynamic output = new ExpandoObject();
output.FullName = $"{((dynamic)left).FirstName} {((dynamic)right).LastName}";
return output;
});
var dest = new MemoryDestination();
source1.LinkTo(join.LeftInput);
source2.LinkTo(join.RightInput);
join.LinkTo(dest);
Network.Execute(source1, source2);
foreach (dynamic row in dest.Data)
Console.WriteLine(row.FullName);
// Output:
// Elvis Presley
// Marilyn Monroe
You can also use ExpandoObject
with a ComparisonFunc
if you want conditional joining, using similar runtime property access.