Blocking Transformations

Blocking transformations accumulate input before producing output. They are essential for operations that require access to full datasets or grouped subsets, such as sorting, aggregation, or batching.

Batch-Oriented and Buffered Processing

Unlike non-blocking transformations, which pass each row on as soon as it has been processed, blocking components must buffer some data before they can produce output. How much they buffer depends on their function:

  • Some wait for all rows (e.g., Sort).
  • Some process batches of rows (e.g., BatchTransformation).
  • Some only buffer partial aggregates (e.g., Aggregation).

Buffering

All blocking transformations include an input buffer and an internal buffer to temporarily hold data. The internal buffer size and structure depend on the transformation:

  • A Sort may hold all rows in memory.
  • An Aggregation may store only grouped values.
  • A BatchTransformation manages rows in fixed-size batches.

Some transformations may also include an output buffer, though this varies by component.

The MaxBufferSize setting configures buffer limits, but its effect depends on the transformation:

  • It may be ignored (e.g., Sort).
  • It may affect only the output buffer (e.g., Aggregation).
  • It may apply to all buffers (e.g., BatchTransformation).

The limit can be set on a single component or as a global default:

transformation.MaxBufferSize = 50000; // Component-specific
Settings.MaxBufferSize = 10000;       // Global default

Non-Blocking vs Blocking

ETLBox distinguishes between:

  • Non-blocking transformations: Stream and process rows immediately.
  • Blocking transformations: Accumulate data before producing output.

Blocking is required for operations such as sorting, aggregating, or merging across full datasets, where an output row can depend on input rows that have not arrived yet.
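
The difference can be illustrated with plain .NET iterators. This is a minimal sketch and does not use ETLBox components; the names Source, Double, and SortAll are hypothetical helpers introduced only for this example. It records the order in which rows are read and written for a streaming step and for a blocking sort.

```csharp
using System;
using System.Collections.Generic;

// Plain .NET iterators, not ETLBox components: all names are illustrative only.
static IEnumerable<int> Source(List<string> log)
{
    foreach (var value in new[] { 3, 1, 2 })
    {
        log.Add($"read {value}");
        yield return value;
    }
}

// Non-blocking: each row is passed on as soon as it has been processed.
static IEnumerable<int> Double(IEnumerable<int> rows)
{
    foreach (var row in rows)
        yield return row * 2;
}

// Blocking: every row is buffered before the first one can be emitted.
static IEnumerable<int> SortAll(IEnumerable<int> rows)
{
    var buffer = new List<int>(rows); // consumes the whole input
    buffer.Sort();
    foreach (var row in buffer)
        yield return row;
}

var streamLog = new List<string>();
foreach (var row in Double(Source(streamLog)))
    streamLog.Add($"write {row}");

var blockLog = new List<string>();
foreach (var row in SortAll(Source(blockLog)))
    blockLog.Add($"write {row}");

Console.WriteLine(string.Join(", ", streamLog));
// read 3, write 6, read 1, write 2, read 2, write 4
Console.WriteLine(string.Join(", ", blockLog));
// read 3, read 1, read 2, write 1, write 2, write 3
```

In the streaming case, reads and writes interleave. In the blocking case, all reads finish before the first write.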

See also: Non-Blocking Transformations Overview

Summary of Blocking Transformations

| Transformation            | Description                                         |
|---------------------------|-----------------------------------------------------|
| Aggregation               | Groups and aggregates rows based on keys            |
| BatchTransformation       | Processes rows in complete batches                  |
| CachedBatchTransformation | Batch transform with access to cached data          |
| CrossJoin                 | Performs a Cartesian join between two input streams |
| Pivot                     | Transforms row-level data into column-oriented output |
| Sort                      | Sorts all rows using a comparison function          |
| WaitTransformation        | Holds execution until one or more sources complete  |

Descriptions

Aggregation: Groups rows by a key and aggregates their values. Buffers only the aggregation state for each group, so memory use depends on the number of groups rather than the number of input rows.
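
As a rough illustration of buffering aggregation state only, the following sketch keeps one running total per key in a dictionary. It is plain .NET code under assumed sample data, not the ETLBox Aggregation API; the Region and Amount fields are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample rows: (group key, amount).
var rows = new (string Region, decimal Amount)[]
{
    ("EU", 10m), ("US", 5m), ("EU", 7m), ("US", 1m), ("EU", 3m),
};

// One running total per key; the input rows themselves are not retained.
var totals = new Dictionary<string, decimal>();
foreach (var (region, amount) in rows)
{
    totals.TryGetValue(region, out var current); // current is 0 for a new key
    totals[region] = current + amount;
}

// Results are emitted only after the last row, because any later row
// could still change a group's total.
foreach (var (region, total) in totals)
    Console.WriteLine($"{region}: {total}");
```

Five input rows collapse into two stored values here (a total of 20 for EU and 6 for US), which is why the buffer stays small even for large inputs with few distinct keys.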

BatchTransformation: Collects input rows into batches and processes them as a group.
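
The batching idea can be sketched in plain .NET as follows. InBatches is a hypothetical helper written for this example, not an ETLBox method, and the batch size of 3 is an arbitrary assumption.

```csharp
using System;
using System.Collections.Generic;

// Groups incoming rows into lists of at most batchSize rows.
static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> rows, int batchSize)
{
    var batch = new List<T>(batchSize);
    foreach (var row in rows)
    {
        batch.Add(row);
        if (batch.Count == batchSize)
        {
            yield return batch;              // hand over a full batch
            batch = new List<T>(batchSize);  // start collecting the next one
        }
    }
    if (batch.Count > 0)
        yield return batch; // final, possibly smaller batch
}

foreach (var batch in InBatches(new[] { 1, 2, 3, 4, 5, 6, 7 }, 3))
    Console.WriteLine(string.Join(",", batch));
// prints 1,2,3 then 4,5,6 then 7
```

Only one batch is held at a time, so this kind of step blocks per batch rather than for the whole input.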

CachedBatchTransformation: Extends batch processing by using cached data for enrichment or merging logic.

CrossJoin: Joins every row from one input with every row from the other. Buffers one side of the join entirely.
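
A minimal plain .NET sketch of a Cartesian join that buffers one side is shown below. The CrossJoin helper here is hypothetical and does not reflect the ETLBox component's API. Buffering the first argument is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;

// Pairs every row of one input with every row of the other.
static IEnumerable<(TLeft, TRight)> CrossJoin<TLeft, TRight>(
    IEnumerable<TLeft> buffered, IEnumerable<TRight> streamed)
{
    var buffer = new List<TLeft>(buffered); // this side is held fully in memory
    foreach (var right in streamed)         // this side is read one row at a time
        foreach (var left in buffer)
            yield return (left, right);
}

var sizes = new[] { "S", "M" };
var colors = new[] { "red", "green", "blue" };
foreach (var (size, color) in CrossJoin(sizes, colors))
    Console.WriteLine($"{size}-{color}");
// prints six combinations: S-red, M-red, S-green, M-green, S-blue, M-blue
```

Because only one side needs to fit in memory, choosing the smaller input as the buffered side keeps the memory footprint low.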

Pivot: Converts grouped rows into columns. Requires buffering of grouped input.

Sort: Buffers all input rows and produces sorted output when all data is received.

WaitTransformation: Pauses execution until one or more input components complete. Useful for orchestration.