Parquet
The parquet connector pack allow you to read or write data from or into parquet files.
Parquet source
The ETLBox.Parquet
package include the ParquetSource
and ParquetDestination
classes.
The csv connector package is based on the Parquet .NET library created by Elastacloud . The parquet source let you read data from a parquet file.
Parquet is an open source file format, which stores data in a columnar storage format (in opposite to e.g. csv files, where data is stored row based.) While storing data in a column based manner has many advantages regarding efficiency and storage size, parquet data needs to be read as rows from the files in order to work with ETLBox.
The ParquetSource
will read all data from the parquet file as rows - starting with the first row until the end of the file (or a defined limit). Internally, the columnar format is translated into rows while reading data.
Note
Example
Let’s assume we have a parquet file “Demo.parquet” that has 2 columns, named Col1
and Col2
. Here is a simple example to read data all data from this parquet file and write it into an in-memory list.
public class MyRow
{
public int Col1 { get; set; }
public string Col2 { get; set; }
}
ParquetSource<MyRow> source = new ParquetSource<MyRow>();
source.Uri = "Demo.parquet";
MemoryDestination<MyRow> dest = new MemoryDestination<MyRow>();
source.LinkTo(dest);
Network.Execute(source);
//All rows available in dest.Data
Mapping different column names
If you want to map the defined column names in the parquet file to a property with a different name, you can use the ParquetColumn
attribute.
The following object definition would work with the same file as in the previous example, but would map Col1
from the file with the property Id
and Col2
with the property Value
.
public class MyRow
{
[ParquetColumn(ColumnName ="Col1")]
public int Id { get; set; }
[ParquetColumn(ColumnName = "Col2")]
public string Value { get; set; }
}
Using dynamic objects
If you don’t want to use a strongly typed object, you can also use the dynamic version which uses the ExpandoObject to create objects on the fly.
ParquetSource source = new ParquetSource();
source.Uri = "Demo.parquet";
MemoryDestination dest = new MemoryDestination();
source.LinkTo(dest);
Network.Execute(source);
//All rows available in dest.Data, stored in a list of ExpandoObjects
Attributes and dynamic objects
You can also use the ParquetColumn
attribute when working with dynamic objects. This code would give you the same mapping as in the example before with the strongly typed object:
ParquetSource<MyRow> source = new ParquetSource<MyRow>();
source.Uri = "Demo.parquet";
source.ParquetColumns = new[] {
new ParquetColumn() { PropertyName = "Id", ColumnName = "Col1"},
new ParquetColumn() { PropertyName = "Value", ColumnName = "Col2"},
};
MemoryDestination dest = new MemoryDestination<MyRow>();
source.LinkTo(dest);
Network.Execute(source);
Parquet Destination
ETLBox makes writing into a parquet file very simple. The ParquetDestinatinon
will convert the incoming rows from your source into a columnar format and then store the data into the file. Internally, the parquet file specifies row groups for a set of column - by default, ETLBox will create a new RowGroup foreach batch of 1000 records. This value can be changed by setting the BatchSize
property.
Example
public class MyRow
{
public int Col1 { get; set; }
public string Col2 { get; set; }
}
var source = new MemorySource<MyRow>();
source.DataAsList.Add(new MyRow() { Col1 = 1, Col2 = "Test1" });
source.DataAsList.Add(new MyRow() { Col1 = 2, Col2 = null });
source.DataAsList.Add(new MyRow() { Col1 = 3, Col2 = "Test3"});
var dest = new ParquetDestination<MyRow>("./SimpleWithObject.parquet");
source.LinkTo(dest);
Network.Execute(source);
Mapping different column names
Like in the ParquetSource, the ParquetColumn
attribute can be used to store the columns with a different name in the file. For example, the following code snipped would store the data from the properties Id
and Value
in the file with the column names Col1
and Col2
. You can set a WriteOrder
in the attribute if you want to specify a particular order when storing the columns. In this example the column Col2
would be the first column in the parquet file.
public class MyOrderedRow
{
[ParquetColumn(ColumnName = "Col1", WriteOrder =2)]
public int Id { get; set; }
public string Clutter { get; set; }
[ParquetColumn(ColumnName = "Col2", WriteOrder = 1)]
public string Value { get; set; }
}
var source = new MemorySource<MyOrderedRow>();
source.DataAsList.Add(new MyOrderedRow() { Id = 1, Value = "Test1" });
source.DataAsList.Add(new MyOrderedRow() { Id = 2, Value = null });
source.DataAsList.Add(new MyOrderedRow() { Id = 3, Value = "Test3" });
var dest = new ParquetDestination<MyOrderedRow>("./AttributesWithOrder.parquet");
source.LinkTo(dest);
Network.Execute(source);
Using dynamic objects
Of course you can also make use of the dynamic ExpandoObject when storing data with the ParquetDestination.
var source = new MemorySource();
dynamic r1 = new ExpandoObject();
r1.Col1 = 1; r1.Col2 = "Test1";
source.DataAsList.Add(r1);
dynamic r2 = new ExpandoObject();
r2.Col1 = 2; r2.Col2 = null;
source.DataAsList.Add(r2);
var dest = new ParquetDestination("./SimpleWithDynamic.parquet");
source.LinkTo(dest);
Network.Execute(source);
Manual attributes for dynamic objects
The attribute columns can also be assigned manually (either for dynamic or strongly typed objects)
var source = new MemorySource();
dynamic r1 = new ExpandoObject();
r1.Id = 1; r1.Value= "Test1";
source.DataAsList.Add(r1);
var dest = new ParquetDestination("./ManualAttributesOnDynamicWithOrder.parquet");
dest.ParquetColumns = new[] {
new ParquetColumn() { PropertyName = "Id", ColumnName = "Col1", WriteOrder = 2},
new ParquetColumn() { PropertyName = "Value", ColumnName = "Col2"}
};
source.LinkTo(dest);
Network.Execute(source);