Parquet is an open source file format that stores data in a columnar layout (in contrast to, for example, CSV files, where data is stored row by row). While column-based storage has many advantages regarding efficiency and storage size, parquet data needs to be read as rows from the files in order to work with ETLBox.
The ParquetSource will read all data from the parquet file as rows - starting with the first row until the end of the file (or a defined limit). Internally, the columnar format is translated into rows while reading data.
Note
All streaming connectors share a set of common properties. For example, instead of reading from or writing to a file, you can set ResourceType to ResourceType.Http or ResourceType.AzureBlob in order to read from or write to a web endpoint or an Azure blob. See shared functionalities for a list of all properties shared between the streaming connectors.
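As a rough sketch of the note above, switching a ParquetSource from a local file to a web endpoint might look like this. The URL, the row type MyRow, the Uri property, and the namespaces are assumptions made for illustration; check them against the shared properties of your ETLBox version.

```csharp
// Namespaces are an assumption and differ between ETLBox versions.
using ETLBox;
using ETLBox.DataFlow;
using ETLBox.Parquet;

// Hypothetical row type matching the columns of the remote file.
public class MyRow
{
    public int Col1 { get; set; }
    public string Col2 { get; set; }
}

public static class HttpReadSketch
{
    public static ParquetSource<MyRow> CreateSource()
    {
        var source = new ParquetSource<MyRow>();
        // Read from a web endpoint instead of the local file system.
        source.ResourceType = ResourceType.Http;
        // Hypothetical URL, used only for illustration.
        source.Uri = "https://example.com/data/Demo.parquet";
        return source;
    }
}
```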
Let’s assume we have a parquet file “Demo.parquet” that has two columns, named Col1 and Col2. Here is a simple example that reads all data from this parquet file and writes it into an in-memory list.
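A minimal sketch of such a read could look like the following. The class name MyRow, the property types (int and string), and the namespaces are assumptions; the property names are chosen to match the column names in the file.

```csharp
using System;
// Namespaces are an assumption and differ between ETLBox versions.
using ETLBox;
using ETLBox.DataFlow;
using ETLBox.Parquet;

// Hypothetical row type: property names match the parquet column names.
// The column types are assumed, since the file schema is not given.
public class MyRow
{
    public int Col1 { get; set; }
    public string Col2 { get; set; }
}

public static class ReadSketch
{
    public static void Run()
    {
        // Reads the columnar file and emits one MyRow per row.
        var source = new ParquetSource<MyRow>("Demo.parquet");
        // Collects all incoming rows in memory.
        var dest = new MemoryDestination<MyRow>();

        source.LinkTo(dest);
        Network.Execute(source);

        foreach (var row in dest.Data)
            Console.WriteLine($"{row.Col1}: {row.Col2}");
    }
}
```

Because the whole file ends up in the in-memory list, this pattern suits small files or tests; for larger files, link the source to a transformation or a streaming destination instead.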
If you want to map the defined column names in the parquet file to a property with a different name, you can use the ParquetColumn attribute.
The following object definition would work with the same file as in the previous example, but would map Col1 from the file to the property Id and Col2 to the property Value.
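Such an object definition might be sketched as follows, assuming the ParquetColumn attribute takes the column name from the file as its first argument. The class name and the property types are again assumptions.

```csharp
// Namespace is an assumption and differs between ETLBox versions.
using ETLBox.Parquet;

// Hypothetical row type whose property names differ from the file's column names.
public class MyMappedRow
{
    // Column "Col1" in the file is read into the property Id.
    [ParquetColumn("Col1")]
    public int Id { get; set; }

    // Column "Col2" in the file is read into the property Value.
    [ParquetColumn("Col2")]
    public string Value { get; set; }
}
```

The type would then be used as the generic argument of the source, for example `new ParquetSource<MyMappedRow>("Demo.parquet")`.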
You can also use the ParquetColumn attribute when working with dynamic objects. This code would give you the same mapping as in the example before with the strongly typed object:
ETLBox makes writing into a parquet file very simple. The ParquetDestination will convert the incoming rows from your source into a columnar format and then store the data in the file. Internally, a parquet file is organized into row groups, each holding the column data for a set of rows. By default, ETLBox will create a new row group for each batch of 1000 records. This value can be changed by setting the BatchSize property.
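As an illustration, a write with a smaller batch size might be sketched like this. The file name Output.parquet, the row type, the sample data, and the use of a MemorySource as input are assumptions chosen to keep the example self-contained.

```csharp
using System.Collections.Generic;
// Namespaces are an assumption and differ between ETLBox versions.
using ETLBox;
using ETLBox.DataFlow;
using ETLBox.Parquet;

// Hypothetical row type; property names become the column names in the file.
public class MyRow
{
    public int Col1 { get; set; }
    public string Col2 { get; set; }
}

public static class WriteSketch
{
    public static void Run()
    {
        // In-memory input, used here only to have some rows to write.
        var source = new MemorySource<MyRow>();
        source.DataAsList = new List<MyRow>
        {
            new MyRow { Col1 = 1, Col2 = "Test1" },
            new MyRow { Col1 = 2, Col2 = "Test2" }
        };

        var dest = new ParquetDestination<MyRow>("Output.parquet");
        // Start a new row group after every 500 rows instead of the default 1000.
        dest.BatchSize = 500;

        source.LinkTo(dest);
        Network.Execute(source);
    }
}
```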
As with the ParquetSource, the ParquetColumn attribute can be used to store the columns under a different name in the file. For example, the following code snippet would store the data from the properties Id and Value in the file with the column names Col1 and Col2. You can set a WriteOrder in the attribute if you want to specify a particular order when storing the columns. In this example, the column Col2 would be the first column in the parquet file.
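A sketch of such a definition is shown below. It assumes that the attribute takes the column name as its first argument and that a lower WriteOrder value places a column earlier in the file; the class name, the property types, and the output file name are hypothetical.

```csharp
// Namespaces are an assumption and differ between ETLBox versions.
using ETLBox;
using ETLBox.DataFlow;
using ETLBox.Parquet;

// Hypothetical row type written with renamed and reordered columns.
public class MyOrderedRow
{
    // Stored as "Col1"; the higher WriteOrder puts it after Col2.
    [ParquetColumn("Col1", WriteOrder = 2)]
    public int Id { get; set; }

    // Stored as "Col2"; the lower WriteOrder makes it the first column.
    [ParquetColumn("Col2", WriteOrder = 1)]
    public string Value { get; set; }
}

public static class OrderedWriteSketch
{
    public static ParquetDestination<MyOrderedRow> CreateDestination()
    {
        // Link any source of MyOrderedRow to this destination to write the file.
        return new ParquetDestination<MyOrderedRow>("Output.parquet");
    }
}
```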