Data lake
This example shows how data from SQL Server is transferred into Azure Blob Storage. It shows how to set up an Azure Function that runs periodically and moves freshly created data rows from the database into a blob container.
Prerequisites
This example requires a SQL Server database and either the Azure Storage Emulator or an Azure Storage account. We recommend Visual Studio to open this example.
Demo setup
Let’s assume we have a table called `orders` in our database. Here is an example of what this table looks like:
Id | Number | Details | Date |
---|---|---|---|
1 | OD1 | T-Shirt | 2022-01-01 12:13:00 |
2 | OD2 | Jeans | 2022-01-01 16:35:00 |
3 | OD3 | Socks | 2022-01-02 15:01:00 |
4 | OD4 | Bag | 2022-01-02 13:23:00 |
5 | OD5 | Jeans | 2022-01-03 10:07:00 |
The table contains 5 rows, and each row represents an order. The goal is to transfer this data into an Azure Blob container and store it in JSON format. Because our table can grow over time, we want to split the data into multiple files - in our demo, we partition it by day.
So for the present data rows, we would like to have 3 JSON files created in our blob container. E.g. the first file (`OrderData_2022-01-01.json`) should contain the first two entries as JSON, like this:
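A sketch of the expected file content; the exact property names depend on how the demo serializes the rows, here they are assumed to match the table columns:

```json
[
  {
    "Id": 1,
    "Number": "OD1",
    "Details": "T-Shirt",
    "Date": "2022-01-01T12:13:00"
  },
  {
    "Id": 2,
    "Number": "OD2",
    "Details": "Jeans",
    "Date": "2022-01-01T16:35:00"
  }
]
```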
Additionally, we would like to deploy everything as an Azure Function. The function should run periodically (e.g. every night) and only transfer newly created records from the `orders` table.
Azure Function
Azure Functions are a great way to deploy functionality in the cloud at comparably low cost. In our demo, we are going to use an Azure Function that is triggered by a timer every minute. Visual Studio has great templates to create a timer-triggered Azure Function from scratch.
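The generated function looks roughly like this; function and class names are placeholders, and the CRON expression `0 */1 * * * *` fires every minute:

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class DataLakeDemo
{
    // Timer-triggered function - "0 */1 * * * *" runs it every minute.
    // For a nightly sync you could use e.g. "0 0 3 * * *" (every day at 3 AM).
    [FunctionName("TransferOrders")]
    public static void Run([TimerTrigger("0 */1 * * * *")] TimerInfo timer, ILogger log)
    {
        log.LogInformation($"Order transfer started at: {DateTime.Now}");

        // The ETLBox data flow described in the following sections goes here.
    }
}
```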
Note
In `Microsoft.NET.Sdk.Functions` versions `3.0.4` and above, a new feature was introduced that cleans the compiled output of unnecessary DLLs in order to trim down the assemblies. This feature cleans the output too aggressively and will result in exceptions when referencing and using ETLBox packages. You need to either downgrade to `Microsoft.NET.Sdk.Functions` `3.0.3` or add `<_FunctionsSkipCleanOutput>true</_FunctionsSkipCleanOutput>` to your project file.
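In the project file, the property goes into a `PropertyGroup`, for example (target framework and functions version are just illustrative):

```xml
<PropertyGroup>
  <!-- Example values - use the framework and functions version of your project -->
  <TargetFramework>netcoreapp3.1</TargetFramework>
  <AzureFunctionsVersion>v3</AzureFunctionsVersion>
  <_FunctionsSkipCleanOutput>true</_FunctionsSkipCleanOutput>
</PropertyGroup>
```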
Initialization
First of all, we can connect the ETLBox ILogger to the Azure log.
Next, we set up our database connection. In Azure, we can access the configuration parameters via environment variables. Also, we will need an `SqlConnectionManager` to connect with our local SQL Server.
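A sketch of this initialization; the setting name `SqlServerConnectionString` and the ETLBox 2.x namespace are assumptions, and the logger hookup mentioned above is omitted here because the exact ETLBox API member differs between versions:

```csharp
using System;
using ETLBox.Connection;   // ETLBox 2.x namespace - newer versions may differ

// Read the connection string from the function app's configuration.
// In Azure, configuration values are exposed as environment variables;
// locally they come from local.settings.json (see below).
// "SqlServerConnectionString" is a hypothetical setting name.
string connectionString =
    Environment.GetEnvironmentVariable("SqlServerConnectionString");

// Connection manager used by all ETLBox components that talk to SQL Server.
var sqlConnectionManager = new SqlConnectionManager(connectionString);
```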
For your local development, there should be a `local.settings.json` file which contains the environment settings used when debugging and testing locally.
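A minimal `local.settings.json` for this demo could look like this; `SqlServerConnectionString` is the hypothetical setting used above, and `UseDevelopmentStorage=true` points to the local storage emulator:

```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "dotnet",
    "SqlServerConnectionString": "Data Source=localhost;Initial Catalog=demo;Integrated Security=true;"
  }
}
```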
Delta load
Before we can start to transfer the data, we need to identify all records that we haven’t transferred yet. An easy solution is to simply memorize the highest `Id` value that we transferred into the blob storage, assuming that the id numbers are always incremented.
For the purpose of the demonstration, we can create a small data flow that reads the highest id value from our database and stores it in a local JSON file.
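A sketch of such a helper flow, using ETLBox's `DbSource` and `JsonDestination`; the `LastId` class, the file name `LastTransferredId.json`, and the namespaces are assumptions of this sketch (the demo on GitHub may organize this differently):

```csharp
using ETLBox.DataFlow.Connectors;   // ETLBox 2.x namespace - newer versions may differ

// Holds the highest Id that has already been transferred (hypothetical type).
public class LastId
{
    public int LastTransferredId { get; set; }
}

// Read the current maximum Id from the orders table ...
var maxIdSource = new DbSource<LastId>(sqlConnectionManager)
{
    Sql = "SELECT ISNULL(MAX(Id), 0) AS LastTransferredId FROM orders"
};

// ... and write it into a local JSON file (hypothetical file name).
var maxIdDestination = new JsonDestination<LastId>("LastTransferredId.json");

maxIdSource.LinkTo(maxIdDestination);
maxIdSource.Execute();
maxIdDestination.Wait();
```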
Setting up the data flow
Now we can set up a simple data flow that reads the last used id value from the file and then transfers the not-yet-synced data to Azure.
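A sketch of this part of the flow, under the same assumptions as above: the last id is read back with a `JsonSource`, and the transferred rows are collected in a `MemoryDestination` so that they can be grouped by day and uploaded once the flow has finished (the GitHub demo may use a different destination for the blob upload):

```csharp
using System.Linq;

// Read the last transferred id back from the local JSON file.
var lastIdSource = new JsonSource<LastId>("LastTransferredId.json");
var lastIdBuffer = new MemoryDestination<LastId>();
lastIdSource.LinkTo(lastIdBuffer);
lastIdSource.Execute();
lastIdBuffer.Wait();
int lastId = lastIdBuffer.Data.FirstOrDefault()?.LastTransferredId ?? 0;

// Source: only rows that have not been transferred yet.
// lastId is an int coming from our own file, so string interpolation is safe here.
var source = new DbSource<Order>(sqlConnectionManager)
{
    Sql = $"SELECT Id, Number, Details, Date FROM orders WHERE Id > {lastId}"
};

// Destination: collect the rows in memory; they are grouped by day and
// uploaded to the blob container after the flow has completed (see below).
var destination = new MemoryDestination<Order>();
```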
In order to store our data in an object, we also need to create a strongly typed class.
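The `Order` class used above simply mirrors the columns of the `orders` table; ETLBox maps columns to properties by name:

```csharp
using System;

public class Order
{
    // Property names match the column names of the orders table.
    public int Id { get; set; }
    public string Number { get; set; }
    public string Details { get; set; }
    public DateTime Date { get; set; }
}
```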
Using Aggregation to get Max(Id)
That data flow configuration is already sufficient to do an initial load. But after we have transferred these records into the Azure blob, how can we retrieve the highest id value? One solution would be to use a `RowTransformation` that always checks the id value of the current row and compares it with the current value stored in a variable. Or we can use the `Aggregation`, which behaves similarly. So we can split our data flow before we write the data, and retrieve the max id value from the aggregation output.
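A sketch of that split, using a `Multicast` and assuming the attribute-based configuration of ETLBox's `Aggregation` (if the attribute signature differs in your ETLBox version, the aggregation can also be configured with an aggregation action):

```csharp
using ETLBox.DataFlow.Transformations;   // ETLBox 2.x namespace - newer versions may differ

// Output row of the aggregation - holds the highest Id of the current run.
public class MaxIdResult
{
    // Assumed attribute syntax: aggregate the input property "Id" with Max.
    [AggregateColumn("Id", AggregationMethod.Max)]
    public int MaxId { get; set; }
}

// Multicast duplicates every row: one copy goes to the destination,
// the other copy feeds the aggregation.
var multicast = new Multicast<Order>();
var aggregation = new Aggregation<Order, MaxIdResult>();
var maxIdBuffer = new MemoryDestination<MaxIdResult>();
```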
Now we can link the components and execute the network. After this, we need to store the max id value from our input data in our file, so that the next run can read the highest id value and continue to transfer only new data.
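Putting it together: linking and executing the network, uploading the grouped rows with the `Azure.Storage.Blobs` SDK, and persisting the new max id for the next run. The container name, the file name, and the reuse of the `AzureWebJobsStorage` connection string are assumptions of this sketch:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
using Azure.Storage.Blobs;

// Wire up the network: source -> multicast -> (blob path, aggregation path).
source.LinkTo(multicast);
multicast.LinkTo(destination);
multicast.LinkTo(aggregation);
aggregation.LinkTo(maxIdBuffer);

source.Execute();
destination.Wait();
maxIdBuffer.Wait();

// Group the transferred rows by day and upload one JSON file per day.
// We reuse the AzureWebJobsStorage connection string and a container
// named "orderdata" - both are assumptions of this sketch.
var container = new BlobContainerClient(
    Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "orderdata");
container.CreateIfNotExists();
foreach (var day in destination.Data.GroupBy(o => o.Date.Date))
{
    string json = JsonSerializer.Serialize(day.ToList());
    var blob = container.GetBlobClient($"OrderData_{day.Key:yyyy-MM-dd}.json");
    using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));
    blob.Upload(stream, overwrite: true);
}

// Persist the new max id (written as a JSON array so the JsonSource above
// can read it back) so that the next run only picks up newer rows.
int newMaxId = maxIdBuffer.Data.Select(r => r.MaxId).DefaultIfEmpty(lastId).Max();
File.WriteAllText("LastTransferredId.json",
    JsonSerializer.Serialize(new[] { new LastId { LastTransferredId = newMaxId } }));
```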
Code on GitHub
You can find this demo on GitHub. The example there contains additional code which helps to create the demo data. Also, you’ll find a small helper task that creates new data and inserts it into the database.
The whole code for this example is available on GitHub