Polyglot Notebooks
This article was also published as a blog article 'Unlock the Power of C# in Polyglot Notebooks'
Introduction
C# and Polyglot Notebooks
C# is a powerful programming language widely used for building a variety of applications, from web apps to complex data processing systems. Polyglot notebooks, like Jupyter notebooks, allow you to write and execute code in multiple programming languages within the same document. This makes it easier to share and collaborate on projects, especially those involving data analysis, visualization, or machine learning.
Traditionally, C# hasn’t been as common in the realm of polyglot notebooks, which have been more associated with languages like Python and R. However, Microsoft has made significant improvements in recent years, making it much easier to use C# within these environments. With these advancements, we can now leverage the full power of C# for tasks that require high performance and robust tooling.
The Use-case of ETL in a Polyglot Notebook
ETL stands for Extract, Transform, Load, and it’s a process commonly used in data management to move data from one place to another, transform it into a usable format, and then load it into a database or another storage system. In polyglot notebooks, basic tasks like loading data from a CSV file are supported out-of-the-box. However, when dealing with more complex data sources such as databases or web APIs, and then transforming this data, ETL pipelines become indispensable.
We are using ETLBox, a library designed for ETL processes in .NET environments, because it allows us to write ETL code in C#. By using a polyglot notebook, we can combine the flexibility of different programming languages with the power of C# for data manipulation.
How to Create Your Own Polyglot Notebook
To create your own polyglot notebook in VS Code, follow these steps:
Prerequisites: Make sure you have Visual Studio Code (VS Code) installed on your computer. You will also need to install the Polyglot Notebooks extension to enable multi-language support within your notebook.
Press
CTRL + SHIFT + P
to open the command dialogue.Select
Polyglot Notebook: Create Default Notebook
.Choose the
.ipynb
extension and select theC#
language.Add the following code blocks to start coding in C#:
(Note the missing
;
at the end of the last line)Run the code: This should print
Hello, Polyglot Notebook!
as output.
These steps will set up a fresh new notebook where you can start exploring using Polyglot Notebooks with C#.
Setting Up a Database Table for ETL Using C# in a Polyglot Notebook
Let’s see how to set up a simple database table using C# within a polyglot notebook.
Install ETLBox.SqlServer Package
Before we begin, ensure you have the ETLBox.SqlServer package installed. This package provides the necessary tools to interact with SQL Server databases using ETLBox.
Set Up Connection Credentials
Replace the connection string in the code below with your SQL Server credentials. This example assumes you are connecting to a local SQL Server instance (
localhost
), using thesa
user with the passwordYourStrong@Passw0rd
, and targeting a database nameddemo
.Define Table Structure
We’ll define a simple table named
Test
with three columns:Id
(INT, identity column),XValue
(DATETIME), andYValue
(INT). This table will store sample data for demonstration purposes.Create the Table
Before creating the table, ensure any existing table with the same name is dropped to start fresh. This step is handled by the
DropTableTask.DropIfExists
method.Insert Sample Data
Populate the
Test
table with sample data using SQL INSERT statements.Execute the code
After executing the code, check your SQL Server database (
demo
in this example) to verify that theTest
table has been created and populated with the sample data.
Loading Data using an ETL Pipeline
Let’s see how to load data from a SQL Server database into memory and analyze it using DataFrames in a polyglot notebook environment.
Setup Environment
Ensure you have the necessary libraries installed and configured. We are using ETLBox for ETL operations, including data extraction from SQL Server, and Microsoft’s DataFrame for data analysis.
Loading Data from Database
Initialize the connection manager and define a
DbSource
to extract data from theTest
table in our SQL Server database.Transforming Data
Use a
RowTransformation
to manipulate the data in memory. In this example, we multiply theYValue
column by 1000.Multicasting Data
Utilize a
Multicast
component to split the data flow into two branches:- One branch feeds data into a
MemoryDestination
, storing data in a C# List (memDest.Data
). - The other branch stores data into a
DataFrameDestination
, creating a DataFrame (dfDest.DataFrame
) for further analysis.
- One branch feeds data into a
Running the Pipeline
Until now, we’ve defined our pipeline components. Now, let’s link them together and trigger the data flow.
Accessing Loaded Data
We stored the data loaded into the
MemoryDestination
(memDest.Data
) and the DataFrame fromDataFrameDestination
(dfDest.DataFrame
) for further analysis and manipulation in variables in our notebook.
Displaying and Analyzing Data
Let’s see how we can display and analyze the data loaded into tjhe polyglot notebook.
Displaying the DataFrame
To begin with, let’s display the contents of the DataFrame (df
). This DataFrame contains the transformed data from our ETL pipeline.
We will get an output like this:
index | Id | XValue | YValue |
---|---|---|---|
0 | 1 | 2022-01-01 00:00:00Z | 100000 |
1 | 2 | 2022-01-02 00:00:00Z | 350000 |
2 | 3 | 2022-01-03 00:00:00Z | 470000 |
3 | 4 | 2022-01-04 00:00:00Z | 134000 |
4 | 5 | 2022-01-05 00:00:00Z | 42000 |
Transforming the DataFrame
We can perform additional transformations directly on the DataFrame. In this example, we’ll create a new column x
by doubling the values in the YValue
column.
Now our output becomes this:
index | Id | XValue | YValue |
---|---|---|---|
0 | 1 | 2022-01-01 00:00:00Z | 100000 |
1 | 2 | 2022-01-02 00:00:00Z | 350000 |
2 | 3 | 2022-01-03 00:00:00Z | 470000 |
3 | 4 | 2022-01-04 00:00:00Z | 134000 |
4 | 5 | 2022-01-05 00:00:00Z | 42000 |
The Microsoft.Analysis.DataFrame
is highly effective as an in-memory structure for data analysis and querying.
You can learn more about it here in the Microsft Documentation: Getting started with DataFrames
Visualizing Data with ScottPlot
Next, we’ll visualize the loaded data using ScottPlot, a plotting library for .NET.
This will print out our data as a plot:
Final Thoughts
Using C# in polyglot notebooks allows data professionals to work without needing to learn a new programming language. Polyglot notebooks enable users to use their existing knowledge about C# and the .NET framework in one environment, enhanced by the power of Jupyter notebooks.
By default, loading data using existing data frameworks only support loading basic files (like CSV). This makes the inclusion of ETL processes in notebooks essential for managing more complex data tasks. ETLBox, tailored for C#, stands out as the best framework designed specifically for this purpose.
Together, C#, Polyglot notebooks and ETLBox simplify data processing, supporting data professional to work more efficiently and make better decisions based on their data.
Code on Github
The notebook code for this example is available on GitHub for further exploration and contribution. .