This demo will scrape data from a website and sent it into a data flow using a custom source.
The raw data will be stored in a SQLite database. Also, the raw data is loaded into a in-memory list object, and then some base information about the scraped data is displayed. We will also aggregate the data and store the result in a csv file.
We will use the ASN Aviation Safety Database
to retrieve data about tracked airline accidents. The database is a web site that contains information airline accidents that happened since 1919. We will query this database by querying the website for each year. For demonstration purposes, we will only query the years between 1920 and 1940.
We define a class that can hold all data for an accidents:
We also want to aggregate this data. We are interest in the total of all Fatalaties per year.
As we are using the Aggregation, we can create a class where we define our aggregation actions via the AggregateColumn and GroupColumn attributes.
The demo already comes with file named SQLite.db - this is an empty SQLite database file. We can now use the SQLite connection manager to create a table in this database:
Before we start, we need a function that returns us all accidents that we scraped from the website for a particular year. E.g. we want to get a list of all accidents for 1920 for the url https://aviation-safety.net/database/dblist.php?Year=1920
We will use the Html Agilitiy Pack
to retrieve the data in very simple way.
Our scraping code for the url above would look like this:
Next we want to use the scraped accident data and sent it into a data flow. We can use the CustomBatchSource to sent a list of any size into a data flow.
The scraper that we created above has a flaw: Some records don’t have a valid year, the year contains the number 1. We can filter them out using the FilterTransformation.
Note
There are other ways of handling flawed data. You can use Exceptions and the LinkErrorTo() to redirect exceptional data into different destinations.
Now we are ready to “duplicate” our data.
We want to store the raw data in the SQLite table that we created. Also, we want to store the data in a in-memory list object, and then display some basic information about this data. Finally, we want to aggregate the data to Fatalities/Year and then store this information in a csv file.