How to Build a Data Pipeline on Azure Data Factory With Source Control Integrations

Ifeanyi Ibem
8 min read · May 20, 2023


To be a successful data engineer, you need hands-on experience exploring data and building, implementing, and maintaining secure, efficient data processing pipelines with a range of tools and techniques. So what does a data engineer actually do? As a data engineer, you are accountable for data-oriented implementation tasks: ingesting and transforming data, and finding bottlenecks in the performance and security of your data processes.

ADF architecture

In this article, I highlight a straightforward way to automate a data project in Azure, illustrated through a simple project. The aim is to build a data flow pipeline in Azure Data Factory that extracts data from an Azure SQL Database table and loads the extracted data into a new Azure SQL table and into Azure Blob Storage in CSV format. After that, we set up a Continuous Integration/Continuous Delivery (CI/CD) process using Git source control and push our pipeline to GitHub.

What are the prerequisites for this project?

In this project, you need to have a basic understanding of the following technologies:

✔ Azure subscription

✔ Azure SQL database

✔ Azure storage account

✔ Azure data factory

✔ Source Control (Git/GitHub)

✔ Azure Subscription

✅ You can build on Azure for free for 30 days: create a free account and get a $200 credit.

✔ Set up your Azure SQL Database

✅ Create a single database in Azure SQL Database using either the Azure portal, a PowerShell script, or an Azure CLI script.

Follow these steps in this documentation to create a single database in Azure SQL.

Please note the following when creating the Azure SQL database:

✅ On the Firewall rules, set Add current client IP address to Yes. This allows your computer's IP address to access the SQL database.

✅ For this project, on the Additional settings tab, in the Data source section, set Use existing data to Sample. This creates an AdventureWorksLT sample database, so there are tables and data to query and experiment with instead of an empty, blank database. See the screenshot below:

SQL Database

Once your database is created, you can use the Query editor (preview) in the Azure portal to connect to the database and query data.
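
For example, assuming you provisioned the AdventureWorksLT sample as described above, a quick illustrative query in the Query editor confirms the sample tables are in place (this is only a sanity check, not part of the pipeline):

-- List the tables that ship with the AdventureWorksLT sample
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'SalesLT'
ORDER BY TABLE_NAME;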

✔ Azure Storage account

We need to create a container in the Azure storage account where ADF will load the data extracted from the SQL database.

An Azure storage account contains all of your Azure Storage data objects: blobs, files, queues, and tables. The storage account provides a unique namespace for your Azure Storage data accessible from anywhere in the world over HTTP or HTTPS.

Here, we need to:

  1. Create a storage account. Follow these steps here.
  2. Navigate to the storage account you just created. In the left menu, scroll to the Data storage section, then select Containers. Follow these steps here to create a container.

You can name the container “raw” or choose any other name. A sample of the created container is shown below:

✔ Azure Data Factory

For this step, you need an Azure subscription. If you don’t have an Azure subscription, create a free Azure account before you begin.

Follow these steps to create the Data Factory project:

✅ First, create a data factory (the Azure documentation quickstart walks through this), open Azure Data Factory Studio, and create a new data flow. For this project, the data flow is named SQL Dataflow.

✅ On the SQL Dataflow, click Add Source and set up the data source you want to ingest data from. In this project, our data source is Azure SQL Database. Before you proceed with the activity, you must create a dataset for your source activity (data source): click the New dataset option in the Source settings of the source activity. The dataset is what references the location and structure of your data within a data store, as shown below.

ADF dataflow source

On the new dataset page, select Azure SQL Database (for this project). After selecting the data source, you need to create a linked service for the dataset if none exists. A linked service in Azure Data Factory defines the connection information between the data factory and the data source.

Check this reference on how to create a linked service in Azure Data Factory.

For this project, we connected to the Product table in the Azure SQL database created earlier.
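
Before building the transformation, it can help to preview the source columns the data flow will read, including the date columns we are about to reformat. A purely illustrative query you could run in the Query editor:

-- Preview the source columns, including the dates the data flow will reformat
SELECT TOP (5)
    ProductID, Name, ProductNumber, ListPrice,
    SellStartDate, SellEndDate, DiscontinuedDate, ModifiedDate
FROM SalesLT.Product;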

  • The next activity to add to the data flow is a Derived Column. After connecting the dataset, we transform the date columns, changing SellStartDate, SellEndDate, DiscontinuedDate, and ModifiedDate of the Product table to a different format before loading them into the destination storage (a SQL sketch of this format change appears after the note below).

Note: at this step we also need to add a new branch, because we want to load the data into two destinations (Blob storage and an Azure SQL table). The derived column on each branch is configured the same way.
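
Inside the data flow, the derived column is authored in the Data Flow expression language, but as a rough sketch of what the transformation is intended to do, the T-SQL below shows an equivalent reformat of the date columns. The yyyy-MM-dd target format is an assumption for illustration; use whatever format your project needs.

-- Illustration only: the reshaping the derived column performs, expressed in T-SQL
-- (the 'yyyy-MM-dd' target format is an assumed example)
SELECT
    ProductID,
    FORMAT(SellStartDate, 'yyyy-MM-dd')    AS SellStartDate,
    FORMAT(SellEndDate, 'yyyy-MM-dd')      AS SellEndDate,
    FORMAT(DiscontinuedDate, 'yyyy-MM-dd') AS DiscontinuedDate,
    FORMAT(ModifiedDate, 'yyyy-MM-dd')     AS ModifiedDate
FROM SalesLT.Product;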

You can call the activity derivedColumn1 — as shown below:

dataflow derived activity
  • After the derivedColumn1 activity above, click the + sign next to the derived column activity and add a Sink activity. The sink is the destination into which we want to load the source data.

Dataflow sink activity

Because we want to load the data into both Blob storage and Azure SQL, we added one sink on each of the branches created above.

Before that, we need to create a new table in the SQL database. The table will be called ProductArchive and will have the same schema as the Product table used as the source in the data flow.

To minimize the effort of creating this ProductArchive table from scratch, you can run the SQL query below from the Azure SQL query editor:


SELECT
    ProductID AS ProductArchiveID,
    Name,
    ProductNumber,
    Color,
    StandardCost,
    ListPrice,
    Size,
    Weight,
    ProductCategoryID,
    ProductModelID,
    SellStartDate,
    SellEndDate,
    DiscontinuedDate,
    ThumbnailPhotoFileName,
    rowguid,
    ModifiedDate
INTO
    dbo.ProductArchive
FROM
    [SalesLT].[Product]
WHERE
    1 = 2; -- always false, so only the schema is copied and no rows are inserted
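
As a quick sanity check (optional, and purely illustrative), you can confirm that the table was created with the expected columns and no rows before wiring up the sink:

-- This should return 0, since the SELECT ... INTO above created an empty table
SELECT COUNT(*) AS RowsInArchive FROM dbo.ProductArchive;

-- Confirm the column names the sink mapping will need to match
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'ProductArchive'
ORDER BY ORDINAL_POSITION;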

Similarly, you must create a dataset for your sink activity (destination storage) before you can load data. To do this:

✅ On the first sink activity, click the New dataset option in the Sink settings. For this project, we connected it to the ProductArchive table in the Azure SQL database created earlier.

✅ On the second sink activity, click the New dataset option in the Sink settings. For this project, we connected it to Blob storage and selected the raw container created earlier. An example is shown in the screenshot below:

ADF sink activity

Check this reference on how to create a linked service in Azure Data Factory.

If you follow all the data flow steps, the complete data flow should look like the image below:

ADF dataflow

Create a pipeline

In this step, we add the SQL Dataflow created above to a data pipeline. To do this:

  • Create a new pipeline called SQL Dataflow Pipeline.
  • On the SQL Dataflow Pipeline, add a Data Flow activity to the pipeline canvas.
  • Under Settings in the Data Flow activity, select the SQL Dataflow created above.

ADF pipeline

Run/Trigger The Pipeline

Pipeline runs typically depend on the parameters you define in the pipeline. To manually trigger a pipeline, or to configure a new schedule, tumbling window, storage event, or custom event trigger, select Add trigger at the top of the pipeline editor. To run the pipeline immediately, click Trigger Now.

run pipeline

If you trigger the pipeline manually, it executes immediately. If you choose New/Edit instead, the Add triggers window opens and you can either pick an existing trigger to edit or create a new one.

Reference — https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers

Monitor a Data Flow activity

After the pipeline is triggered, we can monitor the pipeline runs, which show whether the pipeline succeeded or failed to extract, transform, or load the data.

To do this, open the monitoring experience by selecting the Monitor & Manage tile in the data factory blade of the Azure portal. If you are already in the ADF UX, click the Monitor icon on the left sidebar, as shown in the image below.

ADF monitor pipeline

Reference — https://learn.microsoft.com/en-us/azure/data-factory/monitor-visually

Output

If your pipeline ran successfully, you should see runs like those in the image above. You can then check your destination data stores to confirm that the data from the pipeline was loaded into the Azure SQL table and the storage account container.
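
For the Azure SQL sink, a quick illustrative query against the ProductArchive table confirms that rows landed:

-- Confirm the pipeline loaded rows into the SQL sink
SELECT COUNT(*) AS RowsLoaded FROM dbo.ProductArchive;

-- Spot-check a few of the reformatted date values
SELECT TOP (5) ProductArchiveID, Name, SellStartDate, ModifiedDate
FROM dbo.ProductArchive;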

An example of the data loaded into the storage container for this project is shown below:

ADF Pipeline container output

Depending on your next action, you can choose to download the CSV data from the container for visualization with your team ☺

✔ Source Control (Git) Integration

By default, the Azure Data Factory user interface experience (UX) authors directly against the data factory service. This comes with limitations; for example, the service is not fully optimized for collaboration and version control.

To provide a better authoring experience, Azure Data Factory allows you to configure a Git repository with either Azure Repos or GitHub. Git is a version control system that allows for easier change tracking and collaboration.

As a best practice for ADF CI/CD, it is recommended to configure only your development data factory with Git integration. Changes to test and production environments are deployed via CI/CD and do not need Git integration.

There are four different ways to connect a Git repository to your data factory, for both Azure Repos and GitHub. After you connect a Git repository, you can view and manage the configuration in the management hub, under Git configuration in the Source control section.

In this project, the GitHub repository was configured from the Manage page in the ADF UX. To do this, go to the Manage hub in the ADF UX and select Git configuration in the Source control section. If you have no repository connected, click Configure, as shown below:

ADF Git Integration

Configuring a Git repository lets you save changes as you work and publish only after you have tested your changes to your satisfaction.

You can check the Git repository for this project on my GitHub — https://github.com/sanctitygeorge/Azure-Data-Factory-Pipeline

References:

https://learn.microsoft.com/en-us/azure/data-factory/source-control

https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery
