Welcome to this Beginner’s Guide to Azure Data Factory! In this series, I’m going to cover the fundamentals of Azure Data Factory in casual, bite-sized blog posts that you can read through at your own pace and reference later. You may not be new to ETL, data integration, Azure, or SQL, but we’re going to start completely from scratch when it comes to Azure Data Factory.
How do you get started building data pipelines? What if you need to transform or re-shape data? How do you schedule and monitor your data pipelines? Can you make your solution dynamic and reusable? Join me in this Beginner’s Guide to Azure Data Factory to learn all of these things - and maybe more. 🤓 Let’s go!
P.S. This series will always be a work-in-progress. Yes, always. Azure changes often, so I keep coming back to tweak, update, and improve content. I just might not be able to do it right away!
In the previous post, we used the Copy Data Tool to copy a file from our demo dataset to our data lake. The Copy Data Tool created all the factory resources for us: pipelines, activities, datasets, and linked services.
In this post, we will go through pipelines in more detail. How do we create and organize them? What are their main properties? Can we edit them without using the graphical user interface?
How do I create pipelines?
So far, we have created a pipeline by using the Copy Data Tool. There are several other ways to create a pipeline.
On the Home page, click on the New → Pipeline dropdown menu, or click on the Orchestrate shortcut tile.
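As for editing pipelines without the graphical user interface: behind the scenes, a pipeline is stored as a JSON definition. Here is a minimal sketch of that shape, built in Python so it is easy to print and poke at. The pipeline name and description are made up for illustration, and the activities list is left empty on purpose. Treat it as a rough outline rather than exactly what the service generates.

```python
import json

# Hypothetical pipeline name and description, made up for this sketch.
pipeline = {
    "name": "CopyDemoFile",
    "properties": {
        "description": "Copies a demo file to the data lake",
        "activities": [],   # activities (for example a Copy activity) go here
        "annotations": [],
    },
}

# Serialize to the JSON text you would see in a code view.
pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Building the definition as a plain dictionary keeps the sketch independent of any SDK, so nothing here touches a real factory.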
In the previous post, we went through Azure Data Factory pipelines in more detail. In this post, we will dig into the copy data activity. How does it work? How do you configure the settings? And how can you optimize performance while keeping costs down?
Copy Data Activity
The copy data activity is the core (*) activity in Azure Data Factory.
(*Cathrine’s opinion 🤓)
You can copy data to and from more than 90 Software-as-a-Service (SaaS) applications (such as Dynamics 365 and Salesforce), on-premises data stores (such as SQL Server and Oracle), and cloud data stores (such as Azure SQL Database and Amazon S3). During copying, you can define and map columns implicitly or explicitly, convert file formats, and even zip and unzip files – all in one task.
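To make "map columns explicitly" a little more concrete, here is a rough sketch of what a copy data activity with an explicit column mapping can look like as JSON, again built in Python. I'm assuming a delimited text source and an Azure SQL sink. The activity name, dataset names, column names, and the `explicit_mapping` helper are all hypothetical, and a real activity definition carries more properties than shown here.

```python
import json


def explicit_mapping(pairs):
    """Build a translator section from (source_column, sink_column) pairs.

    Hypothetical helper for this sketch, not part of any SDK.
    """
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": src}, "sink": {"name": dst}}
            for src, dst in pairs
        ],
    }


# All names below (activity, datasets, columns) are made up for illustration.
copy_activity = {
    "name": "CopyCustomers",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceCsv", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SinkSqlTable", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "AzureSqlSink"},
        "translator": explicit_mapping(
            [("id", "CustomerId"), ("full_name", "CustomerName")]
        ),
    },
}

print(json.dumps(copy_activity, indent=2))
```

If you leave out the translator section entirely, you are relying on implicit mapping instead, where columns are matched up for you.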
In the previous post, we looked at the copy data activity and saw how the source and sink properties changed with the datasets used. In this post, we will take a closer look at some common datasets and their properties.
Let’s start with the source and sink datasets we created in the Copy Data Tool!
Dataset Names
First, a quick note. If you use the Copy Data Tool, you can change the dataset names by clicking the edit button on the summary page…
In the previous post, we looked at datasets and their properties. In this post, we will look at linked services in more detail. How do you configure them? What are the authentication options for Azure services? And how do you securely store your credentials?
Let’s start by creating a linked service to an Azure SQL Database. Yep, that linked service you saw screenshots of in the previous post. Mhm, the one I sneakily created already so I could explain using datasets as a bridge to linked services. That one 😅
Pssst! Linked services have been moved to the management page. I’m working on updating the descriptions and screenshots. Thank you for your understanding and patience 😊
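On the question of storing credentials securely, one option is to keep the secret in Azure Key Vault and have the linked service reference it instead of holding the password itself. Below is a minimal sketch of that idea as a linked service definition for an Azure SQL Database. The linked service names, the secret name, and the placeholders in the connection string are all assumptions for illustration, not values from a real factory.

```python
import json

# Hypothetical names and placeholders; nothing here points at a real resource.
linked_service = {
    "name": "AzureSqlDatabaseDemo",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            # The connection string carries no password...
            "connectionString": (
                "Server=tcp:<server>.database.windows.net,1433;"
                "Database=<database>;User ID=<user>;"
            ),
            # ...because the password is looked up in Key Vault at runtime.
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "KeyVaultDemo",
                    "type": "LinkedServiceReference",
                },
                "secretName": "sql-password",
            },
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

The Key Vault itself is referenced through its own linked service (here the made-up `KeyVaultDemo`), so the secret value never appears in the definition.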
So far in this Azure Data Factory series, we have looked at copying data. We have created pipelines, copy data activities, datasets, and linked services. In this post, we will peek at the second part of the data integration story: using data flows for transforming data.
But first, I need to make a confession. And it’s slightly embarrassing…
I don’t use data flows enough to keep up with all the changes and new features 😳
Don’t get me wrong. I want to! I really, really, really want to. But since I don’t currently use data flows on a daily basis, I struggle to find time to sit down and dig into all the cool new things.
So! In this blog post, I will mostly scratch the surface of data flows, then refer to awesome people with excellent resources so you can learn all the details from them.