
Data Integration and Orchestration with Microsoft Fabric

Data integration and orchestration are critical in modern analytics, and getting them wrong quickly leads to inefficiencies. In this article, we design, implement, and monitor seamless data workflows using Microsoft Fabric.

Types of Integration Architectures in Fabric

Batch Processing

Data is processed in large chunks; this approach is used when data freshness is not a primary concern.

Fabric supports batch processing through Dataflow Gen2 and Data Pipelines.

Real-Time Streaming

Data ingestion happens as soon as it arrives.

Fabric supports real-time streaming using Event Triggers, Dataflow Gen2 with incremental updates and Real-Time Hub.
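To make this concrete, here is a minimal sketch of pushing events from Python toward an Eventstream. It assumes you have already created an Eventstream with a custom endpoint source and copied its Event Hub-compatible connection string; the connection string, entity name, and payload below are placeholders, so treat this as an illustration rather than a recipe.

```python
# Minimal sketch: send a JSON event to a Fabric Eventstream custom endpoint.
# Assumes the Eventstream exposes an Event Hub-compatible connection string;
# CONNECTION_STR, EVENTHUB_NAME, and the payload are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

CONNECTION_STR = "<event-hub-compatible-connection-string>"
EVENTHUB_NAME = "<eventstream-entity-name>"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

event = {"orderId": 12345, "status": "shipped"}  # hypothetical payload

with producer:
    batch = producer.create_batch()          # batch events for efficiency
    batch.add(EventData(json.dumps(event)))  # serialize the payload as JSON
    producer.send_batch(batch)
```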

Hybrid Integration

Data from on-prem and cloud-based sources is combined into one workflow.

Fabric supports hybrid integrations using Connectors, Data Gateways and Lakehouse.

On a tangent: Azure Data Factory vs Fabric Data Factory

Azure Data Factory and Fabric Data Factory both serve the same purpose: moving data between a source and a destination. They both offer:

  • ELT/ETL Capabilities
  • Many connectors
  • Data Flows
  • Pipelines

However, there are features unique to each one.

Same-same while not being same-same.

Which is the right tool for your needs?

The million-dollar question. With so many choices, which should we choose?

Difference between Integration and Orchestration

There is a difference between 'integration' and 'orchestration'.

Integration moves and transforms data from various sources to one or more destinations.

Typically, Dataflow Gen2 and Data Pipelines are enough for integrations.

Core integration components

Orchestration, on the other hand, coordinates and automates multiple tasks, ensuring they execute in a prescribed sequence and under the right conditions.

In addition to pipelines and dataflows, triggers and dependencies are also needed for a properly orchestrated workflow.
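To make 'coordinates and automates' a bit more concrete, here is a rough sketch of starting a pipeline run programmatically. It assumes the Fabric REST API's on-demand job endpoint (jobs/instances?jobType=Pipeline) and an already-acquired Microsoft Entra ID bearer token; the workspace and pipeline IDs are placeholders, and you should confirm the exact endpoint against the current Fabric REST API documentation.

```python
# Rough sketch: trigger an on-demand run of a Fabric data pipeline via REST.
# Assumes the Fabric job scheduler endpoint and a valid Entra ID bearer token;
# WORKSPACE_ID and PIPELINE_ID are placeholders.
import requests

WORKSPACE_ID = "<workspace-guid>"
PIPELINE_ID = "<pipeline-item-guid>"
TOKEN = "<entra-id-bearer-token>"

url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ID}/jobs/instances?jobType=Pipeline"
)

response = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()

# An accepted request is queued; the Location header points to the job
# instance that can be polled for status.
print(response.status_code, response.headers.get("Location"))
```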

Demo: Loading data with Pipelines

1. Select Data pipeline and name it whatever makes sense.
2. Select Copy data as the pipeline activity, then choose More.

💡
This demo is getting data from an online source. Search for the term 'http' and use the following URL:
https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/orders.csv

💡
The 'Allow this connection to be utilized with either on-premises or VNet data gateways' flag allows users to integrate data sources without being constrained by the physical or network location of their data. In this demo we are connecting to an HTTP online source, so there is no need to allow on-premises connections.

3. The data is in CSV format, so File Format must be set to DelimitedText. Also Test connection.
4. Set the connection to the destination (TestLakehouse in this demo). The lakehouse does not have any tables in it, so we will create a New one, named orders.
5. If there is no schema shown, click Import schemas to import the schema for the data source, then modify the data types for source columns (if needed).
6. Leave the Settings tab values as they are.
7. Validate the pipeline. Yayyyy!
8. Put the pedal to the metal (or some other expression of let's go) and Run the pipeline. The output console shows Succeeded.
9. Go back to the Lakehouse, expand Tables, and you will see orders. Click on orders and the data should load in preview mode.

👺
Sometimes, due to caching, you may see a folder titled Unidentified. Refresh the lakehouse screen, and this folder should disappear, replaced by orders.
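If you prefer code over clicks, roughly the same load can be done from a Fabric notebook attached to the lakehouse. This is only a sketch under assumptions (pandas and PySpark are available, and the column handling is a placeholder because the CSV's header layout is not reproduced here); it is not the pipeline's own mechanism.

```python
# Sketch: load orders.csv from the public URL into a lakehouse table,
# mirroring what the Copy data activity does in this demo.
# Column names are placeholders -- adjust header handling to match the file.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in Fabric notebooks

url = "https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/orders.csv"
pdf = pd.read_csv(url, header=None)              # treat every row as data
pdf.columns = [f"col_{i}" for i in pdf.columns]  # placeholder column names

sdf = spark.createDataFrame(pdf)
sdf.write.mode("overwrite").saveAsTable("orders")  # lands as a lakehouse table

print(spark.table("orders").count(), "rows loaded")
```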

Demo: Loading data with DataFlows Gen2

Delete the orders table that was created in the last demo.

Let's start from scratch:

1. Select Dataflow Gen2. For now, leave the Enable Git Integration check box as is (look out for a future article on CI/CD and Fabric).
2. We will use the same CSV file that was used in the previous demo; configure the connection.
3. Once created, the Dataflow displays a preview of the source data.

On successfully creating the Dataflow, its main console will appear.

Notice the stars and the hand?

The Dataflow changed the column types on its own while it investigated the data; the Power Query syntax for that change is provided as one step in the list of Applied steps.

Additionally, notice that no Data destination has been configured for this Dataflow yet.

Let's set a destination, then.

1. Select Lakehouse; a connection to the available lakehouses is created.
2. Confirm the destination is TestLakehouse and a new table named orders.
3. Save settings.
☠️
Why was Use automatic settings turned off?

While Fabric Data Factory's "automatic settings" for data ingestion into a lakehouse can be convenient and efficient for many scenarios, there are several situations where relying solely on them can lead to undesirable outcomes or missed opportunities for optimization.

#1: If the source system schema changes frequently.
#2: When precise data types are crucial for downstream analysis. For example, financial data might require specific decimal precision (see the sketch after this list).
#3: When optimal partitioning and distribution strategies for lakehouse tables must be configured manually.
#4: If raw data needs cleaning, transformation and enrichment.
#5: When source data contains complex structures that need to be transformed/flattened before saving.
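To illustrate #2, here is a tiny, hypothetical sketch of enforcing decimal precision in a notebook instead of trusting automatic type detection; the column name and precision are made up for the example.

```python
# Hypothetical example for #2: cast a money column to an exact decimal type
# rather than relying on automatic type detection. Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()

raw = spark.table("orders")  # table created in these demos

typed = raw.withColumn(
    "unit_price", col("unit_price").cast(DecimalType(10, 2))  # exact 2-decimal money
)

typed.write.mode("overwrite").saveAsTable("orders_typed")
```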
Publish the Dataflow.

At this point, the core component (dataflow) that will be used for data ingestion has been created. We will now wrap a pipeline around the Dataflow component to finish our demo.

Create a new pipeline.
Select the Dataflow option.

The main pipeline authoring panel will appear: the pipeline editor is displayed, a Dataflow 'shell' is placed in the editing space, and a drop-down shows all our Dataflow components. Choose the Dataflow we just published.

Validate and finally Run the pipeline.

Success again!

We can confirm the Dataflow worked as we had hoped by going to the lakehouse and looking at the orders table (along with its data, of course).

Squint your eyes a little here... notice the top-left corner and Tables > orders.
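For an extra check beyond the preview pane, a quick query from a notebook confirms the table landed with data; 'orders' is the table the Dataflow just created.

```python
# Quick sanity check that the orders table exists and has rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created in Fabric notebooks

orders = spark.table("orders")
print("row count:", orders.count())
orders.show(5)  # peek at the first few rows
```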

What's the difference, then, between a Pipeline and a Dataflow?

It's not one or the other; it's a 'better together' story: Dataflows handle the transformation (integration) work, while Pipelines orchestrate when, and under what conditions, that work runs alongside other activities.

I write to remember, and if, in the process, I can help someone learn about Containers, Orchestration (Docker Compose, Kubernetes), GitOps, DevSecOps, VR/AR, Architecture, and Data Management, that is just icing on the cake.