
Data Engineering in Microsoft Fabric: Part 2 – Ingestion – Dataflow Gen2

Rejaul Islam

Data Analyst, Data Crafters

A dataflow is a collection of tables that performs data transformations in the Power BI service independently of any report. It leverages the Power Query engine with the Common Data Model to carry out the required transformations and make the results available for reuse by other dataflows and by the Power BI semantic model (previously known as a dataset).

Transformation is essential not just for data cleansing but also for handling extensive datasets effectively. According to Matthew Roche from Microsoft’s Power BI CAT team, “Data should be transformed as far upstream as possible, and as far downstream as necessary.” In simple words, if you can transform your data at its source or outside the Power BI semantic model, you should do so. By reducing the burden that Power BI must handle on its own, this approach maximizes effectiveness and efficiency.

Dataflows, introduced in 2019, have evolved with the release of Dataflows Gen2 in Q1 2024 within the Microsoft Fabric ecosystem. While both are powered by Power Query, the major difference is that Dataflow Gen2 lets you define data destinations, as detailed in the official Microsoft documentation here.

 

Why Choose Dataflow Gen2?

  • Simplified Authoring: Enjoy a streamlined process with shorter steps and ongoing feature enhancements for an improved user experience.
  • Autosave Feature: Keep your work safe with autosaved drafts, enabling uninterrupted development without needing to publish dataflows.
  • Enhanced Data Destinations: Utilize data destinations to store processed data in Lakehouse, Warehouse, Azure SQL Database, or Azure Data Explorer (Kusto).
  • Pipeline Integration: Seamlessly execute dataflow-related actions directly from Pipelines, enhancing workflow efficiency.
  • Improved Performance: Benefit from an enhanced computing engine, optimizing transformations and data retrieval with Lakehouse and Warehouse items in your workspace.

Although Data Factory in Fabric currently comprises only two primary item types, namely Dataflow Gen2 and Data Pipeline, our focus here is exclusively on creating a Dataflow Gen2. We will guide you through the step-by-step process of harnessing the robust capabilities offered by Dataflow Gen2, shedding light on its potential within the Microsoft Fabric ecosystem.

 

Create Workspace

To create a Gen2 dataflow, you must have a workspace on Fabric capacity. Begin by accessing the Power BI service and creating a workspace with a Fabric trial. In this example, we named our workspace Practice (1), selected the Trial license type (2), and finally clicked Apply (3).

[Screenshot: Creating a workspace in the Power BI service]


Create Dataflow Gen2

Dataflow Gen2 belongs to the Data Factory experience, and we can navigate to Data Factory in two ways.

Method 1: Click the Power BI icon (1) in the bottom right corner, then click Data Factory (2).

[Screenshot: Navigating to Data Factory in Microsoft Fabric]


Method 2: In the top right corner, click New (1), then click More Options (2).

[Screenshot: Navigating to Data Factory, method 2]

You will then see all the services available under Fabric. From there, under Data Factory, click Dataflow Gen2 (3).


Get Data


Now, let us begin importing data into your dataflow. From here, the actual extract, transform, and load (ETL) process starts. In this instance, we will extract data from a CSV dataset. To do so, from the Home navigation click Get Data (1) and then Text/CSV (2), or click Import from a Text/CSV file (3) on the initial preview page.

[Screenshot: Get Data options in Dataflow Gen2]

Now the following connection window will open, and you need to provide a few pieces of information accordingly.

Here,

1. Path: https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/sales.csv
2. Connection: Create a new connection
3. Data gateway: none
4. Authentication kind: Anonymous
5. Privacy Level: None
6. Action: Click Next

Now the data preview window will open; you do not have to change anything, just click the Create button.

[Screenshot: Preview of the data]
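
Behind the scenes, the Text/CSV connector generates an M query for this source. A minimal sketch, assuming default delimiter and encoding options (the step names and options generated in your dataflow may differ):

    let
        // Download the CSV file and parse it as a table
        Source = Csv.Document(
            Web.Contents("https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/sales.csv"),
            [Delimiter = ",", Encoding = 65001]
        ),
        // Use the first row of the file as column headers
        #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
    in
        #"Promoted Headers"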


A Space for Data Transformation

If you are acquainted with Power Query, this interface will feel like home. Here you will find all the familiar functionality, akin to a cloud-based version of the Power Query engine. Let us take a brief look at the Power Query editor.

[Screenshot: Overview of the Power Query editor]

Here,

1. Navigation Bar: Offers various features to simplify and enhance data transformation.
2. Navigation Details: Provides insights into the active or selected navigation item.
3. Queries: Lists all queries or tables available for manipulation.
4. Flow Diagram: Aids in understanding the modifications made within the dataflow; to disable it, click View -> Diagram View.
5. Formula Bar: Displays the last applied step in M code by default; clicking a specific applied step reveals its code here.
6. Data Preview: Displays sample data after the applied steps’ formulas have run.
7. Applied Steps: Records each change as a new step, allowing the flexibility to revisit any applied step as needed.


Create Custom Columns

Gross Sales: Take a glance at our dataset, which contains Quantity and UnitPrice columns. Multiplying these two columns yields the Gross Sales value. To achieve this, from the top navigation click Add Column -> Custom Column and enter the following.

[Screenshot: Creating a custom column in Power Query]

Here,

1. Column Name: Gross Sales
2. Formula: [Quantity] * [UnitPrice]
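
In the formula bar, this appears as a Table.AddColumn step along the lines of the following sketch (PreviousStep stands for whatever step name precedes it in your query):

    // Add a Gross Sales column computed row by row
    #"Added Gross Sales" = Table.AddColumn(PreviousStep, "Gross Sales", each [Quantity] * [UnitPrice])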

Net Sales: Now that we have calculated Gross Sales, subtracting TaxAmount from it gives us the Net Sales amount. To proceed, again navigate to Add Column -> Custom Column and execute the following calculation.

[Screenshot: Creating the Net Sales custom column in Power Query]

Here,

1. Column Name: Net Sales
2. Formula: [Gross Sales] - [TaxAmount]
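
Again, the equivalent M step is a sketch along these lines (PreviousStep is the preceding step name):

    // Subtract the tax from Gross Sales to get Net Sales
    #"Added Net Sales" = Table.AddColumn(PreviousStep, "Net Sales", each [Gross Sales] - [TaxAmount])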


Adjusting Data Types

After creating the new Gross Sales and Net Sales columns, it is essential to ensure they have the correct data types to maintain data integrity and accuracy. Since these columns represent currency or decimal data, it is advisable to set the data type of both to “Decimal number”. Simply click the marked data type icon in each column header and choose “Decimal number” to ensure an accurate representation of the data.
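
In M, both conversions appear as a single type-transformation step; Decimal number corresponds to type number. Roughly:

    // Set both new columns to Decimal number
    #"Changed Column Types" = Table.TransformColumnTypes(PreviousStep,
        {{"Gross Sales", type number}, {"Net Sales", type number}})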

[Screenshot: Adjusting the Decimal number data type]


Final Processed Data
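
To recap, the complete query at this point looks roughly like the following M script. This is a sketch assembled from the steps above; the auto-generated step names and the automatic type-detection step in your editor may differ.

    let
        // Download and parse the CSV file
        Source = Csv.Document(
            Web.Contents("https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/sales.csv"),
            [Delimiter = ",", Encoding = 65001]
        ),
        // Use the first row as column headers
        #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        // Assumed step: convert the numeric source columns from text
        #"Changed Source Types" = Table.TransformColumnTypes(#"Promoted Headers",
            {{"Quantity", Int64.Type}, {"UnitPrice", type number}, {"TaxAmount", type number}}),
        // Gross Sales = Quantity * UnitPrice
        #"Added Gross Sales" = Table.AddColumn(#"Changed Source Types", "Gross Sales",
            each [Quantity] * [UnitPrice]),
        // Net Sales = Gross Sales - TaxAmount
        #"Added Net Sales" = Table.AddColumn(#"Added Gross Sales", "Net Sales",
            each [Gross Sales] - [TaxAmount]),
        // Store both new columns as Decimal number
        #"Changed Column Types" = Table.TransformColumnTypes(#"Added Net Sales",
            {{"Gross Sales", type number}, {"Net Sales", type number}})
    in
        #"Changed Column Types"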


Create a Lakehouse

We are going to save the data within a Lakehouse, so first we need to create one as a destination to store the data correctly. If you already have a Lakehouse, you can use it; otherwise, follow these instructions. Click Power BI (1), then click Data Engineering (2), and go to the workspace where you would like to create the Lakehouse.

[Screenshot: Creating a Lakehouse in Microsoft Fabric]

Under the Practice workspace, click New (1), then click Lakehouse (2).

[Screenshot: Another way to create a Lakehouse in Fabric]

Give a meaningful name to your new lakehouse and click Create.


Add a Destination

Now, go back to your dataflow. In the query settings there is an option called Data Destination; click the (+) icon, then select Lakehouse to save the processed data into a Lakehouse.

A dialog screen will open; you don’t have to change anything, just click Next.

Now we need to define the destination path.

[Screenshot: Adding a destination in Dataflow Gen2]
[Screenshot: Setting the destination target in Dataflow Gen2]

Here,

1. Set the name of the table that will be created in the Lakehouse.
2. Under the display options, Practice is our workspace; select the destination Lakehouse (TestLH).
3. Finally, click Next.

A destination settings window will open. By default, Fabric automatically detects the source table and its column types, and creates the destination table with matching columns and types.


Rename Dataflow

Before publishing the dataflow, it is important to assign it a meaningful name. To do this, click Dataflow1 (1) at the top, then enter Sales DF (2) as the new name for the dataflow.

[Screenshot: Renaming the dataflow]


Publish Dataflow

Now that all the required transformations are in place, we need to publish the dataflow to make the changes effective. To do so, click Publish at the bottom right corner of the window. The dataflow will validate every change and, once published, will refresh itself automatically.

[Screenshot: Publishing the dataflow]

Once the dataflow refresh is completed, we will find a new table (sales) under TestLH.

[Screenshot: The new sales table in the Lakehouse]

In wrapping up, let us talk about what we have discovered with Dataflow Gen2. It is awesome how it handles all the steps needed to tweak your data, and the best part? Doing those tweaks in Gen2 is super easy. It fits right into your data pipelines and works smoothly with lakehouses and warehouses. Plus, it is great for separating processing time from your semantic model, which helps Power BI reports and dashboards deliver business insights faster.
