
Leveraging MSSparkUtils for file operations in MS Fabric

Ikramul Islam

Associate Data Analyst, Data Crafters

Microsoft Fabric is an all-in-one analytics solution for enterprises that helps you create, use, and govern data insights across your organization. In the context of Fabric, Spark integration enables developers to leverage Spark’s capabilities for big data processing within the Fabric ecosystem.

Spark is a powerful open-source distributed computing framework known for its scalability, speed, fault tolerance, and rich set of processing capabilities.

Microsoft Spark Utilities (MSSparkUtils) is a utility library for Apache Spark developed by Microsoft. It provides various helper functions and utilities to simplify common tasks like working with the file system, managing Lakehouse artifacts, getting environment variables, running multiple notebooks together, and working with secrets. In this article, I will walk you through some tricks you can leverage when working with notebooks in Microsoft Fabric.

MSSparkUtils in Notebook

MSSparkUtils is a built-in utility available in Fabric notebooks. You can use it directly with the mssparkutils prefix, or import it under an alias of your choice, such as msu.
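
For example, assuming a default Lakehouse is attached to the notebook:

```python
# mssparkutils is available out of the box in a Fabric notebook,
# so you can call it directly:
mssparkutils.fs.ls("Files/")

# Or import it explicitly under an alias of your choice:
from notebookutils import mssparkutils as msu
msu.fs.ls("Files/")
```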

Check the available methods
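
Calling the built-in help() on a module prints the methods it exposes, for example:

```python
# Print the available file system utilities (cp, mv, ls, mkdirs, ...)
mssparkutils.fs.help()

# Print the available Lakehouse artifact utilities (create, get, list, ...)
mssparkutils.lakehouse.help()
```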

Playing with Lakehouse

A Microsoft Fabric Lakehouse is a unified data storage and processing architecture that brings together the benefits of data lakes and data warehouses. Using a notebook and MSSparkUtils, you can easily create, delete, or list the Lakehouses in a particular workspace.

To create a new Lakehouse, run the following command.
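
A minimal sketch; the workspace ID below is a placeholder:

```python
# Create a Lakehouse named "Primary_Lakehouse" in the current workspace
mssparkutils.lakehouse.create("Primary_Lakehouse")

# Create it in a different workspace by passing that workspace's ID
mssparkutils.lakehouse.create("Primary_Lakehouse", workspaceId="<target-workspace-id>")
```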

The first call creates a Lakehouse named “Primary_Lakehouse” in the workspace you are currently working in. To create the Lakehouse in a different workspace, pass that workspace’s ID as a parameter, as the second call shows.

Let’s check the number of Lakehouses available in a workspace.
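
For example:

```python
# Returns the Lakehouse artifacts in the current workspace,
# each with its full metadata
lakehouses = mssparkutils.lakehouse.list()
print(lakehouses)
```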

The raw output looks messy, doesn’t it? Let’s get just the names by running the code below. You can customize the parameters as needed.
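
A sketch, assuming each returned item exposes its name via a displayName field:

```python
# Print only the name of each Lakehouse in the workspace
for lakehouse in mssparkutils.lakehouse.list():
    print(lakehouse.displayName)
```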

Create a directory in the Lakehouse

So far we have created a Lakehouse; now we can create directories under its default Files section. Directories go a long way toward keeping large volumes of data organized. There is a built-in method named ‘mkdirs’ that we can use here.
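
For example, to create a raw_files directory (the abfss path below uses a placeholder for your workspace name):

```python
# Create a directory under the default Lakehouse's Files section
mssparkutils.fs.mkdirs("Files/raw_files")

# Or target a specific Lakehouse with a fully qualified abfss path
mssparkutils.fs.mkdirs(
    "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
    "Primary_Lakehouse.Lakehouse/Files/raw_files"
)
```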

Copy files to a Secondary Lakehouse

You will often need to copy data from one Lakehouse to another while working with your data. MSSparkUtils makes this straightforward from a notebook. Let’s check it out.
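
A sketch with placeholder abfss paths; substitute your own workspace name:

```python
# Source: the raw files in the primary Lakehouse
source = (
    "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
    "Primary_Lakehouse.Lakehouse/Files/raw_files"
)
# Destination: the secondary Lakehouse
destination = (
    "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
    "secondary_lakehouse.Lakehouse/Files/raw_files"
)

# The third argument (recurse) copies the directory's contents as well
mssparkutils.fs.cp(source, destination, True)
```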

We are copying data from our primary Lakehouse to our secondary Lakehouse.

Move Different Types of Files Using Custom Function

We have two types of files in our raw_files directory. Now we are going to store them in secondary_lakehouse according to their type, by building a reusable custom function on top of MSSparkUtils. Let’s jump in and start with the code below.

Moving Files from the Local Lakehouse
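
A sketch of such a helper; move_files_by_type is a name chosen for illustration, and it assumes a default Lakehouse is attached so that relative paths resolve against the current mount point:

```python
def move_files_by_type(file_type, file_info_list, destination_directory):
    """Move every file whose name ends with file_type into destination_directory."""
    for file_info in file_info_list:
        if file_info.name.endswith(file_type):
            # The third argument asks mv to create the destination path if missing
            mssparkutils.fs.mv(
                file_info.path,
                f"{destination_directory}/{file_info.name}",
                True,
            )
```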

Try the variant below when you are using an abfss path.
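
A sketch of the abfss variant (again, the function name is illustrative):

```python
def move_files_by_type_abfss(file_type, file_info_list, destination_directory):
    """Same idea, but source and destination are fully qualified abfss URIs."""
    for file_info in file_info_list:
        if file_info.name.endswith(file_type):
            # file_info.path returned by mssparkutils.fs.ls is already a full
            # abfss:// URI when the listing was done against an abfss path
            mssparkutils.fs.mv(
                file_info.path,
                f"{destination_directory.rstrip('/')}/{file_info.name}",
                True,
            )
```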

Now our custom function is ready. It takes three parameters: file_type, the type of file you want to move; file_info_list, the list of files in the source directory; and destination_directory, the directory you want to move your files to. Use the first version when working against the current mount point, and the second when moving files with abfss paths.

Let’s call the function and move the files, one file type at a time.
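
For illustration, assume the two file types in raw_files are CSV and JSON; we start with the CSVs (the destination path is a placeholder):

```python
# List the files sitting in the primary Lakehouse's raw_files directory
files = mssparkutils.fs.ls("Files/raw_files")

# Move only the CSV files into the secondary Lakehouse
move_files_by_type(
    ".csv",
    files,
    "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
    "secondary_lakehouse.Lakehouse/Files/csv_files",
)
```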

Finally, we can move our JSON files to the destination folder.
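
Same call, different extension (placeholder destination again):

```python
# Move the JSON files into their own folder in the secondary Lakehouse
move_files_by_type(
    ".json",
    files,
    "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
    "secondary_lakehouse.Lakehouse/Files/json_files",
)
```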

Copy (cp) vs Fast Copy (fastcp)

Earlier in this article you saw that MSSparkUtils offers two methods for copying files: cp and fastcp. The performant copy (fastcp) provides a faster way to copy large volumes of data.
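
Both methods share the same calling convention:

```python
# Standard copy
mssparkutils.fs.cp(source, destination, True)

# Performant copy, better suited to large volumes of data
mssparkutils.fs.fastcp(source, destination, True)
```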

To conclude, MSSparkUtils’ Lakehouse and file system utilities streamline the management of your Lakehouse artifacts, and you can seamlessly integrate them into your Fabric pipelines to enrich your data management workflow.
