Microsoft Fabric is an all-in-one analytics solution for enterprises that helps you create, use, and govern data insights across your organization. In the context of Fabric, Spark integration enables developers to leverage Spark’s capabilities for big data processing within the Fabric ecosystem.
Spark is a powerful open-source distributed computing framework known for its scalability, speed, fault tolerance, and rich set of processing capabilities.
Microsoft Spark Utilities (MSSparkUtils) is a utility library for Apache Spark developed by Microsoft. It provides various helper functions and utilities that simplify common tasks like working with the file system, managing Lakehouse artifacts, getting environment variables, running multiple notebooks together, and working with secrets. In this article, I will walk you through some tricks you can leverage when working with notebooks in Microsoft Fabric.
MSSparkUtils is a built-in utility available in Fabric notebooks. You can use it directly with the mssparkutils prefix, or import it under a preferred alias such as msu.
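For example (a minimal sketch; the help() call just prints the commands available in a module):

```python
# mssparkutils is available out of the box in a Fabric notebook
mssparkutils.fs.help()

# Or import it explicitly under an alias of your choice
from notebookutils import mssparkutils as msu
msu.fs.help()
```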
A Microsoft Fabric Lakehouse is a unified data storage and processing architecture that brings together the benefits of data lakes and data warehouses. Leveraging notebooks and MSSparkUtils, you can easily create, delete, or list the Lakehouses in a particular workspace.
To create a new Lakehouse, run the following command.
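A minimal sketch using the Lakehouse utilities:

```python
# Create a Lakehouse named "Primary_Lakehouse" in the current workspace
mssparkutils.lakehouse.create("Primary_Lakehouse")
```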
This creates a Lakehouse named “Primary_Lakehouse” in the workspace you are currently working in. If you want to create the Lakehouse in a different workspace, pass the workspace ID as a parameter, as shown below.
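A sketch, assuming the workspaceId keyword exposed by the Lakehouse utilities; the ID itself is a placeholder:

```python
# Create the Lakehouse in another workspace by passing its workspace ID
# ("<target-workspace-id>" is a placeholder -- substitute your own)
mssparkutils.lakehouse.create("Primary_Lakehouse", workspaceId="<target-workspace-id>")
```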
Next, let's list the Lakehouses available in the workspace.
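For instance:

```python
# List all Lakehouse artifacts in the current workspace
lakehouses = mssparkutils.lakehouse.list()
print(lakehouses)
```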
The raw output looks messy, doesn't it? Let's get just the names by running the following code; you can certainly customize the parameters.
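A sketch, assuming each returned artifact exposes a displayName property (as the list output above suggests):

```python
# Print only the display name of each Lakehouse in the workspace
for lakehouse in mssparkutils.lakehouse.list():
    print(lakehouse.displayName)
```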
So far, we have created a Lakehouse; now we can create directories under its default Files section. Directories help a lot when managing large volumes of data, and there is a built-in method named mkdirs that we can use here.
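For example, creating the raw_files directory used later in this walkthrough:

```python
# Create a raw_files directory under the Files section of the default Lakehouse
mssparkutils.fs.mkdirs("Files/raw_files")
```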
While playing with your data, you will often need to copy files from one Lakehouse to another. MSSparkUtils can give you a good hand here when used from a notebook. Let's check that out.
We are copying data from our primary Lakehouse to our secondary Lakehouse.
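A minimal sketch of the copy, with placeholder abfss paths (substitute your own workspace and Lakehouse names):

```python
# OneLake paths for the source and destination (placeholders)
source = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/Primary_Lakehouse.Lakehouse/Files/raw_files"
destination = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/Secondary_Lakehouse.Lakehouse/Files/raw_files"

# Copy the whole directory; the third argument enables recursive copy
mssparkutils.fs.cp(source, destination, True)
```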
You can see we have two types of files in our raw_files directory. Now we are going to store the files in Secondary_Lakehouse according to their type. We will do this by building a custom function around MSSparkUtils so that we can reuse it quickly. Let's jump in and write the code below.
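The original listing is not shown here, so the following is a minimal sketch; the function name move_files and its exact structure are assumptions:

```python
def move_files(file_type, file_info_list, destination_directory):
    """Move every file with the given extension into destination_directory,
    a path under the current mount point (e.g. "Files/...")."""
    for file_info in file_info_list:
        if file_info.name.endswith(f".{file_type}"):
            mssparkutils.fs.mv(
                file_info.path,                               # source file
                f"{destination_directory}/{file_info.name}",  # target path
                True,  # create the destination path if it does not exist
            )
```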
Use the variant below when you are working with abfss paths.
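Again a sketch; it is identical in shape, but the destination is a fully qualified abfss:// URI, so the files can land in a different Lakehouse or workspace:

```python
def move_files_abfss(file_type, file_info_list, destination_abfss_path):
    """Move every file with the given extension to a destination given as a
    fully qualified abfss:// path (e.g. another Lakehouse's Files section)."""
    for file_info in file_info_list:
        if file_info.name.endswith(f".{file_type}"):
            mssparkutils.fs.mv(
                file_info.path,
                f"{destination_abfss_path}/{file_info.name}",
                True,  # create the destination path if it does not exist
            )
```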
Now our custom function is ready. It takes three parameters: file_type, the type of files you want to move; file_info_list, the list of files in the source directory; and destination_directory, the directory where you want to move the files. Use the first version with the current mount point, and the second (move_files_abfss) when you want to move files using an abfss path.
Let's call the function and move the files. Our goal is to move one type of file at a time.
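Continuing the sketch, we first list the source directory and then move the first file type; CSV is assumed here, and the destination path is a placeholder:

```python
# Gather the files sitting in the source directory of the primary Lakehouse
file_info_list = mssparkutils.fs.ls("Files/raw_files")

# Files section of the secondary Lakehouse (placeholder path)
secondary_files = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "Secondary_Lakehouse.Lakehouse/Files"
)

# Move the CSV files first
move_files_abfss("csv", file_info_list, f"{secondary_files}/csv_files")
```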
Finally, we can move our JSON files to the destination folder.
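For example:

```python
# Move the JSON files into their own folder in the secondary Lakehouse
move_files_abfss("json", file_info_list, f"{secondary_files}/json_files")
```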
As you saw in the first part of this article, MSSparkUtils provides two methods for copying files: cp and fastcp. The performant copy, fastcp, offers a faster way to copy large volumes of data.
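A sketch reusing the placeholder paths from the copy example above; the call mirrors the shape of cp:

```python
# fastcp takes the same arguments as cp; the third enables recursive copy
mssparkutils.fs.fastcp(source, destination, True)
```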
To conclude, MSSparkUtils' Lakehouse utilities streamline the management of your Lakehouse artifacts, and you can seamlessly integrate them into your Fabric pipelines to enrich your data management workflow.