Microsoft Fabric is an all-in-one analytics solution for enterprises, designed to help organizations create, use, and govern data insights effectively. Within the Fabric ecosystem, Spark integration lets developers harness the power of Apache Spark for big data processing. Spark is a powerful open-source distributed computing framework known for its scalability, speed, fault tolerance, and rich set of processing capabilities.
Microsoft Spark Utilities (MSSparkUtils) is a utility library developed by Microsoft for Apache Spark. It offers a variety of helper functions and utilities to simplify common tasks such as file system operations, managing Lakehouse artifacts, retrieving environment variables, running multiple notebooks concurrently, and handling secrets. In this article, we will explore some useful techniques for leveraging MSSparkUtils in Microsoft Fabric notebooks.
MSSparkUtils in Notebooks
MSSparkUtils is a built-in utility available in Fabric Notebooks. You can access it using the mssparkutils prefix. Alternatively, you can import it with a preferred alias, such as msu.
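For example, a minimal sketch (the try/except guard is only there so the snippet also loads outside a Fabric notebook, where the notebookutils package is unavailable):

```python
# mssparkutils is preinstalled in Fabric notebooks, so it can be used
# directly via the mssparkutils prefix, or imported under an alias.
try:
    from notebookutils import mssparkutils as msu  # inside a Fabric notebook
except ImportError:  # running outside Fabric (e.g. locally)
    msu = None

# Inside a notebook, both forms refer to the same utility library:
# mssparkutils.fs.ls("Files/")
# msu.fs.ls("Files/")
```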
Check the Available Methods
To explore the available methods in MSSparkUtils, you can list them as shown below:
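One simple way, sketched here as a helper function, is to inspect the library's public attributes; the module also ships built-in help commands:

```python
def list_mssparkutils_members():
    """Return the public modules/attributes exposed by mssparkutils
    (typically fs, lakehouse, notebook, credentials, env, ...)."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    return [name for name in dir(mssparkutils) if not name.startswith("_")]

# Inside a notebook you can also print the built-in documentation directly:
# mssparkutils.fs.help()        # file system utilities
# mssparkutils.notebook.help()  # notebook orchestration utilities
```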
Working with Lakehouses
A Microsoft Fabric Lakehouse is a unified data storage and processing architecture that combines the benefits of data lakes and data warehouses. Using MSSparkUtils in notebooks, you can easily create, delete, or list Lakehouses within a specific workspace.
Creating a New Lakehouse
To create a new Lakehouse, use the following command:
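A sketch of the call, wrapped in a helper (the exact positional order of the optional description and workspace ID arguments to lakehouse.create should be verified against your Fabric runtime):

```python
def create_primary_lakehouse(workspace_id: str = ""):
    """Create a Lakehouse named Primary_Lakehouse.

    With no workspace_id, it is created in the current workspace.
    """
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    if workspace_id:
        # Assumed order: (name, description, workspaceId).
        return mssparkutils.lakehouse.create("Primary_Lakehouse", "", workspace_id)
    return mssparkutils.lakehouse.create("Primary_Lakehouse")
```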
This command creates a Lakehouse named “Primary_Lakehouse” in the current workspace. If you want to create the Lakehouse in a different workspace, you need to provide the workspace ID as a parameter.
Listing Lakehouses in a Workspace
To check the number of Lakehouses available in a workspace, use the following code:
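A minimal sketch, again as a helper so it can target either the current workspace or another one by ID:

```python
def list_workspace_lakehouses(workspace_id: str = ""):
    """Return the Lakehouse artifacts in a workspace (current one by default)."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    if workspace_id:
        return mssparkutils.lakehouse.list(workspace_id)
    return mssparkutils.lakehouse.list()

# Inside a notebook:
# lakehouses = list_workspace_lakehouses()
# print(len(lakehouses))  # number of Lakehouses in the workspace
```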
The output may appear cluttered. To simplify, let’s extract only the names of the lakehouses. Run the following code, and feel free to customize the parameters as needed.
Create a Directory in the Lakehouse
Now that we have created a Lakehouse, the next step is to create directories under its default Files section. Directories are essential for organizing large volumes of data. MSSparkUtils provides a built-in method called mkdirs that we can use for this purpose.
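A sketch of the call (the raw_files directory name matches the one used later in this article; relative paths resolve against the notebook's attached default Lakehouse):

```python
def create_raw_files_directory():
    """Create a raw_files directory under the default Lakehouse's Files section."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    # mkdirs also creates any missing parent directories along the path.
    mssparkutils.fs.mkdirs("Files/raw_files")
```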
Copy Files to a Secondary Lakehouse
When working with data, it is often necessary to copy files from one lakehouse to another. MSSparkUtils simplifies this process by allowing you to copy files between lakehouses using notebooks. Below is an example of copying data from a primary lakehouse to a secondary lakehouse.
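A sketch using an ABFSS OneLake path for the destination; the workspace placeholder and the Secondary_Lakehouse item name are assumptions to replace with your own values:

```python
def copy_raw_files_to_secondary():
    """Copy Files/raw_files from the default Lakehouse to Secondary_Lakehouse."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    source = "Files/raw_files"
    # OneLake ABFSS paths for Lakehouse items follow this general pattern:
    destination = (
        "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
        "Secondary_Lakehouse.Lakehouse/Files/raw_files"
    )
    # The third argument enables recursive copying of the whole directory.
    mssparkutils.fs.cp(source, destination, True)
```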
Move Different Types of Files Using Custom Function
In our raw_files directory, we have two types of files. We will now move these files to the secondary_lakehouse based on their file types. To achieve this, we will create a custom function using MSSparkUtils. This function will allow us to quickly move files according to their types.
Moving Files from Local Lakehouse
Using ABFSS Path for Moving Files
If you are using an ABFSS path, try the following code:
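A sketch of the ABFSS variant; the logic is the same, but both source and destination are fully qualified OneLake URIs rather than paths on the default mount point:

```python
def move_files_by_type_abfss(file_type, file_info_list, destination_directory):
    """ABFSS variant: move files of a given type using fully qualified paths."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    for f in file_info_list:
        if f.name.endswith(file_type):
            # When file_info_list comes from mssparkutils.fs.ls() on an
            # abfss:// directory, f.path is already a fully qualified URI.
            mssparkutils.fs.mv(f.path, f"{destination_directory}/{f.name}", True)
```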
Our custom function is now ready. It accepts three parameters:
- file_type: The type (extension) of the files you want to move.
- file_info_list: A list of files in the source directory.
- destination_directory: The directory where you want to move the files.
You can use the first code snippet for the current mount point and the second one for moving files using an ABFSS path.
Let’s call the function to move the files. Our goal is to move specific file types one at a time.
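A sketch of the calls, assuming the custom function is named move_files_by_type as above; the json_files and csv_files destination folder names and the workspace placeholder are illustrative assumptions:

```python
def move_json_then_csv():
    """List the source directory once, then move one file type at a time."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    file_info_list = mssparkutils.fs.ls("Files/raw_files")
    destination = (
        "abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/"
        "Secondary_Lakehouse.Lakehouse/Files"
    )
    # move_files_by_type is the custom function defined earlier.
    move_files_by_type(".json", file_info_list, f"{destination}/json_files")
    move_files_by_type(".csv", file_info_list, f"{destination}/csv_files")
```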
Finally, we have successfully moved the JSON files to the destination folder.
Copy (cp) vs Fast Copy (Fast cp)
Earlier in this article, we discussed two methods for copying files using MSSparkUtils: cp and fastcp. The fastcp method is designed for high-performance copying of large volumes of data, making it the preferred choice for bulk operations.
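A sketch showing the two calls side by side; both take source, destination, and a recursion flag:

```python
def bulk_copy(source, destination):
    """Copy a directory tree with fastcp, which parallelizes the transfer."""
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    # cp takes the same arguments but copies sequentially:
    # mssparkutils.fs.cp(source, destination, True)
    mssparkutils.fs.fastcp(source, destination, True)
```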
Conclusion
In conclusion, MSSparkUtils’ Lakehouse utilities significantly streamline the management of Lakehouse artifacts. By integrating these tools into your Fabric pipelines, you can enhance your data management workflows and improve efficiency. Explore more insights and tips by checking out our other articles on Microsoft Fabric here.