Optimizing performance and cost in Microsoft Fabric Lakehouse

Microsoft Fabric’s Lakehouse platform uses Delta Lake for its tables, an open-source storage format that brings versioning and ACID transaction capabilities to the Lakehouse. Since storage forms the foundation of a data architecture, it is essential to understand the practices that can improve performance and reduce costs. This blog explains the Delta format and explores the maintenance practices associated with it.

Delta Lake

The Delta Lake format stores data in Parquet files, a columnar storage format, and in addition uses JSON files to store the transaction log and related metadata. This metadata enables data versioning and provides statistics that improve read query performance.


A Delta table consists of multiple Parquet files that can represent different versions of the data accumulated over time. For example, a data pipeline that updates a table once per day will produce 7 versions of the data over a week.

For every commit to a Delta table, a new JSON file is created in the _delta_log folder, storing the version information and the names of the Parquet files that make up that version. A checkpoint file in Parquet format is automatically generated within _delta_log as needed; it contains the aggregated transactions, which allows faster reading of the transaction log.
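The versions that accumulate for a table can be inspected through the Delta history API; a minimal sketch (the table name is illustrative):

from delta.tables import DeltaTable

# List the commits recorded in _delta_log: one row per table version
deltaTable = DeltaTable.forName(spark, "Sales.Sales")
deltaTable.history().select("version", "timestamp", "operation").show(truncate=False)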

Optimize

When data is written to a table, multiple Parquet files are created, and over time reading data from many small Parquet files causes performance issues (the ‘small file problem’). The OPTIMIZE command coalesces many small Parquet files into fewer, larger files, which improves read performance significantly in some cases.
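Compaction can be triggered on demand through Spark SQL or the Delta Python API; a minimal sketch (the table name is illustrative):

from delta.tables import DeltaTable

# Coalesce small Parquet files into fewer, larger ones - via Spark SQL ...
spark.sql("OPTIMIZE Sales.Sales")

# ... or via the Delta Python API
DeltaTable.forName(spark, "Sales.Sales").optimize().executeCompaction()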

Delta tables that receive frequent small-batch commits (for example from SQL statements) can use the ‘optimizeWrite’ option, set either as a Spark session configuration or as a table property, which reduces the number of small files generated.
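As a rough sketch of how optimized write can be turned on, the session configuration key and table property below are assumptions based on the Fabric and Delta Lake documentation and should be verified against your runtime:

# Session level: enable optimized write for subsequent writes in this Spark session
# (configuration key assumed from Fabric documentation)
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")

# Table level: set optimized write as a table property
# (property name assumed; verify for your Delta/Fabric runtime)
spark.sql("ALTER TABLE Sales.Sales SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')")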

The OPTIMIZE operation is idempotent; there is no additional benefit in running the command twice consecutively without any new commits in between.

Z-Order

Based on the given input column(s), Z-Ordering rearranges related data so that it is stored closer together on disk. This particularly boosts performance for queries that filter on the Z-Order columns, as it helps skip irrelevant data faster. The more input columns used, the smaller the performance gain from each, so it is recommended to use only a few input columns.
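Z-Ordering is applied as part of OPTIMIZE; for example, through Spark SQL (the table and column names are illustrative):

# Compact the table and co-locate rows by the column most often used in filters
spark.sql("OPTIMIZE Sales.Sales ZORDER BY (Region)")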

V-Order

V-Order applies additional data operations such as sorting and compression, which improves the read throughput of the Power BI and SQL engines that make use of Microsoft Verti-scan technology. Compute engines that do not use Verti-scan also tend to gain some performance from this option.

V-Order is enabled by default in Microsoft Fabric, and when using Apache Spark within Fabric, write commits use V-Order unless it is explicitly disabled in the Spark session or on the table. V-Order is independent of Z-Order, so both operations can be applied without negatively affecting each other.
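As a rough sketch, V-Order can be controlled per session or per table; the configuration key and table property below are assumptions based on the Fabric documentation and may change between runtimes:

# Session level: disable (or re-enable) V-Order for writes in this Spark session
# (configuration key assumed from Fabric documentation)
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")

# Table level: enable V-Order through a table property (property name assumed)
spark.sql("ALTER TABLE Sales.Sales SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')")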

Vacuum

As mentioned earlier, Delta Lake stores multiple versions of the data, and over time these versions accumulate and increase storage costs. Keeping versions of a few hundred tables for over a month can lead to large bills for storage alone. Older versions of data can be safely removed with the VACUUM operation. The default retention period is 7 days, so VACUUM removes data files that are no longer referenced and are older than 7 days.
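VACUUM can be run with the default retention or with an explicit retention given in hours; a minimal sketch (the table name is illustrative):

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "Sales.Sales")
deltaTable.vacuum()     # remove unreferenced files older than the default 7-day retention
deltaTable.vacuum(168)  # or pass the retention explicitly, in hours (168 hours = 7 days)

# Equivalent Spark SQL
spark.sql("VACUUM Sales.Sales RETAIN 168 HOURS")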

The operations – OPTIMIZE, V-Order, Z-Order and VACUUM – are table maintenance operations that need to be run periodically to keep performance and costs optimized.

Fabric provides a GUI option in the context menu of each table to run ad hoc maintenance operations – OPTIMIZE, V-Order and VACUUM.

But running these operations manually for every table on a periodic basis is cumbersome.

The following PySpark code can be used to automate this by scheduling the notebook. It is not necessary to run it after every pipeline run; depending on the frequency of the runs and the number of commits they make to the tables, it can be scheduled to run periodically.

				
from delta.tables import DeltaTable

# Table names and their corresponding Z-Order input columns as a dictionary
table_zorder = {
    "Sales.public_holidays": ["countryOrRegion", "holidayName"],
    "Sales.Sales": ["Region"]
}

# For each table, the OPTIMIZE, Z-ORDER and VACUUM operations are applied
for name, zorder_cols in table_zorder.items():
    deltaTable = DeltaTable.forName(spark, name)
    deltaTable.vacuum()                                   # remove unreferenced files older than the retention period
    deltaTable.optimize().executeCompaction()             # coalesce small files
    deltaTable.optimize().executeZOrderBy(*zorder_cols)   # co-locate data on the Z-Order columns

The above code loops through the tables in the table_zorder dictionary and uses the corresponding Z-Order input columns to perform the operations. This dictionary can also be built from a config file that is read by Spark. There is no V-Order call in the code because, when V-Order is enabled for the table or the session, the files written by OPTIMIZE are automatically V-Ordered.
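As an illustration of the config-file approach, the mapping could be stored as JSON in the Lakehouse Files area and loaded at the start of the notebook (the file path below is a hypothetical example):

import json

# Load the table-to-Z-Order-columns mapping from a JSON config file
# (the path is illustrative; it assumes the file sits in the default Lakehouse Files area)
with open("/lakehouse/default/Files/config/table_zorder.json") as f:
    table_zorder = json.load(f)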

By using the above operations on Delta Lake tables, performance and cost can be optimized to a great extent. The operations add a small overhead at write time, but given the improvement at read time, they are worthwhile for analytical workloads.
