Optimizing performance and cost in Microsoft Fabric Lakehouse

Microsoft Fabric’s Lakehouse platform uses Delta Lake for its tables, an open-source storage format that brings versioning and ACID transaction capabilities to the Lakehouse. Since storage forms the foundation of a data architecture, it is essential to understand the practices that can improve performance and reduce costs. This blog explains the Delta format and explores the maintenance practices associated with it.

Delta Lake

The Delta Lake format stores data in Parquet files, a columnar storage format, and in addition uses JSON files to store the transaction log and related metadata. This metadata enables data versioning and provides statistics that improve read query performance.


A Delta table consists of multiple Parquet files that can represent different versions of the data accumulated over time. For example, a data pipeline that updates a table once per day will produce 7 versions of the data over a week.

For every commit to a Delta table, a new JSON file is created in the _delta_log folder, storing the version information and the names of the Parquet files that make up that version. A checkpoint file in Parquet format is automatically generated within _delta_log as needed; it contains the aggregated transactions, which allows faster reading of the transaction log.
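The versions that accumulate for a table can be inspected through the Delta history API; a minimal sketch (the table name is illustrative):

from delta.tables import DeltaTable

# List the commits recorded in _delta_log: one row per table version
deltaTable = DeltaTable.forName(spark, "Sales.Sales")
deltaTable.history().select("version", "timestamp", "operation").show(truncate=False)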

Optimize

When data is written to a table, multiple Parquet files are created, and over time reading data from many small Parquet files causes performance issues (the ‘small file problem’). The OPTIMIZE command coalesces many small Parquet files into fewer, larger files, which improves read performance significantly in some cases.
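Compaction can be triggered on demand through Spark SQL or the Delta Python API; a minimal sketch (the table name is illustrative):

from delta.tables import DeltaTable

# Coalesce small Parquet files into fewer, larger ones - via Spark SQL ...
spark.sql("OPTIMIZE Sales.Sales")

# ... or via the Delta Python API
DeltaTable.forName(spark, "Sales.Sales").optimize().executeCompaction()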

Delta tables that receive frequent small-batch commits (for example from SQL statements) can use the ‘optimizeWrite’ option, set either as a Spark session configuration or as a table property, which reduces the number of small files generated.
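As a rough sketch of how optimized write can be turned on, the session configuration key and table property below are assumptions based on the Fabric and Delta Lake documentation and should be verified against your runtime:

# Session level: enable optimized write for subsequent writes in this Spark session
# (configuration key assumed from Fabric documentation)
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")

# Table level: set optimized write as a table property
# (property name assumed; verify for your Delta/Fabric runtime)
spark.sql("ALTER TABLE Sales.Sales SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')")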

The OPTIMIZE operation is idempotent; there is no additional benefit in running the command twice consecutively without any new commits in between.

Z-Order

Based on the given input column(s), Z-Ordering rearranges related data so that it is stored closer together on disk. This particularly boosts performance for queries that filter on the Z-Order columns, as it helps skip irrelevant data faster. The more input columns used, the smaller the performance gain from each, so it is recommended to use only a few input columns.
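Z-Ordering is applied as part of OPTIMIZE; for example, through Spark SQL (the table and column names are illustrative):

# Compact the table and co-locate rows by the column most often used in filters
spark.sql("OPTIMIZE Sales.Sales ZORDER BY (Region)")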

V-Order

V-Order applies additional data operations such as sorting and compression, which improves the read throughput of the Power BI and SQL engines that make use of Microsoft Verti-scan technology. Compute engines that do not use Verti-scan also tend to gain some performance from this option.

V-Order is enabled by default in Microsoft Fabric, and when using Apache Spark within Fabric, write commits use V-Order unless it is explicitly disabled in the Spark session or on the table. V-Order is independent of Z-Order, so both operations can be applied without negatively affecting each other.
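As a rough sketch, V-Order can be controlled per session or per table; the configuration key and table property below are assumptions based on the Fabric documentation and may change between runtimes:

# Session level: disable (or re-enable) V-Order for writes in this Spark session
# (configuration key assumed from Fabric documentation)
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")

# Table level: enable V-Order through a table property (property name assumed)
spark.sql("ALTER TABLE Sales.Sales SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')")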

Vacuum

As mentioned earlier, Delta Lake stores multiple versions of the data, and over time these versions accumulate and increase storage costs. Keeping versions of a few hundred tables for over a month can lead to large bills for storage alone. Older versions of data can be safely removed with the VACUUM operation. The default retention period is 7 days, so VACUUM removes data files that are no longer referenced and are older than 7 days.
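VACUUM can be run with the default retention or with an explicit retention given in hours; a minimal sketch (the table name is illustrative):

from delta.tables import DeltaTable

deltaTable = DeltaTable.forName(spark, "Sales.Sales")
deltaTable.vacuum()     # remove unreferenced files older than the default 7-day retention
deltaTable.vacuum(168)  # or pass the retention explicitly, in hours (168 hours = 7 days)

# Equivalent Spark SQL
spark.sql("VACUUM Sales.Sales RETAIN 168 HOURS")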

The operations – OPTIMIZE, V-Order, Z-Order and VACUUM – are table maintenance operations that need to be run periodically to keep performance and costs optimized.

Fabric provides a GUI option in the context menu of each table to run ad hoc maintenance operations – OPTIMIZE, V-Order and VACUUM.

But running these operations manually for every table on a periodic basis is cumbersome.

The following PySpark code can be used to automate this by scheduling the notebook. It is not necessary to run it after every pipeline run; depending on the frequency of the runs and the number of commits they make to the tables, it can be scheduled to run periodically.

				
from delta.tables import DeltaTable

# Table names and their corresponding Z-Order input columns as a dictionary
table_zorder = {
    "Sales.public_holidays": ["countryOrRegion", "holidayName"],
    "Sales.Sales": ["Region"]
}

# For each table, the OPTIMIZE, Z-ORDER and VACUUM operations are applied
for name, zorder_cols in table_zorder.items():
    deltaTable = DeltaTable.forName(spark, name)
    deltaTable.vacuum()                                   # remove unreferenced files older than the retention period
    deltaTable.optimize().executeCompaction()             # coalesce small files
    deltaTable.optimize().executeZOrderBy(*zorder_cols)   # co-locate data on the Z-Order columns

The above code loops through the tables in the table_zorder dictionary and uses the corresponding Z-Order input columns to perform the operations. This dictionary can also be built from a config file that is read by Spark. There is no V-Order call in the code because, when V-Order is enabled for the table or the session, the files written by OPTIMIZE are automatically V-Ordered.
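As an illustration of the config-file approach, the mapping could be stored as JSON in the Lakehouse Files area and loaded at the start of the notebook (the file path below is a hypothetical example):

import json

# Load the table-to-Z-Order-columns mapping from a JSON config file
# (the path is illustrative; it assumes the file sits in the default Lakehouse Files area)
with open("/lakehouse/default/Files/config/table_zorder.json") as f:
    table_zorder = json.load(f)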

By using the above operations on Delta Lake tables, performance and cost can be optimized to a great extent. The operations add a small overhead at write time, but given the improvement at read time, they are worthwhile for analytical workloads.
