Navigating Data Storage in Microsoft Fabric Lakehouse and Warehouse

December 12, 2023

Microsoft Fabric offers two storage-compute platforms for batch data processing: Lakehouse and Warehouse. These platforms enable organizations to store and query data, which can then be utilized across other components of Fabric. Both platforms are seamlessly integrated with the broader Fabric ecosystem, serving as central hubs for data storage. While they share some similarities, they are designed for different use cases within an enterprise data architecture. This article explores the distinct characteristics and interaction methods of Lakehouse and Warehouse.

Lakehouse

The Fabric Lakehouse platform is designed for storing and analyzing both structured and unstructured data. It integrates tightly with other Fabric components, simplifying workflows. Lakehouse tables use the Delta format, which is built on top of Parquet files. For every Lakehouse, a default read-only SQL endpoint and a default semantic model are automatically created, enabling seamless consumption within Power BI.

The Lakehouse provides two ways to interact with the data:

  1. Lakehouse view – Allows you to interact with tables at file-level granularity and inspect properties such as the ABFSS path. It also offers maintenance operations such as OPTIMIZE (compacts files to improve the performance of Delta tables) and VACUUM (removes historical snapshots of data).
  2. Default SQL analytics endpoint view – Allows you to query the data using T-SQL and build views on top of tables. This endpoint is read-only, so it does not support creating new tables, inserting data, or deleting data. It includes a Visual Query Builder, similar to the one in Power BI Datamart.
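As an illustration, the two maintenance operations mentioned above can be issued from a Spark SQL cell in a Lakehouse notebook (the table name `sales_orders` is hypothetical):

```sql
-- Compact small Parquet files in the Delta table to improve read performance
OPTIMIZE sales_orders;

-- Remove historical data snapshots that are past the default retention period
VACUUM sales_orders;
```

Note that VACUUM permanently removes old snapshots, which limits Delta time travel to the retained history.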

Lakehouse is predominantly geared towards users familiar with Apache Spark for processing data.

Warehouse

The Fabric Warehouse platform is designed for storing, processing, and analyzing data primarily through T-SQL. It supports creating tables, inserting data, and deleting data using T-SQL, making it well-suited for workflows involving dbt (data build tool). Like Lakehouse, Warehouse stores tables in the Delta format. However, Warehouse abstracts the folder and file details of a table, providing a user-friendly environment similar to Dedicated Synapse SQL. A default semantic model is also created for each Warehouse, similar to Lakehouse.


The Warehouse user interface is similar to the SQL analytics endpoint in Lakehouse, with minor differences such as the ability to generate code for dropping tables and for CREATE TABLE AS SELECT (CTAS).
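For example, a CTAS statement creates and populates a Warehouse table in a single step (the schema, table, and column names below are hypothetical):

```sql
-- CREATE TABLE AS SELECT (CTAS): build a new table from a query result
CREATE TABLE dbo.top_customers
AS
SELECT customer_id, SUM(amount) AS total_amount
FROM dbo.sales
GROUP BY customer_id;
```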

Warehouse simplifies data interaction using T-SQL, though it currently has some limitations; for example, TRUNCATE TABLE is not supported, but the same result can be achieved with an unfiltered DELETE. These limitations may be addressed in future updates.
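A sketch of the workaround described above (the table name is hypothetical):

```sql
-- TRUNCATE TABLE is not currently supported in Fabric Warehouse;
-- a DELETE with no WHERE clause removes all rows instead
DELETE FROM dbo.staging_sales;
```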

Warehouse is suitable for SQL Analysts who are familiar with processing data predominantly using SQL.

Comparison

The primary distinction between Warehouse and Lakehouse lies in their interaction methods. While both systems store data in the Delta format, they differ in the features of their interactive engines.

Internally, both Lakehouse and Warehouse share a common processing architecture: the execution plan for a T-SQL query is passed down to backend compute pools, which interact with the data stored in OneLake.


A key benefit of Fabric is its ability to facilitate cross-queries between Lakehouse and Warehouse as if they were part of the same system. Lakehouse and Warehouse components can be combined with each other, allowing them to be queried collectively using T-SQL. This interoperability allows developers and users to process data based on their specific requirements without duplicating data across components.
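To illustrate cross-querying, a single T-SQL statement can join tables from a Warehouse and a Lakehouse in the same workspace using three-part names (the item, schema, and table names below are hypothetical):

```sql
-- Join a Warehouse table with a Lakehouse table in one query, without copying data
SELECT w.order_id,
       w.amount,
       l.customer_name
FROM   MyWarehouse.dbo.orders    AS w
JOIN   MyLakehouse.dbo.customers AS l
       ON w.customer_id = l.customer_id;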

Both Warehouse and Lakehouse are first-class objects in data pipelines and Gen 2 Dataflows, allowing both reading and writing. One notable difference is that Lakehouse does not support spaces or special characters in column names, whereas Warehouse does (though using spaces is not recommended, especially when interacting with Spark).

Lakehouse is directly supported for use with Spark, making it a first-class object. In contrast, Warehouse tables must be added as a shortcut in Lakehouse before they can be used with Spark.
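Once a Warehouse table has been added as a shortcut in a Lakehouse, Spark can query it like any other Lakehouse table. A minimal Spark SQL sketch, assuming a shortcut named `orders_shortcut`:

```sql
-- Read a Warehouse table through its Lakehouse shortcut from Spark
SELECT COUNT(*) AS order_count
FROM orders_shortcut;
```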

|  | Warehouse | Lakehouse |
| --- | --- | --- |
| Suitable for | Building an enterprise data warehouse | Handling big data workloads for Data Engineering and Data Science |
| Type of data | Structured | Structured and unstructured |
| Read through | T-SQL endpoint; Spark (SQL, Scala, Python, and R) using Lakehouse shortcuts; data pipelines; Gen 2 Dataflows; Power BI (Direct Lake) | T-SQL endpoint; Spark (SQL, Scala, Python, and R); data pipelines; Gen 2 Dataflows; Power BI (Direct Lake) |
| Write through | T-SQL; data pipelines; Gen 2 Dataflows | Spark (SQL, Scala, Python, and R); data pipelines; Gen 2 Dataflows; Eventstreams |
| Security | Object level, row level, column level, dynamic data masking | RLS and table level (only when using the SQL analytics endpoint) |
| Zero-copy cloning | Supported through T-SQL | Supported through Spark |
| Statistics and cache mechanism | Available | Available |
| Query Insights and DMVs for monitoring | Available | Available |
| Shortcuts data | Not directly supported; available through Lakehouse | Supported |

Conclusion

In conclusion, the comparison between Warehouse and Lakehouse underscores their similarities and differences, particularly in how users interact with each platform. Organizations can select the component that best aligns with their specific requirements, leveraging the ability to perform cross-queries for seamless data interoperability. This flexibility allows businesses to cater to diverse user needs—whether for SQL-based analytics or Spark-driven data engineering—without the need to duplicate data. By understanding the strengths of each platform, organizations can optimize their data workflows within Microsoft Fabric.

Follow us for more insights and updates on Microsoft Fabric.

Govindarajan D

Data Architect

Govindarajan D is a seasoned Data Architect, leveraging expertise in exploring Microsoft technologies. With a passion for innovative solutions, Govindarajan is not only skilled in crafting robust data architecture but also holds the title of Microsoft Certified Trainer (MCT), ensuring cutting-edge knowledge transfer and continuous learning within the tech community.
