In today’s data-driven world, managing large datasets efficiently is crucial for maintaining performance and ensuring seamless operations. One way to achieve this is through data partitioning. In this post, we’ll look at how to use partitioning to optimize performance and storage, using New York taxi data spanning 2016 to 2018 as an example.
What is Data Partitioning?
Data partitioning involves dividing large datasets into smaller, more manageable pieces. When dealing with millions of rows of data, partitioning by a key attribute—such as date—can significantly improve efficiency when querying and analyzing the data.
In this example, we’ll work with New York taxi data from 2016, 2017, and 2018. Each year contains millions of rows, making a partitioning strategy essential for maintaining performance.
The Traditional Approach
The traditional approach involves creating a dataflow, pulling data from the main source, and applying a date filter. For instance, filter the data to capture only the records from 2016. After fetching the data, send it to a lakehouse, which serves as the storage layer. Finally, append the data for 2016, 2017, and 2018 to create a comprehensive dataset.

While this method works, there’s a more efficient approach worth exploring.
A More Streamlined Method
The first step in the new approach is creating a data pipeline. Here’s how it works:
1. Set Up a Data Pipeline: In the workspace, create a new data pipeline using the “Copy Data” method. This approach allows for seamless data ingestion.

2. Partition by Date: Name the task “Taxi Data 2016” and select the data source. Use SQL to partition the data by date, pulling only the records from 2016 (see the example query after this list).

3. Store the Data: Once partitioned, send the data to the lakehouse, creating a new table named “NYC_Taxi_Data.”

4. Repeat for Subsequent Years: Repeat the same process for 2017 and 2018, adjusting the SQL queries accordingly.

This systematic approach ensures that data is ingested efficiently, year by year, and stored in the lakehouse.
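For reference, the partition query behind the “Taxi Data 2016” copy activity could look something like the sketch below. The source table name (nyc_taxi_trips) and the pickup-timestamp column (tpep_pickup_datetime) are assumptions for illustration; substitute whatever your source actually exposes.

```sql
-- Source query for the "Taxi Data 2016" copy activity (sketch).
-- Table and column names are assumptions; adjust them to your source schema.
SELECT *
FROM nyc_taxi_trips
WHERE tpep_pickup_datetime >= '2016-01-01'
  AND tpep_pickup_datetime <  '2017-01-01';
```

For the 2017 and 2018 activities, only the date boundaries change; all three activities write to the same “NYC_Taxi_Data” table, so each year’s load appends to the previous one.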
Avoiding Data Duplication
One potential issue with this method is data duplication. For example, when the pipeline runs again to append new data (such as 2019 or 2020), the records for 2016 through 2018 would be appended on top of the copies already in the table, duplicating them. To prevent this, add a final step: creating a notebook.
Creating a Notebook to Prevent Duplication
Here’s how to set up the notebook to ensure data integrity:
1. Create a Notebook: In the workspace, create a new notebook. The goal is to have the notebook automatically drop any existing table before starting the ingestion process.

2. Write the SQL Statement: Reference the lakehouse as the data source, then set the notebook’s language to Spark SQL. Now, you can write the SQL statement to drop the table if it already exists (a minimal version is shown after this list).

3. Incorporate the Notebook into the Pipeline: Add this notebook to the beginning of the pipeline to ensure that, before data ingestion starts, the old table is dropped. This prevents duplication during the appending process.
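Assuming the lakehouse is attached to the notebook and the table name matches the one created by the pipeline, the notebook can be a single Spark SQL cell along these lines:

```sql
-- Remove the existing table (if any) so the pipeline's appends start from a clean slate.
DROP TABLE IF EXISTS NYC_Taxi_Data;
```

Because of the IF EXISTS clause, the statement is a no-op when the table is absent, so the notebook is also safe to run on the very first execution of the pipeline.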
The Final Workflow
Once everything is set up, the pipeline works as follows:
- Step 1: The notebook logic searches for the existing table and drops it if it exists.
- Step 2: The pipeline ingests data for 2016, 2017, and 2018 sequentially.
- Step 3: The pipeline appends the data, giving a comprehensive view of the New York taxi data without duplicates (a quick verification query is sketched below).
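As a quick sanity check once the pipeline has finished, you can count rows per year in the lakehouse table; rerunning the pipeline should leave these counts unchanged rather than doubling them. As before, the pickup-timestamp column name is an assumption.

```sql
-- Row counts per year in the final table; repeated pipeline runs should not inflate these.
SELECT year(tpep_pickup_datetime) AS trip_year,
       COUNT(*) AS row_count
FROM NYC_Taxi_Data
GROUP BY year(tpep_pickup_datetime)
ORDER BY trip_year;
```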
Conclusion
Partitioning data by date and using a structured data pipeline helps improve efficiency. By adopting this streamlined approach, we can manage large datasets more effectively while maintaining performance.
Partitioning is an incredibly powerful technique, especially when working with massive datasets like taxi records. By ensuring efficient data management and preventing duplication, you can maintain the integrity of your data while also keeping processes smooth and optimized.