Polars makes it easy to work with datasets that are larger than RAM. In this post I show you, with some real-world data, how to do exactly that.

First, let’s download the data.

  1. Create a folder called nyc_taxi_data
  2. Download this file with the list of raw data URLs (raw_data_urls.txt)
  3. Run this command in the terminal: wget -i setup_files/raw_data_urls.txt -P data/ -w 2 (or use the Python sketch below)
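
If you prefer to stay in Python, here is a minimal sketch that does roughly what the wget command above does. It assumes the URL list lives at setup_files/raw_data_urls.txt and downloads each file into data/, pausing between downloads like wget -w 2.

import time
import urllib.request
from pathlib import Path

urls = Path("setup_files/raw_data_urls.txt").read_text().split()
out_dir = Path("data")
out_dir.mkdir(exist_ok=True)

for url in urls:
    target = out_dir / url.rsplit("/", 1)[-1]
    if not target.exists():  # skip files we already downloaded
        urllib.request.urlretrieve(url, target)
        time.sleep(2)  # be polite to the server, like wget -w 2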

Some comments on the data size:

  • There are 100 .parquet files in total, taking up 8.2GB of space.
  • I ran a small test on one of the files: as a CSV, each parquet file takes about 9.6 times more space (see the sketch below).
  • Therefore, in CSV format we can expect roughly 79GB of data.
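
Here is a minimal sketch of how you could reproduce that size comparison yourself. It assumes one of the downloaded files, data/fhv_tripdata_2018-12.parquet, is available; any of the files would do.

import polars as pl
from pathlib import Path

parquet_file = Path("data/fhv_tripdata_2018-12.parquet")  # any downloaded file works
csv_file = parquet_file.with_suffix(".csv")

# Round-trip one file to CSV and compare the sizes on disk
pl.read_parquet(parquet_file).write_csv(csv_file)
ratio = csv_file.stat().st_size / parquet_file.stat().st_size
print(f"CSV is {ratio:.1f}x larger than parquet")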

Now import the libraries. You will have to pip install them in your local environment. For example, pip install polars.

import polars as pl
import numpy as np
from pathlib import Path

Now define a file path using the pathlib library. In my case, the data is located in this folder: /mnt/yt_videos/nyc_taxi_data.

data_path = Path("/mnt/yt_videos/nyc_taxi_data")
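
As a quick sanity check, you can list the parquet files in that folder. This is just a sketch using pathlib's glob, nothing Polars-specific.

# How many parquet files did we actually download?
files = sorted(data_path.glob("*.parquet"))
print(f"{len(files)} parquet files")  # expecting 100
print(files[:3])                      # peek at the first few paths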

The data includes several parquet files. I will select one file to try out some ideas interactively. This single file fits in memory for me; if you get a memory error reading it, you likely need more RAM on your machine to follow along.

file = data_path / "fhv_tripdata_2018-12.parquet"

Now I’m going to read this file. It has 23.9 million rows and 7 columns.

df_sample = pl.read_parquet(file)
df_sample.head()
[Image: top rows of the sample file]
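
As a quick check of those numbers, you can also print the shape, the schema, and an estimate of the in-memory size. A small sketch; estimated_size is a Polars DataFrame method (the unit argument may not exist on very old versions), and the exact figure will vary.

print(df_sample.shape)                 # (rows, columns)
print(df_sample.schema)                # column names and dtypes
print(df_sample.estimated_size("gb"))  # rough in-memory size, in GB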

2. Compute an aggregation on the sample data

Before trying to work with all the data, we need to decide what we are going to do! To figure that out, I can start working with this sample file.

This would depend on the use case, but the idea is to try a few ideas interactively until we know what we want to do with the complete dataset.

After some experimentation I decided to compute the count of taxi rides by pickup_datetime and dispatching_base_num.

This aggregation took 74ms on the sample dataset. Not bad!

agg = (
    df_sample
    .with_columns(
        pl.col("pickup_datetime").dt.date()  # truncate timestamps to dates
    )
    .groupby(["pickup_datetime", "dispatching_base_num"])
    .agg(
        pl.count()  # number of rides per date and dispatching base
    )
    .sort(by=["pickup_datetime", "count"])
)
agg.head()
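
If you want to check the 74ms figure on your own machine, here is a rough timing sketch using the standard library; the exact numbers will of course vary with hardware.

import time

start = time.perf_counter()
agg = (
    df_sample
    .with_columns(pl.col("pickup_datetime").dt.date())
    .groupby(["pickup_datetime", "dispatching_base_num"])
    .agg(pl.count())
    .sort(by=["pickup_datetime", "count"])
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"aggregation took {elapsed_ms:.0f} ms")  # compare with the ~74ms reported above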

3. Run the aggregation on all the parquet files

Now that I know what I want to do with this data, I can use the lazy mode and aggregate on all the data. This took 32 seconds!

The only changes compared to the previous code snippet are replacing df_sample with a pl.scan_parquet call and adding .collect(streaming=True) at the end.

pl.scan_parquet gives us a lazy query, and the streaming=True argument tells Polars to execute that query in streaming mode. From the polars docs: “When a lazy query is executed in streaming mode Polars processes the dataset in batches rather than all-at-once. This can allow Polars to process datasets that are larger-than-memory.”
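
Before collecting, you can also print the query plan to confirm that nothing is read eagerly. This is just a sketch; depending on your Polars version the method is LazyFrame.explain() or the older describe_optimized_plan().

# Build the lazy query; no data is read from disk at this point
lazy_query = (
    pl.scan_parquet(data_path / "*.parquet")
    .with_columns(pl.col("pickup_datetime").dt.date())
    .groupby(["pickup_datetime", "dispatching_base_num"])
    .agg(pl.count())
)
print(lazy_query.explain())  # on older Polars: lazy_query.describe_optimized_plan()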

agg_all = (
    pl.scan_parquet(data_path / "*.parquet")  # lazily scan all 100 parquet files
    .with_columns(
        pl.col("pickup_datetime").dt.date()
    )
    .groupby(["pickup_datetime", "dispatching_base_num"])
    .agg(
        pl.count()
    )
    .sort(by=["pickup_datetime", "count"])
    .collect(streaming=True)  # execute the query in streaming mode, batch by batch
)
agg_all.head()
[Image: top rows of the aggregated result]

After doing this aggregation we can convert the results to pandas to “finish” the data analysis. Perhaps we want to visualize the data, and that is still just easier with pandas at the moment.

agg_date = (
    agg_all
    .to_pandas()
    .groupby("pickup_datetime")
    ["count"]
    .sum()
)
agg_date[:"2019-01-01"].plot()
[Image: plot of the daily ride counts]
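
Finally, if you want to keep the aggregated result around for later analysis (it is tiny compared to the raw data), you could also write it back to disk. A minimal sketch; the output file name is just an example.

# Persist the small aggregated result next to the raw data (file name is illustrative)
agg_all.write_parquet(data_path / "rides_by_date_and_base.parquet")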

And that’s a wrap! We were able to aggregate a fairly large dataset (roughly 80GB uncompressed) in 32 seconds by changing just two lines of code.

I show how this works in a bit more detail in this video tutorial: 👇️👇️👇️

Thanks for reading!

