In this post I want to show how to read a CSV file and benchmark the different methods available in pandas and Polars.

Let’s use the flights dataset, which is ~500MB, so it’s fairly realistic for a benchmark. It’s still on the small end of the spectrum, but it is large enough to give a meaningful comparison.

import pandas as pd
import polars as pl

# you can download this file locally to avoid any network delays
file = "https://raw.githubusercontent.com/martinbel/datasets/master/flights.csv"
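Because file points at a URL, every read also pays for the download. A minimal sketch of caching the file locally first, using only the standard library; the helper name local_copy and the data directory are hypothetical and not part of the original benchmark:

```python
import urllib.request
from pathlib import Path


def local_copy(url: str, cache_dir: str = "data") -> Path:
    """Return a local path for `url`, downloading it only if it is not cached yet."""
    # Use the last URL segment as the file name, e.g. data/flights.csv
    target = Path(cache_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))  # one-time download
    return target
```

With this in place, file = local_copy(file) could be passed to the readers below, so the timings would measure parsing only.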

1. Pandas: pd.read_csv with engine c

This method takes 9.66 seconds to read the CSV file.

df = pd.read_csv(file, sep=',', engine='c')
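The post does not show how the timings were taken. One way to get numbers like these is a small helper around time.perf_counter; this is only a sketch, and the name time_call and the repeat count are my assumptions rather than the setup used for the figures above:

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload so the sketch runs anywhere; swap in a real reader,
# e.g. lambda: pd.read_csv(file, engine="c").
durations = time_call(lambda: sum(range(100_000)), repeats=5)
print(f"mean {statistics.mean(durations):.6f}s, min {min(durations):.6f}s")
```

Repeating the call and looking at both the mean and the minimum helps separate the reader's cost from one-off noise such as a cold disk cache.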

2. Pandas: with engine pyarrow

This method takes 1.32 seconds to read the CSV file. That is a lot faster than engine="c", but it doesn’t always work: in my experience this engine often raises errors.

df = pd.read_csv(file, engine="pyarrow")
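Since the pyarrow engine can fail, one option is to try it first and fall back to the C engine. A minimal sketch; read_csv_with_fallback is a hypothetical helper, and catching every Exception is a deliberate simplification because the errors can come from either pandas or pyarrow:

```python
import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Try the pyarrow engine first; fall back to the C engine if it fails."""
    try:
        return pd.read_csv(path, engine="pyarrow", **kwargs)
    except Exception as exc:  # broad on purpose (assumption): pyarrow may be missing or reject the file/options
        print(f"pyarrow engine failed ({type(exc).__name__}); retrying with engine='c'")
        return pd.read_csv(path, engine="c", **kwargs)
```

The trade-off is that a failing file is read twice, so the fallback path is slower than calling engine="c" directly.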

3. Polars: pl.read_csv call

This method takes 915 ms, so it’s faster than the engine="pyarrow" method in pandas, and I generally don’t get errors when using it.

df = pl.read_csv(file)

4. Read with Polars & convert to pandas

If I read the CSV file with Polars and convert the result to pandas with .to_pandas(), this takes around 1.6 seconds on average.

df = pl.read_csv(file).to_pandas()

Conclusion

Here are the results in relative terms: the c-engine method takes about 10 times as long as the Polars method.
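The relative figures can be recomputed from the timings reported in this post. A short sketch; the dictionary labels are mine, the numbers are the ones quoted above:

```python
# Timings reported in this post, in seconds.
timings = {
    "pandas, engine='c'": 9.66,
    "pandas, engine='pyarrow'": 1.32,
    "polars": 0.915,
    "polars + .to_pandas()": 1.6,
}

# Express every method as a multiple of the fastest one.
fastest = min(timings.values())
relative = {name: seconds / fastest for name, seconds in timings.items()}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:<26} {ratio:5.1f}x")
```

Even with the conversion back to pandas, the Polars route stays several times faster than the default C engine.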

If you are interested in learning more about data science, check out my YouTube channel!

