There are multiple ways to filter rows in pandas, so let’s take a look at some examples. Each has its merits depending on the use case, and it’s one of those cases where it pays to know all of them.

Let’s first read some data. I’ll be using the diamonds dataset.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/martinbel/datasets/master/diamonds.csv")
df.head()
Diamonds dataset top-5 rows

Method 1: Filter on a single logical condition

Here I filter the rows where the cut is equal to “Premium”. I like this method to do a quick and dirty filter on the data. But if I’m going to do more filtering, I’ll use other methods.

df[df.cut == 'Premium']

Method 2: Filter using a list

This method is useful when you have a list of values and you just want to keep the rows with the matching values.

Here I’m filtering the rows where cut is Ideal or Premium.

df[df.cut.isin(['Ideal', "Premium"])]
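To make it clearer what isin does, here is a minimal sketch on a tiny made-up DataFrame (the toy data and the variable names are assumptions, not part of the diamonds dataset). It shows that isin returns a boolean mask, which is then used to select rows. The ~ operator, which inverts the mask, is an extra that the example above does not use.

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration)
toy = pd.DataFrame({"cut": ["Premium", "Ideal", "Good", "Premium"]})

keep_cuts = ["Ideal", "Premium"]

# isin returns a boolean Series: True where the value is in the list
mask = toy.cut.isin(keep_cuts)
print(mask.tolist())  # [True, True, False, True]

kept = toy[mask]      # rows whose cut is in the list
dropped = toy[~mask]  # ~ inverts the mask: rows whose cut is NOT in the list
```

Keeping the list in a variable such as keep_cuts makes it easy to reuse the same set of values in several filters.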

Method 3: Multiple boolean conditions

It’s possible to combine multiple boolean conditions using the logical operators & and |. However, each expression needs to be wrapped in ( ), because & and | bind more tightly than comparison operators such as ==. This makes the code less readable in my opinion.

In case you don’t know, the & operator means “and” while | means “or”.

df[(df.cut == 'Premium') & (df.clarity == 'SI2')]
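The difference between & and | is easier to see on a small example. The sketch below uses a made-up four-row DataFrame (an assumption for illustration, not the real diamonds data) and applies the same two conditions with each operator.

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration)
toy = pd.DataFrame({
    "cut": ["Premium", "Ideal", "Good", "Premium"],
    "clarity": ["SI2", "SI2", "VS1", "VS1"],
})

# "and": both conditions must hold for a row to be kept
both = toy[(toy.cut == "Premium") & (toy.clarity == "SI2")]

# "or": a row is kept if at least one condition holds
either = toy[(toy.cut == "Premium") | (toy.clarity == "SI2")]

print(both.index.tolist())    # [0]
print(either.index.tolist())  # [0, 1, 3]
```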

Method 4: Most flexible method

The df.query method is the most flexible and concise method to filter a dataframe.

It allows combining “isin”-style expressions with simple == expressions. The advantage is that you can refer to the column names directly, without the DataFrame name, and you don’t need parentheses to separate each expression. Python variables are referenced with the @ prefix, and comparing a column to a list with == behaves like isin.

keep_cuts = ['Ideal', "Premium"]
df.query("cut == @keep_cuts & clarity == 'SI2'")
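As a quick sanity check, the sketch below compares the query string with the equivalent boolean-mask filter from the earlier methods. It runs on a small made-up DataFrame (an assumption for illustration) so that it works without downloading anything.

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration)
toy = pd.DataFrame({
    "cut": ["Premium", "Ideal", "Good", "Premium"],
    "clarity": ["SI2", "SI2", "VS1", "VS1"],
})

keep_cuts = ["Ideal", "Premium"]

# query: @keep_cuts refers to the Python variable defined above
via_query = toy.query("cut == @keep_cuts & clarity == 'SI2'")

# the same filter written with isin and an explicit boolean mask
via_mask = toy[toy.cut.isin(keep_cuts) & (toy.clarity == "SI2")]

print(via_query.equals(via_mask))  # True
```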

Other methods:

It’s also possible to use the df.loc and df.iloc indexers to filter a DataFrame. However, I think it’s better practice to use the methods I described above. When a boolean mask is applied, the rows at the integer positions where the mask is True are the ones that are kept.

This is equivalent to passing those integer positions to df.iloc. Both of the calls below return the same result.

import numpy as np

# using a list
df.iloc[[1, 2, 3]]

# using an np.array
df.iloc[np.array([1, 2, 3])]
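To connect the boolean mask with the integer positions, here is a minimal sketch on a made-up DataFrame (the toy data is an assumption for illustration). It converts a mask into positions with np.flatnonzero and checks that plain indexing, df.loc and df.iloc all select the same rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration)
toy = pd.DataFrame({"cut": ["Premium", "Ideal", "Good", "Premium"]})

mask = toy.cut == "Premium"                   # boolean Series
positions = np.flatnonzero(mask.to_numpy())   # integer positions of the True values
print(positions.tolist())  # [0, 3]

by_mask = toy[mask]            # boolean indexing
by_loc = toy.loc[mask]         # same mask passed to .loc
by_iloc = toy.iloc[positions]  # integer positions passed to .iloc

print(by_mask.equals(by_loc) and by_mask.equals(by_iloc))  # True
```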

I hope you enjoyed this video. Check out this YouTube video with the top-10 pandas tips & tricks I use for my data science work.

