There are at least 4 methods to do this in pandas but there are only 2 that are simple and efficient. The best solution involves sorting the pandas DataFrame and then selecting the top-n results.

import pandas as pd

# Read the data
url = "https://raw.githubusercontent.com/martinbel/datasets/master/mortality_data.csv"
df_mortality = pd.read_csv(url)
df_mortality.head()
Data Overview

On this concrete dataset, I might want to answer the question: Which are the top-5 countries with higher mortality each year?

In this case I want to get the top-5 values (highest mortality) by year.

Method 1: Sorting and groupby + nth

This is in my opinion the simplest, clearest and best method to compute top-n or smallest-n calculations within a group.

The idea is you first sort by year (categorical variable) and value (numeric variable). Then you groupby year and keep the top-n (last-n) rows for each group.

It’s the most intuitive method and one of the fastest (2.05 ms).

(df_mortality
 .sort_values(by=["year", "value"], 
     ascending=[True, False])
 .groupby(["year"], as_index=False).nth[:5]
).head(15)
Top-n results by year

Method 2: Compute a rank, then filter

This method is also intuitive, we first compute a rank and then filter rows. Simple and effective. The only drawback of this method is the results for each year are not sorted.

This might be ok in some cases. But it’s possible a sorting step might be needed in order to avoid bugs.

This method is the fastest one (1.35 ms) but is sort of incomplete.

# 1. compute a ranking
df_mortality['rank'] = (
    df_mortality
    .groupby("year")['value']
    .rank(method='first', ascending=False)
)
# 2. Filter rows with rank >
df_mort = df_mortality[df_mortality['rank'] <= 5]

# show top rows
df_mort.head(15)

Method 3: groupby + apply (sort_values + head)

These methods are slower but I include them for completeness. I also think the output they provide is less clean than the previous methods.

Here I’m grouping by year and then calling sort_values + head inside of apply. what happens inside of the “apply” call is done at the group level

(df_mortality
 .groupby('year')
 [['country_name', 'value']]
 .apply(lambda x : x.sort_values(by='value', ascending=False).head(5).reset_index(drop=True))
).head(15)
Results

Method 4: groupby + apply (nlargest)

Here I’m grouping by year and then calling nlargest inside of apply.

(df_mortality
 .groupby('year')
 [['value', 'country_name']]
 .apply(lambda x: x.nlargest(5, "value"))
).head(15)

I hope you enjoyed this post! If you enjoyed this post, check out my youtube channel where I cover these topics.


Leave a Reply

Your email address will not be published. Required fields are marked *