In this post I cover how to make histograms with the most popular data visualization libraries in Python: the Pandas .plot method, Matplotlib, Seaborn, Plotly Express and Plotnine.

I’ll demonstrate how to use each library to make a histogram and cover the “small multiple” or faceted histogram use case. A faceted histogram plots the distribution of a numeric variable separately for each value of a categorical variable, which makes it a very useful technique for comparing groups.

import pandas as pd
import numpy as np

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from plotnine import *

# Change the matplotlib default style to seaborn
# You need to do this once
# The style was renamed in matplotlib 3.6; on older versions use 'seaborn'
plt.style.use('seaborn-v0_8')

pd.options.display.max_columns = 500

# Read the data and keep info of 9 states
df = pd.read_csv("https://raw.githubusercontent.com/martinbel/datasets/master/unemployment.csv")
keep_states = ['SC', 'CA', 'FL', 'NY', 'WI', 'WA', 'NJ', 'IL', 'TX']
df = df.query('state == @keep_states')

# show top 3 rows of each state
df.groupby("state").head(3).head(9)
Example Top 3 rows by State

1. Pandas .plot()

Pandas is a great option when you need a simple histogram. I use it a lot for quick plots when doing EDA on one variable.

Here is an example:

df.unemployment.hist(bins=30);
Unemployment Distribution – Histogram (pandas)

This shows the distribution of the unemployment rate where each observation corresponds to a month-state pair.
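To make the binning concrete, here is a minimal sketch of roughly what a histogram computes before anything is drawn. It uses NumPy's np.histogram on made-up values (the `rates` array is hypothetical, not the unemployment data):

```python
import numpy as np

# Made-up values standing in for monthly unemployment rates
rng = np.random.default_rng(42)
rates = rng.gamma(shape=4.0, scale=1.5, size=1_000)

# Split the observed range into 30 equal-width bins and count the values in each
counts, edges = np.histogram(rates, bins=30)

# Print the first five bins as [left, right) intervals with their counts
for left, right, count in zip(edges[:5], edges[1:6], counts[:5]):
    print(f"[{left:5.2f}, {right:5.2f}): {count}")
```

Every observation lands in exactly one bin, so the counts add up to the number of rows, and 30 bins are described by 31 edges.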

However, we might be interested in plotting the distribution of each state separately. This technique is called a “small multiples” or “faceted” chart.

It’s also very simple to do this in Pandas and the result is quite reasonable considering how much effort making this plot took.

df.hist(column='unemployment', by='state', bins=20);
plt.tight_layout()
Small Multiples Histogram – Pandas

For quick plots, I think this is a fairly decent choice. However, if we need more customization there are other options.
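One customization pandas does support directly is sharing the axes across panels, which makes the groups easier to compare. This is a minimal sketch that assumes made-up data in place of the unemployment file; the `demo` frame and its values are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the unemployment data (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "state": np.repeat(["CA", "NY", "TX", "WA"], 120),
    "unemployment": rng.gamma(shape=4.0, scale=1.5, size=480),
})

# One panel per state, arranged in a 2x2 grid with common x and y scales
axes = demo.hist(column="unemployment", by="state", bins=20,
                 sharex=True, sharey=True, layout=(2, 2), figsize=(8, 6))
plt.tight_layout()
```

With shared scales a taller bar always means more observations, at the cost of squashing panels for states with few extreme values.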

2. Matplotlib

In this example I’m going to add more customization to the plot we made with pandas. Matplotlib allows adding annotations and reference lines, but it involves writing more code.

This would be a good choice if we need to present this plot in a professional setting where we need to add multiple custom components.

Just to demonstrate this use case, I’m adding a vertical line with the median to each histogram.

group_values = list(df.state.unique())

# set number of columns in the plot
ncols = 3

# calculate number of rows in the plot
nrows = len(group_values) // ncols + (len(group_values) % ncols > 0)

# Define the plot 
plt.figure(figsize = (9, 9))
plt.subplots_adjust(hspace=0.25)
plt.suptitle("Unemployment Rate by State", fontsize=16, y=0.95)

for n, col in enumerate(group_values):
    # add a new subplot at each iteration using nrows and cols
    ax = plt.subplot(nrows, ncols, n + 1)
    
    # Filter the dataframe data for each state
    df_temp = df.query("state == @col")
    df_temp.unemployment.hist(ax=ax, bins=30)
    
    # Add a vertical line at the median
    median_x = df_temp.unemployment.median()
    _ = ax.vlines(x=[median_x], ymin=0, ymax=70, colors=['r']);
    
    # Label the line on the current subplot
    ax.text(median_x, 70, 'Median')    

    # chart formatting
    ax.set_title(col)
    ax.set_xlabel("")
Histogram – Matplotlib

The positive side of this option is you are in full control. However, this might be too much code if we are just exploring the data.
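The loop above hardcodes the line height at 70, which only fits this particular dataset. As an alternative sketch (not the method used above), ax.axvline spans the full height of each panel regardless of the counts. The `demo` frame is made-up data, and the grid is built with plt.subplots instead of plt.subplot:

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the unemployment data (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "state": np.repeat(["CA", "NY", "TX", "WA"], 100),
    "unemployment": rng.gamma(shape=4.0, scale=1.5, size=400),
})

states = sorted(demo["state"].unique())
ncols = 3
nrows = math.ceil(len(states) / ncols)  # round up so every state gets a panel

# squeeze=False always returns a 2D array of axes, even for a single row
fig, axes = plt.subplots(nrows, ncols, figsize=(9, 3 * nrows), squeeze=False)

for ax, state in zip(axes.ravel(), states):
    values = demo.loc[demo["state"] == state, "unemployment"]
    ax.hist(values, bins=30)

    # axvline covers the whole y-range, so no ymax needs to be guessed
    median_x = values.median()
    ax.axvline(median_x, color="r")

    # x is in data coordinates, y in axes coordinates (0 = bottom, 1 = top)
    ax.text(median_x, 0.95, " Median",
            transform=ax.get_xaxis_transform(), va="top")
    ax.set_title(state)

# Hide the panels left over when the grid has more cells than states
for ax in axes.ravel()[len(states):]:
    ax.set_visible(False)

fig.tight_layout()
```

Placing the label in axes coordinates keeps it near the top of every panel even when the bar heights differ a lot between states.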

3. Seaborn

Seaborn is a great trade-off between simplicity and advanced features. It makes it very easy to choose whether the subplots share the x or y axis scales.

It’s also easy to add a kernel density estimate, and the API is similar across plot types, which makes it very intuitive to learn!

Basically, the plotting functions in seaborn are named sns.*plot. In this case we will use sns.histplot().

sns.set(style='darkgrid')

g = sns.FacetGrid(df, 
                  col='state',                # facet col variable
                  col_wrap=3,                 # define nbr of subplots per row
                  sharex=False, sharey=False   # Define which axes are shared
                 )
g.map(sns.histplot, 
      'unemployment', 
      kde=True,
      binwidth=0.5             # Width of each bin
     )
Histogram – Seaborn

There is one more plot that is very easy to make with seaborn: sns.kdeplot(). It draws the kernel density estimates of all groups in a single chart, which can be useful if you have a small number of groups.

sns.kdeplot(data=df, x='unemployment', hue='state')
Seaborn – Density Plot

I just wanted to include this plot for completeness. In this case, I think there are too many states. But if you are comparing two or three groups, I think it’s a valid alternative.

4. Plotly-Express

Plotly Express also offers a great trade-off between simplicity and advanced features. It is as simple as Seaborn, but the plots are interactive.

This can be very useful if you are planning to develop an interactive web application.

px.histogram(df, 
             x='unemployment',   # numeric variable
             facet_col='state',  # facet variable
             facet_col_wrap=3,   # nbr plots per row
             histnorm='probability', # optional
             nbins=50                # optional  
            )
Histogram – Plotly Express

Considering how easy it is to make this plot, the fact that it comes with interactivity “for free” is a great plus.

Plotly Express is one of my favorite data visualization libraries in Python. But there are cases when you need more customization and you are better off using matplotlib or the plotly graph objects API.

5. Plotnine

Plotnine is a ggplot2 port for Python. It’s a declarative data visualization library where you add layers to the plot as if you were writing a recipe.

It’s based on ggplot2, so it comes with its own philosophy of how the data should be set up before using it (tidy data), but once you understand the core ideas it’s really amazing.

(ggplot(df, aes(x='unemployment')) + 
 geom_histogram() +
 facet_wrap("state", scales='free_y')
)
Histograms – Plotnine

If you want to add titles and control the figure size, you need to add a few more lines of code.

(ggplot(df, aes(x='unemployment')) + 
 geom_histogram() +
 facet_wrap("state", scales='free_y') +
 theme(figure_size=(8, 8)) +
 xlab("Unemployment") +
 ggtitle("Histogram of Unemployment by State")
)

Conclusion

That covers how to make a histogram using the most popular data visualization libraries in Python.

Each one has its strengths and weaknesses, and I think it’s good to be familiar with all of them. The idea is to use the best tool for the job, and for data visualization in Python that means knowing multiple libraries.

Did I mention I have a YouTube channel where I cover data science topics? In this video I explain in more detail what I covered in this post.

