Why is plotly-express so great?

Plotly Express is a high-level Python library for data visualization. It’s user-friendly and generates interactive plots that can be shared in a notebook or a web application.

I think it’s one of the best Python tools for efficient Exploratory Data Analysis (EDA). During EDA we need to try many ideas quickly, so we should only have to specify the type of plot we want and let the library take care of the small details.

Even if you are just starting out, I’d encourage you to learn the most important plotly-express concepts before learning matplotlib or seaborn.

Top 5 plotly-express Tips & Tricks

Let’s get started by reading in some data. Here I’ll be using the datasets that are part of the seaborn library.

import pandas as pd
import plotly.express as px

from seaborn import load_dataset
pd.set_option('display.max_columns', None)

# Read the diamonds and taxis datasets
diamonds = load_dataset("diamonds")
taxis = load_dataset("taxis")

# print the top rows of the diamonds dataset
diamonds.head()
Diamonds Dataset
taxis.head()
Taxi Rides Dataset

1. Boxplot: Relation of categorical and numeric variables

The boxplot summarizes the distribution of a numeric variable. Check out the Wikipedia page on box plots to understand how to interpret each boxplot in detail.
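To make that summary concrete, here is a minimal sketch of the quantities a boxplot is built from, computed with pandas on a made-up list of prices (the numbers are illustrative, not taken from the diamonds dataset). It assumes the common 1.5 × IQR rule for flagging outliers; the exact quartile method used by a plotting library may differ slightly from the pandas default.

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([300, 500, 700, 900, 1200, 1500, 2500, 9000], name="price")

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are usually drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers.tolist())
```

In this toy sample only the 9000 price falls outside the fences, so it is the one point that would be drawn separately from the whiskers.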

I generally use boxplots to compare the distribution of a numeric variable (here, the price of the diamonds) across the levels of a categorical variable (in this case, the diamond color).

This plot is very powerful: in one graph you can compare the price distribution for every color.

px.box(diamonds, x='color', y='price', color='color')
Box-plot

One small detail is missing. Sorting the x-axis categories by their median would make the plot easier to read, and this matters even more when the categorical variable has many values.

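# compute the median price by color and use it to order the boxplots
def sort_categorical_var(df, catvar, numvar, ascending=False):
    # median of numvar within each category (observed=True skips unused categories)
    medians = df.groupby(catvar, observed=True)[numvar].median()
    # category_orders expects {column name: [category values in display order]}
    ordered = medians.sort_values(ascending=ascending).index.tolist()
    return {catvar: ordered}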


group_agg = sort_categorical_var(diamonds, "color", "price")

# Box-plot call
px.box(diamonds, 
       x='color', y='price', color='color',         # same as above
       category_orders=group_agg,                   # pass order of x-axis
       title = 'Diamond Prices by color'            # add a title
      )
Box-Plot

This plot is a lot easier to read. Now we can compare the distributions from the color with the highest median price (J) down to the one with the lowest (E).

I sorted by the median because the median is already drawn inside each box.
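For readers who want to see what the category_orders argument looks like, here is a small sketch that applies the same median-sorting idea to a toy DataFrame. The data is invented, and the helper is a compact restatement of sort_categorical_var from above, not a new API.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "color": ["D", "D", "E", "E", "J", "J"],
    "price": [100, 300, 50, 150, 900, 1100],
})

def sort_categorical_var(df, catvar, numvar, ascending=False):
    # Median of numvar per category, sorted, returned in the
    # {column: [ordered categories]} shape that category_orders expects
    medians = df.groupby(catvar)[numvar].median()
    return {catvar: list(medians.sort_values(ascending=ascending).index)}

order = sort_categorical_var(toy, "color", "price")
print(order)
```

Here the medians are 200 (D), 100 (E) and 1000 (J), so the resulting order is J, D, E.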

2. Small Multiples – Histogram

Small multiples of histograms serve a similar purpose to the boxplot: we draw one histogram of price for each color.

# I use an independent scale for each plot
fig = px.histogram(diamonds, 
             x='price', 
             facet_col='color',
             facet_col_wrap=4,
             category_orders=group_agg,
             title = 'Diamond Prices by color'            # add a title
      )

fig.update_yaxes(matches=None, showticklabels=True)
fig.show()
Price Histogram – by color

There is a statistical interpretation of this plot: we can think of each histogram as the distribution of price given the color.
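This conditional-distribution reading can be checked numerically. The sketch below, on invented data with arbitrarily chosen price bins, computes the share of observations that fall in each bin within each color, which is what each facet’s histogram shows up to scaling.

```python
import pandas as pd

# Invented data, for illustration only
toy = pd.DataFrame({
    "color": ["D", "D", "D", "D", "J", "J", "J", "J"],
    "price": [400, 600, 800, 3000, 900, 2500, 4000, 6000],
})

# Arbitrary bins, chosen only for this example
price_bin = pd.cut(toy["price"], bins=[0, 1000, 5000, 10000])

# normalize="index" makes each row sum to 1: P(price bin | color)
cond_dist = pd.crosstab(toy["color"], price_bin, normalize="index")
print(cond_dist)
```

Each row of cond_dist is one conditional distribution, so comparing rows is the tabular counterpart of comparing the facets.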

3. Small multiples & Scatter plots

This type of plot also has a statistical interpretation. We can think of the scatter plot as a linear model where net_fare (y-axis) is the dependent variable and distance (x-axis) is the independent variable.

Small multiples let me bring a categorical variable into the model: we are now plotting the relation between two numeric variables conditioned on a categorical variable.

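# trendline='ols' needs the statsmodels package installed
# assumes taxis already has a net_fare column (derive it first if your copy lacks one)
fig = px.scatter(taxis,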
                 x='distance', 
                 y='net_fare',
                 facet_col='pickup_borough',
                 trendline = 'ols',
                 title = "Total Fare explained by Distance & Borough (district)"
)
fig

small multiple – scatter plot
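As a rough illustration of the linear-model reading, the sketch below fits one straight line per group with numpy.polyfit on invented data. It is a stand-in for what a per-facet OLS trendline estimates, not the code plotly runs internally; the column names simply mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Invented rides: borough A follows net_fare = 2 * distance + 3,
# borough B follows net_fare = 3 * distance + 1
toy = pd.DataFrame({
    "pickup_borough": ["A"] * 4 + ["B"] * 4,
    "distance": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "net_fare": [5.0, 7.0, 9.0, 11.0, 4.0, 7.0, 10.0, 13.0],
})

# One least-squares line per borough, mirroring one trendline per facet
fits = {}
for borough, grp in toy.groupby("pickup_borough"):
    slope, intercept = np.polyfit(grp["distance"], grp["net_fare"], deg=1)
    fits[borough] = (slope, intercept)

for borough, (slope, intercept) in fits.items():
    print(f"{borough}: net_fare = {slope:.2f} * distance + {intercept:.2f}")
```

Comparing the slopes across groups is the numeric version of comparing how steep the trendline is in each panel.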

4. Small multiples (facet_col and facet_row)

I wanted to show this example for completeness. In plotly-express you can define a matrix of plots using facet_col and facet_row.

fig = px.scatter(taxis, 
                 x='distance', 
                 y='net_fare',
                 facet_col='pickup_borough',
                 facet_row='color',
                 trendline = 'ols',
                 title = "Total Fare explained by Distance, Borough (district) & Taxi Color"
)
fig
small multiple – scatter plot

In this plot, the response (net_fare) is being explained by distance given the facet_col (pickup_borough) variable and the facet_row (color) variable.
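With two facet variables, some cells of the grid can end up with very few points, which makes their trendlines unreliable. One way to check before plotting, sketched here on invented data, is to count the rows in each (facet_row, facet_col) cell with pandas.crosstab.

```python
import pandas as pd

# Invented rides, for illustration only
toy = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens", "Bronx"],
    "color": ["yellow", "yellow", "green", "yellow", "green", "green"],
})

# Rows play the role of facet_row, columns the role of facet_col
cell_counts = pd.crosstab(toy["color"], toy["pickup_borough"])
print(cell_counts)

# Cells with zero rows would show up as empty panels
empty_cells = int((cell_counts == 0).sum().sum())
print("empty cells:", empty_cells)
```

In this toy table the yellow/Bronx cell is empty, so that panel of the grid would contain no points.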

5. Small Multiples & Line Plots

I’ll start by computing the number of taxi rides by date with pandas.

Here I’m using groupby and size to count the number of observations by “date” and “pickup_borough”.

# extract the date of the trip
taxis['date'] = taxis.pickup.dt.date

# compute the number of rides by date and borough
count_rides = (
    taxis
    .groupby(["date", "pickup_borough"])
    .size()
    .reset_index(name='count')
    .rename(columns={"pickup_borough":"borough"})
)
count_rides.head()
Number of Taxi Rides

Another interesting use case for small multiples is time series data. There is a limit to the number of lines that can be read comfortably in a single plot, and this is where small multiples help.

fig = px.line(count_rides, 
              x='date', y='count', 
              color='borough',
              facet_row='borough', 
              title='Number of Taxi rides by Pick-up Borough (district)'
             )

# makes the y-axis free for each sub-plot
fig.update_yaxes(matches=None, showticklabels=True)
fig.show()

I hope you have enjoyed this plotly-express tutorial! It’s one of my favorite tools for data visualization in Python.

Check out this video tutorial, where I cover the content of this post in a bit more detail.

