There are multiple ways to detect outliers. In this post I’ll cover two simple methods based on statistics and a third method where I take into account the time series nature of the data.

import pandas as pd
import numpy as np

# Read the data
url = "https://raw.githubusercontent.com/martinbel/datasets/master/anomaly_detection.csv"

df = pd.read_csv(url, 
                parse_dates=['timestamp'], 
                index_col=['timestamp'])
df.columns = ['y']

df.head()
Top Rows

Let’s plot the empirical distribution using a histogram. For this type of univariate plot I generally use the pandas .plot() method.

df.y.hist(bins=30)

Method 1: Simplest approach

In this method I use the mean to define the center of the distribution and two times the standard deviation to set the limits: any value further than two standard deviations from the mean is flagged as an outlier.
This is by far the simplest way to detect outliers. However, it only works well if the data is close to a normal distribution.

# compute the statistics
avg = df.y.mean()
std_dev = df.y.std()

# Define inferior and superior limits for outliers
threshold = 2
lim_inferior = avg - threshold * std_dev
lim_superior = avg + threshold * std_dev

# use np.where to flag outliers
df['fl_outlier_m1'] = np.where(df.y > lim_superior, 1,
                               np.where(df.y < lim_inferior, 1, 0))

# plot the distribution with the inferior and superior limits
ax = df.y.hist(bins=30)
_ = ax.vlines(x=[lim_inferior, lim_superior], ymin=0, ymax=2000, colors='r')
ax.set_title("Outlier detection: Method 1");
Histogram – Method 1

For such a simple method we get a fairly reasonable result. It’s able to capture the negative outliers (left side of the histogram) but it’s too aggressive on the right side of the distribution.

This could be fixed by using a different threshold on the right side. However, this is something we will probably want to automate.
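As a minimal sketch of what a different threshold on each side could look like, the helper below takes one multiplier per tail. The function name `flag_outliers_asymmetric`, the parameters `lower_k` and `upper_k`, and the toy series are all made up for illustration; they are not part of the dataset used in this post.

```python
import pandas as pd

def flag_outliers_asymmetric(y, lower_k=2.0, upper_k=3.0):
    """Flag values outside [mean - lower_k * std, mean + upper_k * std]."""
    avg = y.mean()
    std_dev = y.std()
    lim_inferior = avg - lower_k * std_dev
    lim_superior = avg + upper_k * std_dev
    # 1 where the value falls outside either limit, 0 otherwise
    return ((y < lim_inferior) | (y > lim_superior)).astype(int)

# Toy data: twenty "normal" points plus one high and one low value
y = pd.Series([10.0] * 20 + [16.0, 4.0])

# A wider limit on the right side keeps 16.0, the tighter left limit flags 4.0
flags = flag_outliers_asymmetric(y, lower_k=2.0, upper_k=3.5)
print(flags.tail(2))
```

The multipliers are still chosen by hand here, so this only moves the problem; it does not automate the choice.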

Method 2: Simple method based on robust statistics

This method works better when the data contains a few very extreme outliers. It is a more conservative approach, but it works well in many real-life situations.
The advantage of using the median and interquartile range (IQR) to detect outliers is that neither statistic is much affected by a small number of outliers.
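To see why this matters, here is a small illustration on made-up numbers (not the post's dataset): replacing a single value with an extreme one moves the mean and standard deviation a lot, while the median and IQR stay where they were. The `summary` helper is hypothetical and only exists for this comparison.

```python
import pandas as pd

clean = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])
dirty = clean.copy()
dirty.iloc[-1] = 1000.0   # replace the largest value with one extreme outlier

def summary(y):
    """Return the classic and the robust statistics side by side."""
    q1, q3 = y.quantile([0.25, 0.75]).values
    return {"mean": y.mean(), "std": y.std(),
            "median": y.median(), "iqr": q3 - q1}

stats_clean = summary(clean)
stats_dirty = summary(dirty)

print(pd.DataFrame({"clean": stats_clean, "dirty": stats_dirty}))
```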

# compute the statistics
med = df.y.median()                             
# compute the quartiles 1 and 3
q1, q3 = df.y.quantile([0.25, 0.75]).values
iqr = q3 - q1

# Define inferior and superior limits for outliers
threshold = 1.5
lim_inferior = med - threshold * iqr
lim_superior = med + threshold * iqr

# use np.where to flag outliers
df['fl_outlier_m2'] = np.where(df.y > lim_superior, 1,
                               np.where(df.y < lim_inferior, 1, 0))

# plot the distribution with the inferior and superior limits
ax = df.y.hist(bins=30)
_ = ax.vlines(x=[lim_inferior, lim_superior], ymin=0, ymax=2000, colors='r')
ax.set_title("Outlier detection: Method 2");
Histogram – Method 2

On this data the result is relatively similar to Method 1, but in my experience this method is a lot more reliable. However, we get the same issue on the right side of the distribution.

Method 3: Outlier detection using rolling functions

df.y.plot()
y – Time Series

The data has a time dependency which the static methods I covered don’t exploit. I wanted to keep this post simple because the “simple” methods are very useful in practical terms.

However, here is an example of how you can use “method 1” on a rolling window. What is a rolling window? It’s a very powerful concept in time series analysis: instead of computing a statistic once over the whole dataset, we recompute it at every point using only a “rolling” or moving window of recent observations.
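A tiny example on a made-up series may help. With a window of 3, each output value is the mean of the current observation and the two before it; the first two positions have no complete window, so pandas returns NaN there.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value is the mean of the current row and the two rows before it
roll_mean = s.rolling(window=3).mean()
print(roll_mean.tolist())
# [nan, nan, 2.0, 3.0, 4.0]
```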

Let’s take a look at an example of how this is possible in Pandas. In this example I use a window of 60 times 24 observations (1,440), which is one day of data at one observation per minute.

The approach is exactly the same as method 1, but I use only the most recent day of observations to compute the mean and standard deviation.

# Define a time window to look back 
n_window = 60 * 24       # 60 minutes * 24 hours 

# the .rolling call computes the mean and standard deviation for the n_window observations
df['roll_mean'] = df.y.rolling(window=n_window, min_periods=None, center=False).mean()
df['roll_std'] = df.y.rolling(window=n_window, min_periods=None, center=False).std()

# Define inferior and superior limits for outliers
threshold = 2

# define rolling superior (inferior) thresholds
df['roll_superior'] = df.roll_mean + df.roll_std * threshold
df['roll_inferior'] = df.roll_mean - df.roll_std * threshold

# use np.where to flag outliers
df['fl_outlier_m3'] = np.where(df.y > df.roll_superior, 1,
                               np.where(df.y < df.roll_inferior, 1, 0))

ax = df.loc[:, ['y', 'roll_superior', 'roll_inferior']].plot()
ax.set_title("Outlier detection using a rolling window approach")
Method 3
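One detail worth knowing: with center=False, a pandas rolling window ends at the current row, so each observation contributes to its own mean and standard deviation. If you would rather compare each point only against strictly earlier data, one option is to shift the series by one row before rolling. The sketch below shows that variant on synthetic data; the series, the injected spike, the 60-point window and the threshold of 3 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level series with one injected spike (illustrative only)
idx = pd.date_range("2024-01-01", periods=200, freq="min")
rng = np.random.default_rng(0)
y = pd.Series(rng.normal(loc=50.0, scale=1.0, size=200), index=idx)
y.iloc[150] = 80.0

n_window = 60    # look back 60 observations
threshold = 3

# Shift by one row so each window ends at the previous observation
past = y.shift(1)
roll_mean = past.rolling(window=n_window).mean()
roll_std = past.rolling(window=n_window).std()

roll_superior = roll_mean + threshold * roll_std
roll_inferior = roll_mean - threshold * roll_std

# Rows without a full window have NaN limits and are never flagged
is_outlier = ((y > roll_superior) | (y < roll_inferior)).astype(int)
print(y[is_outlier == 1])
```

When the index is a DatetimeIndex, rolling also accepts a time offset such as "1D" in place of a fixed number of rows, which can be handy if the observations are not evenly spaced.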

I hope you have enjoyed this post! I covered three methods to detect outliers with Pandas.

If you want to learn more, check out this YouTube video.

