If you want to fit a linear model to gain insights, and you care more about the coefficients than about making predictions, you will love this trick.

The statsmodels formula interface closely mirrors the formula syntax of the R programming language: you describe the model as a string such as 'y ~ x1 + x2' and the library builds the design matrix for you.

Let’s give it a try with some data.

Linear model with categorical variables

Here I import some libraries and the taxis DataFrame I’ll be using in this example.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from seaborn import load_dataset
taxis = load_dataset("taxis")

# some data pre-processing
# net_fare: fare plus tip, minus tolls
taxis['net_fare'] = taxis.fare + taxis.tip - taxis.tolls
# drop rows where the pickup borough is missing
taxis = taxis[taxis.pickup_borough.notna()]
taxis[['net_fare', 'distance', 'pickup_borough']].head()

Let’s try a first model. The net_fare column will be the dependent variable and distance and pickup_borough will be the independent variables, also called predictors.

Notice that distance is numeric and pickup_borough is categorical. The nice thing about the statsmodels formula interface is that you just define the model you want and fit it. You don’t need to create dummy variables manually: a categorical predictor is expanded into dummy columns for you, with one level held out as the baseline.

# First model
model = smf.ols(formula='net_fare ~ distance + pickup_borough', data=taxis)
res = model.fit()

print(res.summary())

And here we go. A single call to smf.ols defines the model, fit() estimates it, and res.summary() prints the usual regression table: coefficients, standard errors, t-statistics, p-values and confidence intervals.
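If you would rather work with the numbers than read them off the printed table, the fitted results expose them as pandas objects. The sketch below is only an illustration: it uses a small made-up DataFrame (called toy here, with invented values) instead of the real taxis data, and the names toy_model, toy_res and coef_table are my own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the taxis data (made-up values, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "distance": rng.uniform(0.5, 10, size=200),
    "pickup_borough": rng.choice(
        ["Bronx", "Brooklyn", "Manhattan", "Queens"], size=200
    ),
})
toy["net_fare"] = 3 + 2.5 * toy["distance"] + rng.normal(0, 1, size=200)

toy_model = smf.ols("net_fare ~ distance + pickup_borough", data=toy)
toy_res = toy_model.fit()

# Names of the design-matrix columns the formula generated,
# including the dummy columns for pickup_borough
print(toy_model.exog_names)

# Coefficients, 95% confidence intervals and p-values side by side
coef_table = pd.concat(
    [toy_res.params, toy_res.conf_int(), toy_res.pvalues], axis=1
)
coef_table.columns = ["coef", "ci_low", "ci_high", "p_value"]
print(coef_table)
```

With four boroughs you should see an intercept, three dummy columns and the distance column, because one borough is absorbed into the intercept.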

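In the summary, each pickup_borough coefficient is the estimated difference in net_fare relative to the baseline borough (the level that has no row of its own), holding distance fixed. Here we can see that the borough (district) matters a lot for fare prices.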
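The borough coefficients are always read against a baseline level. For a column of strings, the formula machinery should pick the first level in sorted order, which would be Bronx for these four boroughs. If another baseline makes more sense, the C(...) wrapper with Treatment(reference=...) lets you choose it. Here is a minimal sketch on made-up data (the toy DataFrame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the taxis data (made-up values, not the real dataset)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "distance": rng.uniform(0.5, 10, size=200),
    "pickup_borough": rng.choice(
        ["Bronx", "Brooklyn", "Manhattan", "Queens"], size=200
    ),
})
toy["net_fare"] = 3 + 2.5 * toy["distance"] + rng.normal(0, 1, size=200)

# Default coding: the baseline is chosen automatically
default_res = smf.ols(
    "net_fare ~ distance + pickup_borough", data=toy
).fit()

# Explicit baseline: Manhattan becomes the reference level
manhattan_res = smf.ols(
    "net_fare ~ distance"
    " + C(pickup_borough, Treatment(reference='Manhattan'))",
    data=toy,
).fit()

print(default_res.params)
print(manhattan_res.params)
```

Both fits describe the same model, so the overall fit does not change. Only the meaning of the intercept and of the borough coefficients shifts to the new baseline.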

Linear model with interaction terms

One of my favorite things about the formula syntax is how easy it is to try different model specifications. Let’s say we want to add the interaction between distance and pickup_borough while keeping the main effects we estimated before.

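To do this, you just need to replace + with *. The formula distance * pickup_borough is shorthand for distance + pickup_borough + distance:pickup_borough. If we only wanted the interaction terms, we would use : instead of *.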
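A quick way to convince yourself of the difference between * and : is to fit the variants and compare their coefficients. This is a sketch on made-up data (the toy DataFrame is invented, not the taxis dataset), and the result names star, expanded and only_inter are my own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the taxis data (made-up values, not the real dataset)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "distance": rng.uniform(0.5, 10, size=300),
    "pickup_borough": rng.choice(
        ["Bronx", "Brooklyn", "Manhattan", "Queens"], size=300
    ),
})
toy["net_fare"] = 3 + 2.5 * toy["distance"] + rng.normal(0, 1, size=300)

# a * b: main effects plus interaction
star = smf.ols("net_fare ~ distance * pickup_borough", data=toy).fit()

# The same model written out term by term
expanded = smf.ols(
    "net_fare ~ distance + pickup_borough + distance:pickup_borough",
    data=toy,
).fit()

# a : b alone: interaction terms only, no separate main effects
only_inter = smf.ols("net_fare ~ distance:pickup_borough", data=toy).fit()

# The first two fits should have identical coefficients
same = np.allclose(
    star.params.sort_index().values, expanded.params.sort_index().values
)
print(same)
print(len(star.params), len(only_inter.params))
```

The interaction-only model has fewer coefficients because it drops the borough-specific intercept shifts.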

model = smf.ols(formula='net_fare ~ distance * pickup_borough', data=taxis)
res = model.fit()
print(res.summary())
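In the interaction model, the distance coefficient is the slope for the baseline borough, and each interaction coefficient is the difference in slope for another borough. One way to get the per-borough slopes without parsing coefficient names is to predict at two distances and take the difference. The sketch below uses made-up data in which I deliberately gave each borough a different slope (the true_slopes values are invented for illustration).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: each borough gets its own (invented) fare per unit distance
true_slopes = {"Bronx": 2.0, "Brooklyn": 2.5, "Manhattan": 3.0, "Queens": 3.5}
boroughs = list(true_slopes)

rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "distance": rng.uniform(0.5, 10, size=400),
    "pickup_borough": rng.choice(boroughs, size=400),
})
toy["net_fare"] = (
    3
    + toy["pickup_borough"].map(true_slopes) * toy["distance"]
    + rng.normal(0, 0.5, size=400)
)

res = smf.ols("net_fare ~ distance * pickup_borough", data=toy).fit()

# Predict at distance 0 and 1 for every borough;
# the difference is that borough's estimated slope
at_zero = pd.DataFrame({"distance": 0.0, "pickup_borough": boroughs})
at_one = pd.DataFrame({"distance": 1.0, "pickup_borough": boroughs})
slopes = pd.Series(
    res.predict(at_one).values - res.predict(at_zero).values,
    index=boroughs,
)
print(slopes)
```

The estimated slopes should land close to the values used to generate the data, which is a handy sanity check before interpreting the real output.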

And that’s all for today. Hope you have enjoyed this short tutorial on the statsmodels formula syntax in Python!

Check out my YouTube channel for more free Data Science content!

