Regression splines in Python

Regression splines in Python

Regression splines in Python: Cubic spline and natural cubic spline

The following code tutorial is mainly based on code provided by Jordi Warmenhoven. To learn more about the regression methods, review “An Introduction to Statistical Learning” from James et al. (2021).

Data

Import

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv')
df
Unnamed: 0 year age maritl race education region jobclass health health_ins logwage wage
0 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic 1. Industrial 1. <=Good 2. No 4.318063 75.043154
1 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic 2. Information 2. >=Very Good 2. No 4.255273 70.476020
2 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes 4.875061 130.982177
3 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic 2. Information 2. >=Very Good 1. Yes 5.041393 154.685293
4 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic 2. Information 1. <=Good 1. Yes 4.318063 75.043154
... ... ... ... ... ... ... ... ... ... ... ... ...
2995 376816 2008 44 2. Married 1. White 3. Some College 2. Middle Atlantic 1. Industrial 2. >=Very Good 1. Yes 5.041393 154.685293
2996 302281 2007 30 2. Married 1. White 2. HS Grad 2. Middle Atlantic 1. Industrial 2. >=Very Good 2. No 4.602060 99.689464
2997 10033 2005 27 2. Married 2. Black 1. < HS Grad 2. Middle Atlantic 1. Industrial 1. <=Good 2. No 4.193125 66.229408
2998 14375 2005 27 1. Never Married 1. White 3. Some College 2. Middle Atlantic 1. Industrial 2. >=Very Good 1. Yes 4.477121 87.981033
2999 453557 2009 55 5. Separated 1. White 2. HS Grad 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes 4.505150 90.481913

3000 rows × 12 columns

Create label and feature

We only use the feature age to predict wage:

X = df[['age']]
y = df[['wage']]

Data split

Dividing data into train and test datasets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1)

Data exploration

Visualize the relationship between age and wage:

import seaborn as sns  

# seaborn settings
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)

# plot
sns.scatterplot(x=X_train['age'], y=y_train['wage'], alpha=0.4);

Simple regression

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train,y_train)
LinearRegression()
print(lm.coef_)
print(lm.intercept_)
[[0.72106369]]
[80.58924709]
from sklearn.metrics import mean_squared_error

# Training data
pred_train = lm.predict(X_train)
rmse_train = mean_squared_error(y_train, pred_train, squared=False)

# Test data
pred_test = lm.predict(X_test)
rmse_test =mean_squared_error(y_test, pred_test, squared=False)

# Save model results
model_results_lm = pd.DataFrame(
    {
    "model": "Linear Model (lm)",  
    "rmse_train": [rmse_train], 
    "rmse_test": [rmse_test],
    })
model_results_lm
model rmse_train rmse_test
0 Linear Model (lm) 40.705334 41.413848
sns.regplot(x=X_train['age'], 
            y=y_train['wage'], 
            ci=None, 
            line_kws={"color": "orange"});

Polynomial regression

from sklearn.preprocessing import PolynomialFeatures

# polynomial degree 2
poly = PolynomialFeatures(2)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.fit_transform(X_test)
pm = LinearRegression()
pm.fit(X_train_poly,y_train)
LinearRegression()
# Training data
pred_train = pm.predict(X_train_poly)
rmse_train = mean_squared_error(y_train, 
                                pred_train, 
                                squared=False)

# Test data
pred_test = pm.predict(X_test_poly)
rmse_test =mean_squared_error(y_test, 
                              pred_test, 
                              squared=False)

# Save model results
model_results_pm = pd.DataFrame(
    {
    "model": "Polynomial Model (pm)",  
    "rmse_train": [rmse_train], 
    "rmse_test": [rmse_test],
    })

results = pd.concat([model_results_lm, model_results_pm], axis=0)
results
model rmse_train rmse_test
0 Linear Model (lm) 40.705334 41.413848
0 Polynomial Model (pm) 39.823165 40.344788
# plot
sns.regplot(x=X_train['age'], 
            y=y_train['wage'], 
            ci=None, 
            order=2, 
            line_kws={"color": "orange"});

Cubic spline

We use the module patsy to create non-linear transformations of the input data. We will fit 2 models with different number of knots.

from patsy import dmatrix
# Generating cubic spline with 3 knots at 25, 40 and 60
transformed_x = dmatrix(
            "bs(train, knots=(25,40,60), degree=3, include_intercept=False)", 
                {"train": X_train},return_type='dataframe')
transformed_x.head()
Intercept bs(train, knots=(25, 40, 60), degree=3, include_intercept=False)[0] bs(train, knots=(25, 40, 60), degree=3, include_intercept=False)[1] bs(train, knots=(25, 40, 60), degree=3, include_intercept=False)[2] bs(train, knots=(25, 40, 60), degree=3, include_intercept=False)[3] bs(train, knots=(25, 40, 60), degree=3, include_intercept=False)[4] bs(train, knots=(25, 40, 60), degree=3, include_intercept=False)[5]
1045 1.0 0.000000 0.114796 0.618564 0.262733 0.003906 0.0
2717 1.0 0.000000 0.024796 0.477428 0.456182 0.041594 0.0
2835 1.0 0.070523 0.598567 0.319030 0.011879 0.000000 0.0
2913 1.0 0.000000 0.000272 0.241156 0.576321 0.182250 0.0
959 1.0 0.000000 0.034014 0.508194 0.426542 0.031250 0.0

We use statsmodels to estimate a generalized linear model:

import statsmodels.api as sm
# Fitting generalised linear model on transformed dataset
cs = sm.GLM(y_train, transformed_x).fit()
# Training data
pred_train = cs.predict(dmatrix("bs(train, knots=(25,40,60), include_intercept=False)", {"train": X_train}, return_type='dataframe'))
rmse_train = mean_squared_error(y_train, pred_train, squared=False)

# Test data
pred_test = cs.predict(dmatrix("bs(test, knots=(25,40,60), include_intercept=False)", {"test": X_test}, return_type='dataframe'))
rmse_test =mean_squared_error(y_test, pred_test, squared=False)

# Save model results
model_results_cs = pd.DataFrame(
    {
    "model": "Cubic spline (cs)",  
    "rmse_train": [rmse_train], 
    "rmse_test": [rmse_test]
    })
results = pd.concat([results, model_results_cs], axis=0)
results
model rmse_train rmse_test
0 Linear Model (lm) 40.705334 41.413848
0 Polynomial Model (pm) 39.823165 40.344788
0 Cubic spline (cs) 39.726084 40.267857
import numpy as np
import matplotlib.pyplot as plt

# Create observations
xp = np.linspace(X_test.min(),X_test.max(), 100)
# Make some predictions
pred = cs.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp}, return_type='dataframe'))

# plot
sns.scatterplot(x=X_train['age'], y=y_train['wage'])

plt.plot(xp, pred, label='Cubic spline with degree=3 (3 knots)', color='orange')
plt.legend();

Natural cubic spline

transformed_x3 = dmatrix("cr(train,df = 3)", {"train": X_train}, return_type='dataframe')

ncs = sm.GLM(y_train, transformed_x3).fit()
# Training data
pred_train = ncs.predict(dmatrix("cr(train, df=3)", {"train": X_train}, return_type='dataframe'))
rmse_train = mean_squared_error(y_train, pred_train, squared=False)

# Test data
pred_test = ncs.predict(dmatrix("cr(test, df=3)", {"test": X_test}, return_type='dataframe'))
rmse_test =mean_squared_error(y_test, pred_test, squared=False)

# Save model results
model_results_ncs = pd.DataFrame(
    {
    "model": "Natural cubic spline (ncs)",  
    "rmse_train": [rmse_train], 
    "rmse_test": [rmse_test]
    })

results = pd.concat([results, model_results_ncs], axis=0)
results
model rmse_train rmse_test
0 Linear Model (lm) 40.705334 41.413848
0 Polynomial Model (pm) 39.823165 40.344788
0 Cubic spline (cs) 39.726084 40.267857
0 Natural cubic spline (ncs) 39.882574 40.325236
# Make predictions
pred = ncs.predict(dmatrix("cr(xp, df=3)", {"xp": xp}, return_type='dataframe'))

# plot
sns.scatterplot(x=X_train['age'], y=y_train['wage'])
plt.plot(xp, pred, color='orange', label='Natural spline with df=3')
plt.legend();

Avatar
Jan Kirenz
Professor

I’m a data scientist educator and consultant.