```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
=True) np.set_printoptions(suppress
```

## Regression splines in Python

*The following code tutorial is mainly based on the scikit learn documentation about splines provided by Mathieu Blondel, Jake Vanderplas, Christian Lorentzen and Malte Londschien and code from Jordi Warmenhoven*. To learn more about the spline regression method, review “An Introduction to Statistical Learning”.

Regression splines involve dividing the range of a feature X into K distinct regions (by using so called knots). Within each region, a polynomial function (also called a Basis Spline or B-splines) is fit to the data.

In the following example, various piecewise polynomials are fit to the data, with one knot at age=50 {cite:p}

`James2021`

:

Figures:

Top Left: The cubic polynomials are unconstrained.

Top Right: The cubic polynomials are constrained to be continuous at age=50.

Bottom Left: The cubic polynomials are constrained to be continuous, and to have continuous first and second derivatives.

Bottom Right: A linear spline is shown, which is constrained to be continuous.

The polynomials are ususally constrained so that they join smoothly at the region boundaries, or knots.

Provided that the interval is divided into enough regions, this can produce an extremely flexibel fit {cite:p}

`James2021`

:

Figure: A cubic spline and a natural cubic spline, with three knots. The dashed lines denote the knot locations.

- To understand the advantages of regression splines, we first start with a linear ridge regression model, build a simple polynomial regression and then proceed to splines. qua

## Setup

- Write function to obtain RMSE for training and test data:

```
= []
results
# create function to obtain model rmse
def model_results(model_name):
# Training data
= reg.predict(X_train)
pred_train = round(mean_squared_error(y_train, pred_train, squared=False),4)
rmse_train
# Test data
= reg.predict(X_test)
pred_test =round(mean_squared_error(y_test, pred_test, squared=False),4)
rmse_test
# Save model results
= {"model": model_name, "rmse_train": rmse_train, "rmse_test": rmse_test}
new_results
results.append(new_results)
return results;
```

## Data

### Import

```
= pd.read_csv('https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv')
df 3) df.head(
```

### Create label and feature

- We only use the feature age to predict wage (because we only use one predictor, we don’t perform data standardization)

```
= df[['age']]
X = df[['wage']] y
```

### Data split

- Dividing data into train and test datasets

`= train_test_split(X, y, test_size=0.3, random_state = 1) X_train, X_test, y_train, y_test `

### Data exploration

- Visualize the relationship between age and wage:

```
# seaborn settings
= {"axes.spines.right": False, "axes.spines.top": False}
custom_params ="ticks", rc=custom_params)
sns.set_theme(style
# plot
=X_train['age'], y=y_train['wage'], alpha=0.4); sns.scatterplot(x
```

## Ridge regression

- First, we obtain the optimal alpha parameter with cross validation
- We try different values for alpha:

```
=np.logspace(-6, 6, 13)
alphas alphas
```

- Fit model

```
= linear_model.RidgeCV(alphas=alphas)
reg reg.fit(X_train,y_train)
```

- Show best alpha

` reg.alpha_`

- Show coefficients

```
print(reg.coef_)
print(reg.intercept_)
```

`="Ridge") model_results(model_name`

```
=X_train['age'],
sns.regplot(x=y_train['wage'],
y=None,
ci={"color": "orange"}); line_kws
```

## Polynomial regression

Next, we use a pipeline to add non-linear features to a ridge regression model.

We use

`make_pipeline`

which is a shorthand for the Pipeline constructor- It does not require, and does not permit, naming the estimators.
- Instead, their names will be set to the lowercase of their types automatically:

```
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# use polynomial features with degree 3
= make_pipeline(
reg =3),
PolynomialFeatures(degree=alphas)
linear_model.RidgeCV(alphas
)
reg.fit(X_train, y_train)
```

`="Polynomial Reg") model_results(model_name`

```
# plot
=X_train['age'],
sns.regplot(x=y_train['wage'],
y=None,
ci=3,
order={"color": "orange"}); line_kws
```

## Splines (scikit-learn)

*Note that spline transformers are a new feature in scikit learn 1.0. Therefore, make sure to use the latest version of scikit learn. Use conda list scikit-learn to see which scikit-learn version is installed. If you use Anaconda, you can update all packages using conda update --all*

### Spline transformer

The following function places the knots in a uniform (this is the default) or quantile fashion.

We only have to specify the desired number of knots, and then have SplineTransformer automatically place the corresponding number of knots.

```
from sklearn.preprocessing import SplineTransformer
# use a spline wit 4 knots and 3 degrees
# we combine the spline with a ridge regressions
= make_pipeline(
reg =4, degree=3),
SplineTransformer(n_knots=alphas)
linear_model.RidgeCV(alphas
) reg.fit(X_train, y_train)
```

`= "Cubic Spline") model_results(model_name `

- Obtain knots to show them in the following plot (there are degree number of additional knots each to the left and to the right of the fitted interval. These are there for technical reasons, so we refrain from showing them)

```
= SplineTransformer(n_knots=4, degree=3)
splt
splt.fit(X_train)= splt.bsplines_[0].t
knots
3:-3] knots[
```

```
# Create observations
= np.linspace(X_test.min(),X_test.max(), 100)
x_new # Make some predictions
= reg.predict(x_new)
pred
# plot
=X_train['age'], y=y_train['wage'])
sns.scatterplot(x
='Cubic spline with 4 knots and degree=3', color='orange')
plt.plot(x_new, pred, label3:-3], ymin=0, ymax=300, linestyles="dashed")
plt.vlines(knots[; plt.legend()
```

### Periodic splines

- In some settings, e.g. in time series data with seasonal effects, we expect a periodic continuation of the underlying signal.
- Such effects can be modelled using periodic splines, which have equal function value and equal derivatives at the first and last knot.
- Review this notebook to learn more about periodic splines in scikit learn: periodic splines

## Splines (statsmodels)

- In statsmodels, we can use pandas dataframes so let’s join our label and feature:

```
= y_train.join(X_train)
df_train = y_test.join(X_test)
df_test
df_train
```

Instead of providing the number of knots, in statsmodels, we have to specify the degrees of freedom (df).

`df`

defines how many parameters we have to estimate.They have a specific relationship with the number of knots and the degree, which depends on the type of spline (see Stackoverflow):

In the case of

**B-splines**:- \(df=𝑘+degree\) if you specify the knots or
- \(𝑘=df−degree\) if you specify the degrees of freedom and the degree.

As an example:

A cubic spline (degree=3) with 4 knots (K=4) will have \(df=4+3=7\) degrees of freedom. If we use an intercept, we need to add an additional degree of freedom.

A cubic spline (degree=3) with 5 degrees of freedom (df=5) will have \(𝑘=5−3=2\) knots (assuming the spline has no intercept).

In our case, we want to fit a cubic spline (degree=3) with an intercept and three knots (K=3). This equals \(df=3+3+1=7\) for our feature. This means that these degrees of freedom are used up by an intercept, plus six basis functions.

```
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines
= BSplines(X_train[['age']], df=7, degree=3) bs
```

- We fit a Generalized Additive Model (GAM):

`= GLMGam.from_formula('wage ~ age', data=df_train, smoother=bs).fit() reg `

- Take a look at the results

`print(reg.summary())`

- Obtain RMSE for training and test data:

```
= "Spline statsmodels"
model_name
# Training data
'pred_train'] = reg.predict()
df_train[= round(mean_squared_error(df_train['wage'], df_train['pred_train'], squared=False),4)
rmse_train
# Test data
'pred_test'] = reg.predict(df_test, exog_smooth= df_test['age'])
df_test[= round(mean_squared_error(df_test['wage'], df_test['pred_test'], squared=False),4)
rmse_test
# Save model results
= {"model": model_name, "rmse_train": rmse_train, "rmse_test": rmse_test}
new_results
results.append(new_results)
results
```

- We plot the spline with the Statsmodel’s function .plot_partial:

`0, cpr=True, plot_se=False) reg.plot_partial(`

## Natural spline

Finally, we fit a natural spline with patsy and statsmodels.

In patsy one can specify the number of degrees of freedom directly (actual number of columns of the resulting design matrix)

In the case of natural splines: \(df=𝑘−1\) if you specify the knots or \(𝑘=df+1\) if you specify the degrees of freedom.

```
from patsy import dmatrix
= dmatrix("cr(train, df = 3)", {"train": X_train}, return_type='dataframe') bs
```

- We use statsmodel’s “Generalized linear models” (review this tutorial to learn more about GAMs)

`= sm.GLM(y_train, bs).fit() reg `

```
= 'Natural Spline'
model_name
# Training data
= reg.predict(dmatrix("cr(train, df=3)", {"train": X_train}, return_type='dataframe'))
pred_train = round(mean_squared_error(y_train, pred_train, squared=False),4)
rmse_train
# Test data
= reg.predict(dmatrix("cr(test, df=3)", {"test": X_test}, return_type='dataframe'))
pred_test = round(mean_squared_error(y_test, pred_test, squared=False),4)
rmse_test
# Save model results
= {"model": model_name, "rmse_train": rmse_train, "rmse_test": rmse_test}
new_results
results.append(new_results)
results
```

- Plot model

```
= np.linspace(X_test.min(),X_test.max(), 100)
xp
# Make predictions
= reg.predict(dmatrix("cr(xp, df=3)", {"xp": xp}, return_type='dataframe'))
pred
# plot
=X_train['age'], y=y_train['wage'])
sns.scatterplot(x='orange', label='Natural spline with df=3')
plt.plot(xp, pred, color; plt.legend()
```

## Models summary

- Create a dataframe from model results

```
= pd.DataFrame(results)
df_results
=['rmse_test']) df_results.sort_values(by
```