Regression#
Co-author
Prerequisites
Outcomes
Recall linear regression from linear algebra
Know what feature engineering is and how feature engineering can be automated by neural networks
Understand the related concepts of overfitting and regularization
Understand lasso regression and its relation to linear regression
Understand regression forests as well as their relation to linear regression
Understand the basics of the multi-layer perceptron
Use scikit-learn to fit linear regression, lasso, regression forests, and multi-layer perceptron to data on housing prices near Seattle, WA
# Uncomment following line to install on colab
#! pip install fiona geopandas xgboost gensim folium pyLDAvis descartes
Introduction to Regression#
The goal of regression analysis is to provide accurate mapping from one or more input variables (called features in machine learning or exogenous variables in econometrics) to a continuous output variable (called the label or target in machine learning and the endogenous variable in econometrics).
In this lecture, we will study some of the most fundamental and widely-used regression algorithms.
We will follow this same general pattern when we learn each algorithm:
Describe the mathematical foundation for the algorithm
Use the scikit-learn python package to apply the algorithm to a real world dataset on house prices in Washington state
Dataset#
Let’s load the dataset and take a quick look at our task.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
colors = ['#165aa7', '#cb495c', '#fec630', '#bb60d5', '#f47915', '#06ab54', '#002070', '#b27d12', '#007030']
# We will import all these here to ensure that they are loaded, but
# will usually re-import close to where they are used to make clear
# where the functions come from
from sklearn import (
linear_model, metrics, neural_network, pipeline, model_selection
)
url = "https://datascience.quantecon.org/assets/data/kc_house_data.csv"
df = pd.read_csv(url)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
This dataset contains sales prices on houses in King County (which includes Seattle), Washington, from May 2014 to May 2015.
The data comes from Kaggle. Variable definitions and additional documentation are available at that link.
X = df.drop(["price", "date", "id"], axis=1).copy()
# convert everything to be a float for later on
for col in list(X):
X[col] = X[col].astype(float)
X.head()
| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 1.00 | 1180.0 | 5650.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1180.0 | 0.0 | 1955.0 | 0.0 | 98178.0 | 47.5112 | -122.257 | 1340.0 | 5650.0 |
| 1 | 3.0 | 2.25 | 2570.0 | 7242.0 | 2.0 | 0.0 | 0.0 | 3.0 | 7.0 | 2170.0 | 400.0 | 1951.0 | 1991.0 | 98125.0 | 47.7210 | -122.319 | 1690.0 | 7639.0 |
| 2 | 2.0 | 1.00 | 770.0 | 10000.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | 770.0 | 0.0 | 1933.0 | 0.0 | 98028.0 | 47.7379 | -122.233 | 2720.0 | 8062.0 |
| 3 | 4.0 | 3.00 | 1960.0 | 5000.0 | 1.0 | 0.0 | 0.0 | 5.0 | 7.0 | 1050.0 | 910.0 | 1965.0 | 0.0 | 98136.0 | 47.5208 | -122.393 | 1360.0 | 5000.0 |
| 4 | 3.0 | 2.00 | 1680.0 | 8080.0 | 1.0 | 0.0 | 0.0 | 3.0 | 8.0 | 1680.0 | 0.0 | 1987.0 | 0.0 | 98074.0 | 47.6168 | -122.045 | 1800.0 | 7503.0 |
# notice the log here!
y = np.log(df["price"])
df["log_price"] = y
y.head()
0 12.309982
1 13.195614
2 12.100712
3 13.311329
4 13.142166
Name: price, dtype: float64
While we will be using all variables in X in our regression models, we will explain some algorithms that use only the sqft_living variable.
Here’s what the log house price looks like against sqft_living:
def var_scatter(df, ax=None, var="sqft_living"):
if ax is None:
_, ax = plt.subplots(figsize=(8, 6))
df.plot.scatter(x=var , y="log_price", alpha=0.35, s=1.5, ax=ax)
return ax
var_scatter(df);
Linear Regression#
Let’s dive in by studying the “Hello World” of regression algorithms: linear regression.
Suppose we would like to predict the log of the sale price of a home, given only the livable square footage of the home.
The linear regression model for this situation is
\[\log(\text{price}) = \beta_0 + \beta_1 \text{sqft\_living} + \epsilon\]
\(\beta_0\) and \(\beta_1\) are called parameters (also coefficients or weights). The machine learning algorithm is tasked with finding the parameter values that best fit the data.
\(\epsilon\) is the error term. It would be unusual for the observed \(\log(\text{price})\) to be an exact linear function of \(\text{sqft\_living}\). The error term captures the deviation of \(\log(\text{price})\) from a linear function of \(\text{sqft\_living}\).
The linear regression algorithm will choose the parameters that minimize the mean squared error (MSE) function, which for our example is written as
\[\frac{1}{N} \sum_{i=1}^{N} \left( \log(\text{price}_i) - \left( \beta_0 + \beta_1 \text{sqft\_living}_i \right) \right)^2\]
The output of this algorithm is the straight line (hence linear) that passes as close to the points on our scatter chart as possible.
The sns.lmplot function below will plot our scatter chart and draw the optimal linear regression line through the data.
sns.lmplot(
data=df, x="sqft_living", y="log_price", height=6,
scatter_kws=dict(s=1.5, alpha=0.35)
);
Let’s use sklearn to replicate the figure ourselves.
First, we fit the model.
# import
from sklearn import linear_model
# construct the model instance
sqft_lr_model = linear_model.LinearRegression()
# fit the model
sqft_lr_model.fit(X[["sqft_living"]], y)
# print the coefficients
beta_0 = sqft_lr_model.intercept_
beta_1 = sqft_lr_model.coef_[0]
print(f"Fit model: log(price) = {beta_0:.4f} + {beta_1:.4f} sqft_living")
Fit model: log(price) = 12.2185 + 0.0004 sqft_living
Then, we construct the plot.
ax = var_scatter(df)
# points for the line
x = np.array([0, df["sqft_living"].max()])
ax.plot(x, beta_0 + beta_1*x)
[<matplotlib.lines.Line2D at 0x7f4b96cd6c90>]
We can call the predict method on our model to evaluate the model at arbitrary points.
For example, we can ask the model to predict the sale price of a 5,000-square-foot home.
# Note, the argument needs to be two-dimensional. You'll see why shortly.
logp_5000 = sqft_lr_model.predict([[5000]])[0]
print(f"The model predicts a 5,000 sq. foot home would cost {np.exp(logp_5000):.2f} dollars")
The model predicts a 5,000 sq. foot home would cost 1486889.32 dollars
/home/runner/miniconda3/envs/lecture-datascience/lib/python3.12/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
Exercise
See exercise 1 in the exercise list.
Exercise
See exercise 2 in the exercise list.
Multivariate Linear Regression#
Our example regression above is called a univariate linear regression since it uses a single feature.
In practice, more features would be used.
Suppose that in addition to sqft_living, we also wanted to use the bathrooms variable.
In this case, the linear regression model is
\[\log(\text{price}) = \beta_0 + \beta_1 \text{sqft\_living} + \beta_2 \text{bathrooms} + \epsilon\]
We could keep adding one variable at a time, along with a new \(\beta_{j}\) coefficient for the \(j\)th variable, but there’s an easier way.
Let’s write this equation in vector/matrix form as
\[Y = X \beta + \epsilon\]
Notice that we can add as many columns to \(X\) as we’d like and the linear regression model will still be written \(Y = X \beta + \epsilon\).
The mean squared error loss function for the general model is
\[\frac{1}{N} || Y - X \beta ||_2^2\]
where \(|| \cdot ||_2\) is the l2-norm.
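To make the matrix notation concrete, here is a minimal sketch (using numpy and the sqft_lr_model we fitted above) that evaluates this loss directly for the univariate model:
# evaluate (1/N) * ||y - X beta||_2^2 by hand for the sqft_living-only model
X_uni = np.column_stack([np.ones(len(X)), X["sqft_living"]])  # prepend a column of ones for beta_0
beta_hat = np.array([sqft_lr_model.intercept_, sqft_lr_model.coef_[0]])
residuals = y.to_numpy() - X_uni @ beta_hat
print(f"MSE from the matrix formula: {(residuals @ residuals) / len(y):.4f}")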
Let’s fit the linear regression model using all columns in X.
lr_model = linear_model.LinearRegression()
lr_model.fit(X, y)
LinearRegression()
We just fit a model with 18 variables – just as quickly and easily as fitting the model with 1 variable!
Visualizing an 18-dimensional model is rather difficult, but just so we can see how the extra features changed our model, let’s make the log price vs sqft_living scatter chart one more time – this time including the predictions from both of our linear models.
ax = var_scatter(df)
def scatter_model(mod, X, ax=None, color=colors[1], x="sqft_living"):
if ax is None:
_, ax = plt.subplots()
ax.scatter(X[x], mod.predict(X), c=color, alpha=0.25, s=1)
return ax
scatter_model(lr_model, X, ax, color=colors[1])
scatter_model(sqft_lr_model, X[["sqft_living"]], ax, color=colors[2])
ax.legend(["data", "full model", "sqft model"])
<matplotlib.legend.Legend at 0x7f4b96d2ffe0>
Exercise
See exercise 3 in the exercise list.
Nonlinear Relationships in Linear Regression#
While it sounds like an oxymoron, a linear regression model can actually include non-linear features.
The distinguishing feature of the linear regression model is that each prediction is generated by taking the dot product (a linear operator) between a feature vector (one row of \(X\)) and a coefficient vector (\(\beta\)).
There is, however, no restriction on which elements we include in our feature vector.
Let’s consider an example…
Starting from the sqft_living-only model, suppose we have a hunch that we should also include the fraction of square feet above ground.
This new variable can be computed as sqft_above / sqft_living.
This second feature is a nonlinear function of the original features, but it can easily be included as a column in X.
Let’s see this in action.
X2 = X[["sqft_living"]].copy()
X2["pct_sqft_above"] = X["sqft_above"] / X["sqft_living"]
sqft_above_lr_model = linear_model.LinearRegression()
sqft_above_lr_model.fit(X2, y)
new_mse = metrics.mean_squared_error(y, sqft_above_lr_model.predict(X2))
old_mse = metrics.mean_squared_error(y, sqft_lr_model.predict(X2[["sqft_living"]]))
print(f"The mse changed from {old_mse:.4f} to {new_mse:.4f} by including our new feature")
The mse changed from 0.1433 to 0.1430 by including our new feature
Exercise
See exercise 4 in the exercise list.
Determining which columns belong in \(X\) is called feature engineering and is a large part of a machine learning practitioner’s job.
You may recall from (or will see in) your econometrics course(s) that choosing which control variables to include in a regression model is an important part of applied research.
Interpretability#
Before moving to our next regression model, we want to broach the idea of the interpretability of models.
An interpretable model is one for which we can analyze the coefficients to explain why it makes its predictions.
Recall \(\beta_0\) and \(\beta_1\) from the univariate model.
The interpretation of the model is that \(\beta_0\) captures the notion of the average or starting house price and \(\beta_1\) is the additional value per square foot.
Concretely, we have
beta_0, beta_1
(12.218464096380853, 0.0003987465387451504)
which means that our model predicts the log price of a house to be 12.22, plus an additional 0.0004 for every square foot.
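As a quick check on what this slope implies, we can compute the effect of an extra 100 square feet (the 100 here is an arbitrary illustration):
extra_sqft = 100
log_price_increase = beta_1 * extra_sqft
print(f"An extra {extra_sqft} sq. feet raises predicted log(price) by {log_price_increase:.4f},")
print(f"which multiplies the predicted price by {np.exp(log_price_increase):.3f}")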
Using more exotic machine learning methods can produce more accurate predictions, but the resulting models are often less interpretable.
The accuracy vs interpretability tradeoff is a hot discussion topic, especially relating to concepts like machine learning ethics. This is something you should be aware of as you continue to learn about these techniques.
Lasso Regression#
Lasso regression is very closely related to linear regression.
The lasso model also generates predictions using \(y = X \beta\) but optimizes over a slightly different loss function.
The optimization problem solved by lasso regression can be written as
\[\min_{\beta} \frac{1}{N} || Y - X \beta ||_2^2 + \alpha || \beta ||_1\]
where \(|| a ||_1 = \sum_{i=1}^N | a_i|\) is the l1-norm and \(\alpha\) is called the regularization parameter.
The additional term penalizes large coefficients and in practice, effectively sets coefficients to zero for features that are not informative about the target.
Let’s see an example of what this looks like using the full feature set in X.
lasso_model = linear_model.Lasso()
lasso_model.fit(X, y)
lasso_coefs = pd.Series(dict(zip(list(X), lasso_model.coef_)))
lr_coefs = pd.Series(dict(zip(list(X), lr_model.coef_)))
coefs = pd.DataFrame(dict(lasso=lasso_coefs, linreg=lr_coefs))
coefs
| | lasso | linreg |
|---|---|---|
| bedrooms | -0.000000e+00 | -1.220820e-02 |
| bathrooms | 0.000000e+00 | 6.912370e-02 |
| sqft_living | 3.007615e-04 | 9.573470e-05 |
| sqft_lot | 2.736772e-07 | 4.711823e-07 |
| floors | 0.000000e+00 | 7.515336e-02 |
| waterfront | 0.000000e+00 | 3.711951e-01 |
| view | 0.000000e+00 | 6.040466e-02 |
| condition | 0.000000e+00 | 6.263658e-02 |
| grade | 0.000000e+00 | 1.589338e-01 |
| sqft_above | -0.000000e+00 | 4.022408e-05 |
| sqft_basement | 1.614446e-05 | 5.551062e-05 |
| yr_built | -1.159230e-03 | -3.410520e-03 |
| yr_renovated | 8.070315e-05 | 3.659044e-05 |
| zipcode | 7.111990e-04 | -6.458836e-04 |
| lat | 0.000000e+00 | 1.399994e+00 |
| long | -0.000000e+00 | -1.591565e-01 |
| sqft_living15 | 2.038719e-04 | 9.856623e-05 |
| sqft_lot15 | -1.042168e-06 | -2.610066e-07 |
Notice that many coefficients from the lasso regression have been set to zero.
The intuition here is that the corresponding features hadn’t provided enough predictive power to be worth considering alongside the other features.
The default value for the \(\alpha\) parameter is 1.0.
Larger \(\alpha\) values cause coefficients to shrink (and maybe additional ones to be thrown out).
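As a minimal illustration of this (the alpha values below are arbitrary choices), we can count how many coefficients survive at a few levels of regularization:
# count non-zero lasso coefficients for a few (arbitrary) alpha values
for alpha in [0.1, 1.0, 10.0]:
    n_nonzero = np.sum(linear_model.Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_ != 0)
    print(f"alpha = {alpha:>4}: {n_nonzero} non-zero coefficients")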
# Compute lasso for many alphas (the lasso path)
from itertools import cycle
alphas = np.exp(np.linspace(10, -2, 50))
alphas, coefs_lasso, _ = linear_model.lasso_path(X, y, alphas=alphas, max_iter=10000)
# plotting
fig, ax = plt.subplots(figsize=(12, 8))
color_cycle = cycle(colors)
log_alphas = -np.log10(alphas)
for coef_l, c, name in zip(coefs_lasso, color_cycle, list(X)):
ax.plot(log_alphas, coef_l, c=c)
ax.set_xlabel('-Log(alpha)')
ax.set_ylabel('lasso coefficients')
ax.set_title('Lasso Path')
ax.axis('tight')
maxabs = np.max(np.abs(coef_l))
i = [idx for idx in range(len(coef_l)) if abs(coef_l[idx]) >= (0.9*maxabs)][0]
xnote = log_alphas[i]
ynote = coef_l[i]
ax.annotate(name, (xnote, ynote), color=c)
Overfitting and Regularization#
You might be wondering, “Why would we ever want to throw variables out, can’t that only hurt our model?”
The primary answer: it helps us avoid a common issue called overfitting.
Overfitting occurs when a model specializes its coefficients too much to the data it was trained on, and as a result performs poorly when predicting on data outside the training set.
The extreme example of overfitting is a model that can perfectly memorize the training data, but does no better than randomly guessing when predicting on a new observation.
The techniques applied to reduce overfitting are known as regularization.
Regularization is an attempt to limit a model’s ability to specialize too narrowly on training data (i.e., to limit overfitting) by penalizing extreme values of the model’s parameters.
The additional term in the lasso regression loss function (\(\alpha ||\beta||_1\)) is a form of regularization.
Let’s demonstrate the overfitting and regularization phenomenon on our housing price data as follows:
Split the data set into training and testing subsets. We will use the first 50 observations for training and the rest for testing.
Fit the linear regression model and report MSE on training and testing datasets.
Fit the lasso model and report the same statistics.
def fit_and_report_mses(mod, X_train, X_test, y_train, y_test):
mod.fit(X_train, y_train)
return dict(
mse_train=metrics.mean_squared_error(y_train, mod.predict(X_train)),
mse_test=metrics.mean_squared_error(y_test, mod.predict(X_test))
)
# use the first 50 observations for training and the rest for testing
n_train = 50
X_train = X.iloc[:n_train, :]
X_test = X.iloc[n_train:, :]
y_train = y.iloc[:n_train]
y_test = y.iloc[n_train:]
fit_and_report_mses(linear_model.LinearRegression(), X_train, X_test, y_train, y_test)
{'mse_train': 0.05545627171577248, 'mse_test': 0.6705205330634351}
fit_and_report_mses(linear_model.Lasso(), X_train, X_test, y_train, y_test)
{'mse_train': 0.1062919502412531, 'mse_test': 0.2910434823708495}
Notice how the MSE on the training dataset was smaller for the linear model without the regularization, but the MSE on the test dataset was much higher.
This strongly suggests that the linear regression model was overfitting.
The regularization parameter has a large impact on overfitting.
alphas = np.exp(np.linspace(10, -5, 100))
mse = pd.DataFrame([fit_and_report_mses(linear_model.Lasso(alpha=alpha, max_iter=50000),
X_train, X_test, y_train, y_test)
for alpha in alphas])
mse["log_alpha"] = -np.log10(alphas)
fig, ax = plt.subplots(figsize=(10,6))
mse.plot(x="log_alpha", y="mse_test", c=colors[0], ax=ax)
mse.plot(x="log_alpha", y="mse_train", c=colors[1], ax=ax)
ax.set_xlabel(r"$-\log(\alpha)$")
ax.set_ylabel("MSE")
ax.get_legend().remove()
ax.annotate("test",(mse.log_alpha[15], mse.mse_test[15]),color=colors[0])
ax.annotate("train",(mse.log_alpha[30], mse.mse_train[30]),color=colors[1])
Text(-2.368878992199555, 0.2249868810554537, 'train')
Cross-validation of Regularization Parameter#
As you can see in the figure above, the regularization parameter has a large impact on MSE in the test data.
Moreover, the relationship between the test data MSE and \(\alpha\) is complicated and non-monotonic.
One popular method for choosing the regularization parameter is cross-validation.
Roughly speaking, cross-validation splits the dataset into many training/testing subsets, then chooses the regularization parameter value that minimizes the average MSE.
More precisely, k-fold cross-validation does the following:
Partition the dataset randomly into k subsets/”folds”.
Compute \(MSE_j(\alpha)=\) the mean squared error in the j-th subset when using the j-th subset as test data and the other k-1 subsets as training data.
Minimize average (across folds) MSE \(\min_\alpha \frac{1}{k} \sum_{j=1}^k MSE_j(\alpha)\).
The following code plots 5-fold, cross-validated MSE as a function of \(\alpha\), using the same training data as above.
from sklearn.model_selection import cross_val_score
mse["cv"] = [-np.mean(cross_val_score(linear_model.Lasso(alpha=alpha, max_iter=50000),
X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
for alpha in alphas]
mse.plot(x="log_alpha", y="cv", c=colors[2], ax=ax)
ax.annotate("5 fold cross-validation", (mse.log_alpha[40], mse.cv[40]), color=colors[2])
ax.get_legend().remove()
ax.set_xlabel(r"$-\log(\alpha)$")
ax.set_ylabel("MSE")
fig
scikit-learn also includes methods to automate the above and select \(\alpha\).
# LassoCV exploits special structure of lasso problem to minimize CV more efficiently
lasso = linear_model.LassoCV(cv=5).fit(X_train,y_train)
-np.log10(lasso.alpha_) # should roughly = minimizer on graph, not exactly equal due to random splitting
-1.9732104006516822
Holdout#
Practitioners often use another technique to avoid overfitting, called holdout.
We demonstrated an extreme example of applying holdout above when we used only the first 50 observations to train our models.
In general, good practice is to split the entire dataset into a training subset and a testing (or validation) subset.
The splitting should be done randomly. It should leave enough data in the training dataset to produce a good model, but also enough in the validation subset to determine the degree of overfitting.
There aren’t hard and fast rules for how much data to put in each subset, but a reasonable default uses 75% of the data for training and the rest for testing.
As in the example above, the training data is often further split while selecting regularization parameters with cross-validation.
The sklearn function model_selection.train_test_split will do this for you:
# note test_size=0.25 is the default value, but is shown here so you
# can see how to change it
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)
Exercise
See exercise 5 in the exercise list.
Lasso in Econometrics#
Lasso is becoming increasingly popular in economics, partially due to the work of Victor Chernozhukov and his coauthors.
In econometrics, the goal is typically to estimate some coefficient of interest or causal effect, rather than obtaining the most precise prediction.
Among other things, this different goal affects how the regularization parameter must be chosen.
[regBC11] and [regCHS16] are somewhat approachable introductions to this area.
The latter of these two references includes an R package.
[regCCD+18] is close to the state of the art in this rapidly advancing area.
Random Forests#
Random forests are also becoming increasingly popular in economics, thanks to the work of Susan Athey and her coauthors.
[regAI17] gives a very brief overview of this work, and [regAI18] are video lectures and associated code aimed at a broader audience.
Regression Trees#
To understand a forest, we must first understand trees.
We will begin by fitting a tree to this simulated data.
import numpy as np
# Simulate some data and plot it
n = 1000
Xsim = np.random.rand(n,2)
def Ey_x(x):
return 1/3*(np.sin(5*x[0])*np.sqrt(x[1])*np.exp(-(x[1]-0.5)**2))
ysim = np.apply_along_axis(Ey_x, 1, Xsim) + np.random.randn(n)*0.1
import plotly.graph_objects as go
def surface_scatter_plot(X,y,f, xlo=0., xhi=1., ngrid=50,
width=860, height=700, f0=Ey_x, show_f0=False):
scatter = go.Scatter3d(x=X[:,0],y=X[:,1],z=y,
mode='markers',
marker=dict(size=2, opacity=0.3)
)
xgrid = np.linspace(xlo,xhi,ngrid)
ey = np.zeros((len(xgrid),len(xgrid)))
ey0 = np.zeros((len(xgrid),len(xgrid)))
colorscale = [[0, colors[0]], [1, colors[2]]]
for i in range(len(xgrid)):
for j in range(len(xgrid)):
ey[j,i] = f([xgrid[i],xgrid[j]])
ey0[j,i]= f0([xgrid[i],xgrid[j]])
surface = go.Surface(x=xgrid, y=xgrid, z=ey, colorscale=colorscale, opacity=1.0)
if (show_f0):
surface0 = go.Surface(x=xgrid, y=xgrid, z=ey0, opacity=0.8, colorscale=colorscale)
layers = [scatter, surface, surface0]
else:
layers = [scatter, surface]
fig = go.FigureWidget(
data=layers,
layout = go.Layout(
autosize=True,
scene=dict(
xaxis_title='X1',
yaxis_title='X2',
zaxis_title='Y'
),
width=width,
height=height,
)
)
return fig
fig = surface_scatter_plot(Xsim, ysim, Ey_x)
fig
We now fit a regression tree to this data and plot the predicted regression surface.
from sklearn import tree
fitted_tree = tree.DecisionTreeRegressor(max_depth=3).fit(Xsim,ysim)
fig=surface_scatter_plot(
Xsim, ysim, lambda x: fitted_tree.predict([x])[0], show_f0=True
)
fig
As you can see, predictions from regression trees are piecewise-constant on rectangular regions. The boundaries of these regions are determined by a decision tree. The following code displays the decision graph.
import graphviz
tree_graph = tree.export_graphviz(fitted_tree, out_file=None,
feature_names=["X1", "X2"],
filled=True, rounded=True,
special_characters=True)
display(graphviz.Source(tree_graph))
Regression trees are formed iteratively.
We begin with a rectangular region \(R\) containing all values of X.
We then choose a feature and location to split on, aiming to minimize MSE.
We then repeat to generate all the branches.
For each region, solve
\[\min_{j,s} \left[ \min_{c_1} \sum_{i: x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{i: x_i \in R_2(j,s)} (y_i - c_2)^2 \right]\]
where \(R_1(j,s) = \{x \in R: x_j \leq s\}\) and \(R_2(j,s) = \{x \in R: x_j > s\}\) are the two smaller rectangles created by splitting region \(R\) on feature \(j\) at location \(s\).
Repeat with each of the two smaller rectangles.
Stop when \(|R| =\) some chosen minimum size or when depth of tree \(=\) some chosen maximum.
Prune tree.
This tree-building algorithm has many variations, but every variation 1) shares some rule to decide on a splitting variable and location and 2) has a stopping rule (though not necessarily the same one).
For example, some algorithms stop splitting into new branches when the improvement in MSE becomes small.
As with lasso, regression trees also involve some regularization.
In the above description, the minimum leaf size, maximum tree depth, and \(\alpha\) in the pruning step serve as regularization parameters.
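In scikit-learn, these regularization parameters appear as constructor arguments of DecisionTreeRegressor. Here is a minimal sketch on the simulated data; the particular values are arbitrary, and max_depth, min_samples_leaf, and ccp_alpha correspond to the maximum depth, minimum leaf size, and pruning parameter described above:
# a lightly regularized tree on the simulated data (values chosen arbitrarily)
regularized_tree = tree.DecisionTreeRegressor(
    max_depth=4,          # maximum depth of the tree
    min_samples_leaf=20,  # minimum number of observations per leaf
    ccp_alpha=0.001,      # cost-complexity pruning parameter
).fit(Xsim, ysim)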
Exercise
See exercise 6 in the exercise list.
Exercise
See exercise 7 in the exercise list.
An advantage of regression trees (and random forests) is that they adapt automatically to feature scales and units.
For example, including zip code in linear regression or lasso was a little strange. While zip codes are numerical in value, they actually represent categorical variables. (i.e. we can compute 10025 - 85001 (NYC and Phoenix), but the numerical difference is meaningless.)
Including an indicator or dummy variable for each zip code would make more sense.
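Here is a minimal sketch of how one might build such indicators with pandas (the zip_ column names are simply whatever pd.get_dummies generates):
# one-hot encode zipcode so that linear models treat it as categorical
zip_dummies = pd.get_dummies(X["zipcode"].astype(int), prefix="zip")
X_with_zip_dummies = pd.concat([X.drop("zipcode", axis=1), zip_dummies], axis=1)
X_with_zip_dummies.shape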
Regression trees do not impose linearity or even monotonicity, so having the numeric zip code as a feature is less harmful than it could have been.
ax = var_scatter(df, var="zipcode")
zip_tree = tree.DecisionTreeRegressor(max_depth=10).fit(X[["zipcode"]],y)
scatter_model(zip_tree, X[["zipcode"]], ax, x="zipcode", color="red")
<Axes: xlabel='zipcode', ylabel='log_price'>
Random Forests#
A random forest is the average of many randomized regression trees.
Trees are randomized by:
Fitting each tree on a randomly resampled subset of the data
Considering only a random subset \(S\) of the features when choosing each split
Randomizing and averaging smooths out the predictions from individual trees.
This improves predictions and reduces the variance of the predictions.
# example of forest for simulated data
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 10).fit(Xsim,ysim)
fig=surface_scatter_plot(Xsim,ysim,lambda x: forest.predict([x])[0],
show_f0=True)
fig
Random forests generally produce more accurate predictions than any single tree.
However, random forests have at least two downsides compared to trees: longer computation time and more difficulty in interpretation.
We can no longer draw a single decision graph.
Instead, people often report “feature importance” for random forests.
Feature importance is the average MSE decrease caused by splits on each feature.
If a given feature has greater importance, the trees split on that feature more often and/or splitting on that feature resulted in larger MSE decreases.
forest.feature_importances_
array([0.7932192, 0.2067808])
Exercise
See exercise 8 in the exercise list.
Neural Networks#
The final regression algorithm we will discuss in this lecture is a type of neural network.
If you’re interested in this course, you have probably heard about neural networks in the news or social media.
The purpose of this section is not to give an exhaustive overview of the topic, but instead, to introduce you to a particular neural network model and present it from a different perspective that hopefully complements materials you may later encounter.
Mathematical Background#
If linear regression is the “Hello World” of regression algorithms, then the multi-layer perceptron (MLP) is the “Hello World” of neural networks.
We’ll start with a single (hidden) layer MLP and then build up to the general form.
The prediction function for a single layer MLP is
\[\hat{y} = f_1(X w_1 + b_1) w_2 + b_2\]
What we have here is nested linear regression (the \((\cdot) w_i + b_i\) parts), separated by an activation function (the \(f_1\)).
Let’s unpack what happens, starting from our \(N_\text{samples} \times N_\text{features}\) feature matrix \(X\).
First, \(X\) is multiplied by a coefficient matrix \(w_1\). \(w_1\) is often called the weight matrix or weights for short and has dimension \(N_{\text{features}} \times N_1\).
The vector \(b_1\) is added to each row. \(b_1\) is often called the bias vector or bias for short and has dimension \(N_1 \times 1\).
The function \(f_1\) is then applied. Typically, \(f_1\) is a non-linear function that is applied separately to each element. \(f_1\) is called the activation function.
The output is then multiplied by a weight matrix \(w_2\) with dimension \(N_1 \times 1\).
Finally, a scalar \(b_2\) is added to each row to generate the final prediction with dimension \(N_{\text{samples}} \times 1\).
The way we might write this in Python is:
y = f(X@w1 + b1)@w2 + b2
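To make the shapes concrete, here is a minimal runnable version of that one-liner in numpy; the dimensions and the logistic activation below are arbitrary choices for illustration:
import numpy as np

def logistic(z):
    # one common choice of activation function
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)
n_samples, n_features, n1 = 4, 3, 5
X_demo = rng.normal(size=(n_samples, n_features))
w1 = rng.normal(size=(n_features, n1))  # weight matrix, N_features x N_1
b1 = rng.normal(size=n1)                # bias vector, length N_1
w2 = rng.normal(size=(n1, 1))           # weight matrix, N_1 x 1
b2 = rng.normal(size=1)                 # scalar bias
y_hat = logistic(X_demo @ w1 + b1) @ w2 + b2
y_hat.shape                             # (n_samples, 1)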
In order to build an \(N\)-hidden layer MLP, we will nest additional linear regressions separated by activation functions.
The equation for this case is difficult to express, but has the following form
\[\hat{y} = f_N\left(\cdots f_2\left(f_1(X w_1 + b_1) w_2 + b_2\right) \cdots\right) w_{N+1} + b_{N+1}\]
where the \(\cdots\) represents layers 3 to \(N\).
Notice the pattern of a linear regression (\((\cdot) w + b\)), followed by applying an activation function (\(f\)) at each step.
Exercise
See exercise 9 in the exercise list.
The loss or error function typically used when using an MLP for regression is our now familiar mean squared error loss function:
\[\frac{1}{N_{\text{samples}}} || y - \hat{y} ||_2^2\]
where \(\hat{y}\) is the output of the neural network.
Here, we fit a neural network to the same simulated data that we used in the random forests section.
from sklearn import neural_network
nn = neural_network.MLPRegressor((6,), activation="logistic",
verbose=True, solver="lbfgs",
alpha=0.0).fit(Xsim,ysim)
fig=surface_scatter_plot(Xsim,ysim,lambda x: nn.predict([x])[0], show_f0=True)
fig
This problem is unconstrained.
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 25 M = 10
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.60327D-01 |proj g|= 5.34560D-01
At iterate 1 f= 1.68523D-02 |proj g|= 5.10573D-03
At iterate 2 f= 1.48259D-02 |proj g|= 4.65663D-03
At iterate 3 f= 1.04113D-02 |proj g|= 1.20142D-02
At iterate 4 f= 1.02899D-02 |proj g|= 1.51017D-02
At iterate 5 f= 1.01626D-02 |proj g|= 1.41704D-03
At iterate 6 f= 1.00973D-02 |proj g|= 5.06780D-03
At iterate 7 f= 9.96465D-03 |proj g|= 7.42677D-03
At iterate 8 f= 9.92669D-03 |proj g|= 2.96224D-03
At iterate 9 f= 9.91707D-03 |proj g|= 8.07742D-04
At iterate 10 f= 9.91138D-03 |proj g|= 2.01985D-03
At iterate 11 f= 9.88624D-03 |proj g|= 4.94904D-03
At iterate 12 f= 9.80447D-03 |proj g|= 9.54947D-03
At iterate 13 f= 9.60344D-03 |proj g|= 1.14903D-02
At iterate 14 f= 9.44857D-03 |proj g|= 2.18151D-03
At iterate 15 f= 9.44342D-03 |proj g|= 2.87760D-03
At iterate 16 f= 9.40579D-03 |proj g|= 1.62072D-03
At iterate 17 f= 9.39826D-03 |proj g|= 8.53619D-04
At iterate 18 f= 9.37970D-03 |proj g|= 2.97412D-03
At iterate 19 f= 9.36853D-03 |proj g|= 2.52931D-03
At iterate 20 f= 9.33223D-03 |proj g|= 1.40644D-03
At iterate 21 f= 9.23199D-03 |proj g|= 1.15537D-02
At iterate 22 f= 8.87393D-03 |proj g|= 2.01233D-03
At iterate 23 f= 8.68773D-03 |proj g|= 5.12278D-03
At iterate 24 f= 8.27859D-03 |proj g|= 2.17024D-03
At iterate 25 f= 8.06537D-03 |proj g|= 9.93757D-03
At iterate 26 f= 7.98253D-03 |proj g|= 2.69662D-03
At iterate 27 f= 7.88308D-03 |proj g|= 1.99813D-03
At iterate 28 f= 7.40010D-03 |proj g|= 1.16625D-02
At iterate 29 f= 7.22614D-03 |proj g|= 7.55860D-03
At iterate 30 f= 7.14615D-03 |proj g|= 1.20969D-02
At iterate 31 f= 7.06317D-03 |proj g|= 3.34451D-03
At iterate 32 f= 7.03695D-03 |proj g|= 1.98941D-03
At iterate 33 f= 7.01745D-03 |proj g|= 1.15439D-03
At iterate 34 f= 7.00334D-03 |proj g|= 2.09866D-03
At iterate 35 f= 6.96239D-03 |proj g|= 4.32848D-03
At iterate 36 f= 6.86495D-03 |proj g|= 7.52115D-03
At iterate 37 f= 6.68700D-03 |proj g|= 1.05365D-02
At iterate 38 f= 6.46277D-03 |proj g|= 5.97582D-03
At iterate 39 f= 6.37748D-03 |proj g|= 9.18538D-03
At iterate 40 f= 6.21336D-03 |proj g|= 2.69701D-03
At iterate 41 f= 6.18002D-03 |proj g|= 5.76160D-04
At iterate 42 f= 6.17390D-03 |proj g|= 3.06348D-04
At iterate 43 f= 6.17165D-03 |proj g|= 1.12136D-03
At iterate 44 f= 6.16859D-03 |proj g|= 3.18853D-04
At iterate 45 f= 6.16412D-03 |proj g|= 6.77600D-04
At iterate 46 f= 6.15688D-03 |proj g|= 1.79413D-03
At iterate 47 f= 6.14088D-03 |proj g|= 3.18492D-03
At iterate 48 f= 6.09598D-03 |proj g|= 1.27218D-03
At iterate 49 f= 6.05516D-03 |proj g|= 3.20392D-03
At iterate 50 f= 6.00774D-03 |proj g|= 8.79210D-04
At iterate 51 f= 5.99768D-03 |proj g|= 3.16541D-03
At iterate 52 f= 5.99587D-03 |proj g|= 1.43950D-03
At iterate 53 f= 5.98512D-03 |proj g|= 3.46133D-03
At iterate 54 f= 5.96713D-03 |proj g|= 2.49854D-03
At iterate 55 f= 5.92163D-03 |proj g|= 2.45866D-03
At iterate 56 f= 5.90326D-03 |proj g|= 3.89292D-03
At iterate 57 f= 5.87437D-03 |proj g|= 1.63581D-03
At iterate 58 f= 5.85845D-03 |proj g|= 4.32165D-03
At iterate 59 f= 5.82867D-03 |proj g|= 7.24685D-04
At iterate 60 f= 5.81776D-03 |proj g|= 2.92820D-03
At iterate 61 f= 5.80715D-03 |proj g|= 2.28658D-03
At iterate 62 f= 5.78409D-03 |proj g|= 2.22744D-03
At iterate 63 f= 5.76532D-03 |proj g|= 2.39708D-03
At iterate 64 f= 5.74772D-03 |proj g|= 5.81770D-04
At iterate 65 f= 5.74036D-03 |proj g|= 1.35781D-03
At iterate 66 f= 5.73413D-03 |proj g|= 1.06685D-03
At iterate 67 f= 5.72537D-03 |proj g|= 8.18816D-04
At iterate 68 f= 5.72213D-03 |proj g|= 1.98812D-03
At iterate 69 f= 5.71604D-03 |proj g|= 1.00104D-04
At iterate 70 f= 5.71207D-03 |proj g|= 8.94915D-04
At iterate 71 f= 5.70426D-03 |proj g|= 5.65597D-04
At iterate 72 f= 5.68885D-03 |proj g|= 1.52686D-03
At iterate 73 f= 5.68222D-03 |proj g|= 3.74601D-03
At iterate 74 f= 5.66483D-03 |proj g|= 4.88881D-03
At iterate 75 f= 5.60166D-03 |proj g|= 4.12363D-03
At iterate 76 f= 5.55490D-03 |proj g|= 4.69319D-03
At iterate 77 f= 5.46708D-03 |proj g|= 2.25327D-03
At iterate 78 f= 5.34508D-03 |proj g|= 1.21927D-03
At iterate 79 f= 5.32872D-03 |proj g|= 7.81540D-03
At iterate 80 f= 5.17629D-03 |proj g|= 4.13109D-03
At iterate 81 f= 5.14056D-03 |proj g|= 1.55078D-03
At iterate 82 f= 5.10271D-03 |proj g|= 4.48964D-03
At iterate 83 f= 5.09252D-03 |proj g|= 1.18761D-03
At iterate 84 f= 5.08228D-03 |proj g|= 1.24218D-03
At iterate 85 f= 5.05235D-03 |proj g|= 2.04408D-03
At iterate 86 f= 5.04271D-03 |proj g|= 6.23369D-04
At iterate 87 f= 5.03723D-03 |proj g|= 3.87559D-04
At iterate 88 f= 5.02911D-03 |proj g|= 1.88061D-03
At iterate 89 f= 5.02196D-03 |proj g|= 1.53289D-03
At iterate 90 f= 5.01416D-03 |proj g|= 1.37996D-03
At iterate 91 f= 5.00442D-03 |proj g|= 5.03467D-04
At iterate 92 f= 4.99542D-03 |proj g|= 1.44505D-03
At iterate 93 f= 4.98709D-03 |proj g|= 1.37548D-03
At iterate 94 f= 4.98573D-03 |proj g|= 5.28734D-04
At iterate 95 f= 4.98153D-03 |proj g|= 5.33027D-04
At iterate 96 f= 4.97516D-03 |proj g|= 8.31247D-04
At iterate 97 f= 4.96875D-03 |proj g|= 1.81205D-04
At iterate 98 f= 4.96337D-03 |proj g|= 1.29722D-03
At iterate 99 f= 4.95996D-03 |proj g|= 2.38150D-03
At iterate 100 f= 4.95830D-03 |proj g|= 2.63459D-04
At iterate 101 f= 4.95758D-03 |proj g|= 6.32125D-04
At iterate 102 f= 4.95642D-03 |proj g|= 9.28950D-04
At iterate 103 f= 4.95555D-03 |proj g|= 5.37568D-04
At iterate 104 f= 4.95482D-03 |proj g|= 1.48454D-04
At iterate 105 f= 4.95464D-03 |proj g|= 1.07764D-04
At iterate 106 f= 4.95449D-03 |proj g|= 2.24515D-04
At iterate 107 f= 4.95446D-03 |proj g|= 1.18516D-04
At iterate 108 f= 4.95444D-03 |proj g|= 7.67186D-05
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
25 108 140 1 0 0 7.672D-05 4.954D-03
F = 4.9544387870296257E-003
CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
We are nearly ready to test out an MLP on our housing data, but there are a few more talking points to cover:
[regHSW89] show that MLPs are universal approximators, meaning they are theoretically capable of approximating any function. This fact is sometimes stated as though it helps explain the exceptionally good predictive ability of neural networks. Do not be fooled by this fallacy. Many other methods are universal approximators, including regression trees and more classic statistical methods like kernel regression. The explanation for neural networks’ predictive success lies elsewhere. 1
The activation functions must be non-linear. If they were not, the MLP would just be combining linear combinations of linear combinations and would therefore always be linear (the short numerical check after this list makes this concrete).
The hidden layer structure of an MLP allows it to automatically perform feature engineering. In contrast, the example we worked through above required us to manually engineer the fraction of square feet above ground feature.
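To see why the non-linearity matters, here is a minimal numerical check (with arbitrary random weights) that a single hidden layer with an identity "activation" collapses to one linear regression:
rng = np.random.default_rng(0)
X_lin = rng.normal(size=(5, 3))
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)
# with an identity activation, the layered computation ...
two_step = (X_lin @ w1 + b1) @ w2 + b2
# ... equals a single linear regression with combined weights and bias
one_step = X_lin @ (w1 @ w2) + (b1 @ w2 + b2)
np.allclose(two_step, one_step)  # True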
Application#
Ok, now let’s try out our first neural network!
from sklearn import neural_network
X = df.drop(["price", "date", "id", "log_price"], axis=1).copy()
for col in list(X):
X[col] = X[col].astype(float)
y = np.log(df["price"])
# two hidden layers, with N1=30 and N2=20
nn_model = neural_network.MLPRegressor((30, 20))
nn_model.fit(X, y)
ax = var_scatter(df)
scatter_model(nn_model, X, ax=ax)
<Axes: xlabel='sqft_living', ylabel='log_price'>
Wow! That plot looks horrible. Let’s check the MSE.
mse_nn = metrics.mean_squared_error(y, nn_model.predict(X))
mse_nn / metrics.mean_squared_error(y, lr_model.predict(X))
98569.3265361064
So… after all that talk about neural networks being all-powerful, our neural network above produced a mean squared error which was tens of thousands of times larger than the MSE from a linear regression!
Input Scaling#
The issue here is that neural networks are extremely sensitive to the scale (both relative and absolute) of the input features.
The reasons for why are a bit beyond the scope of this lecture, but the main idea is that the training procedure pays too much attention to relatively larger features (relative scale) and becomes unstable if features are very large (absolute scale).
A common technique to overcome this issue is to scale each variable so that the observations have a mean of 0 and a standard deviation of 1.
This is known as scaling or normalizing the inputs.
Exercise
See exercise 10 in the exercise list.
If we decide to scale our variables, we must remember to apply the same transformation at prediction time as we did when we fit the model.
In practice, we must do three things:
Store the mean and standard deviation of each feature in the training set.
Subtract each feature’s mean from the training data and then divide by the feature’s standard deviation before fitting.
Subtract the training data’s mean and divide by training data’s standard deviation for all prediction inputs.
Applying the transformation to the prediction data is easily forgotten, so this is a tedious and somewhat error-prone process.
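Here is a minimal sketch of that manual bookkeeping, assuming X_train and X_test are DataFrames like the ones produced by train_test_split above:
# 1. store the training set's mean and standard deviation for each feature
train_mean = X_train.mean()
train_std = X_train.std()
# 2. standardize the training data before fitting
X_train_scaled = (X_train - train_mean) / train_std
# 3. apply the *training* mean and std to any prediction inputs
X_test_scaled = (X_test - train_mean) / train_std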
Thankfully, scikit-learn has a way to automate the process and ensure that it is always applied.
Let’s see an example:
from sklearn import preprocessing, pipeline
# the pipeline defines any number of steps that will be applied
# to transform the `X` data and then a final step that is a model
# we can use for prediction
nn_scaled_model = pipeline.make_pipeline(
preprocessing.StandardScaler(), # this will do the input scaling
neural_network.MLPRegressor((30, 20)) # put your favorite model here
)
# We can now use `model` like we have used our other models all along
# Call fit
nn_scaled_model.fit(X, y)
# Call predict
mse_nn_scaled = metrics.mean_squared_error(y, nn_scaled_model.predict(X))
print(f"Unscaled mse {mse_nn}")
print(f"Scaled mse {mse_nn_scaled}")
Unscaled mse 6276.411001222364
Scaled mse 0.029512809243241008
There we have it, much better. This is the smallest MSE we have seen so far.
A scatter plot of the predictions looks very similar to the observed prices.
ax = var_scatter(df)
scatter_model(nn_scaled_model, X, ax=ax)
<Axes: xlabel='sqft_living', ylabel='log_price'>
Tradeoffs#
So we’ve seen that neural networks are very flexible and can approximate highly nonlinear functions – but they have a few cons.
We’ll discuss a few of them here.
Interpretability: Unlike linear regression or lasso, neural networks are not easily interpretable. We could look at the \(w\) matrices or \(b\) vectors, but the nested composition and nonlinear activation functions make it very difficult to interpret just how each coefficient impacts the output. In settings like making economic policy recommendations or decisions with ethical consequences (e.g. approving loans, screening), the lack of interpretability can be a non-starter.
Efficiency/time: Neural networks require more computational power to evaluate (generate predictions) and are orders of magnitude more expensive to train than classical machine learning methods.
Automated feature engineering: The nested linear regressions allow neural networks to learn data features that are composed of arbitrary linear combinations of the original feature set. The non-linear activation functions allow the network to learn arbitrary non-linear features. Manual feature engineering is based largely on the researcher’s intuition, as well as a fair amount of trial and error. Determining the proper features that allow for more explanatory power without overfitting is very difficult. Neural networks automate that process by using the data itself to guide the training process and select features that satisfy accuracy and regularization conditions.
Overfitting: Neural networks’ flexibility and explanatory power make them easy to accidentally overfit. When training neural networks, the various approaches to regularization should be studied and evaluated, especially when building networks used for decision-making.
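As a starting point (the particular settings below are arbitrary), scikit-learn's MLPRegressor exposes an L2 penalty on the weights via alpha and an early_stopping option that holds out part of the training data and stops training when the validation score stops improving:
# a sketch of two regularization levers available in MLPRegressor
nn_regularized_model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPRegressor((30, 20), alpha=1e-3, early_stopping=True),
)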
Exercise
See exercise 11 in the exercise list.
References#
Two good textbooks covering the above regression methods are [regFHT09] and [regEH16].
- regAI18
Susan Athey and Guido Imbens. Machine learning and econometrics. 2018. URL: https://www.aeaweb.org/conference/cont-ed/2018-webcasts.
- regAI17
Susan Athey and Guido W. Imbens. The state of applied econometrics: causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32, May 2017. URL: http://www.aeaweb.org/articles?id=10.1257/jep.31.2.3, doi:10.1257/jep.31.2.3.
- regBC11
Alexandre Belloni and Victor Chernozhukov. High Dimensional Sparse Econometric Models: An Introduction, pages 121–156. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. URL: https://doi.org/10.1007/978-3-642-19989-9_3, doi:10.1007/978-3-642-19989-9_3.
- regCCD+18
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097, arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/ectj.12097, doi:10.1111/ectj.12097.
- regCHS16
Victor Chernozhukov, Chris Hansen, and Martin Spindler. hdm: high-dimensional metrics. R Journal, 8(2):185–199, 2016. URL: https://journal.r-project.org/archive/2016/RJ-2016-040/index.html.
- regEH16
Bradley Efron and Trevor Hastie. Computer age statistical inference. Volume 5. Cambridge University Press, 2016. URL: https://web.stanford.edu/~hastie/CASI/.
- regFHT09
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics, 2009. URL: https://web.stanford.edu/~hastie/ElemStatLearn/.
- regHSW89
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 – 366, 1989. URL: http://www.sciencedirect.com/science/article/pii/0893608089900208, doi:https://doi.org/10.1016/0893-6080(89)90020-8.
Exercises#
Exercise 1#
Use the sqft_lr_model that we fit to generate predictions for all data points in our sample.
Note that you need to pass a DataFrame (not a Series) containing the sqft_living column to the predict method. (See how we passed that to .fit above for help.)
Make a scatter chart with the actual data and the predictions on the same figure. Does it look familiar?
When making the scatter for model predictions, we recommend passing c="red" and alpha=0.25 so you can distinguish the data from predictions.
# Generate predictions
# Plot
fig, ax = plt.subplots()
# Make scatter of data
# Make scatter of predictions
Exercise 2#
Use the metrics.mean_squared_error function to evaluate the loss function used by sklearn when it fits the model for us.
Read the docstring to learn which arguments that function takes.
from sklearn import metrics
# your code here
Exercise 3#
Compare the mean squared error for the lr_model and the sqft_lr_model.
Which model has a better fit? Defend your choice.
Exercise 4#
Explore how you can improve the fit of the full model by adding additional features created from the existing ones.
# your code here
Exercise 5#
Experiment with how the size of the holdout dataset can impact a diagnosis of overfitting.
Evaluate only the LinearRegression model on the full feature set and use the model_selection.train_test_split function with various values for test_size.
Exercise 6#
Read the documentation for sklearn.tree.DecisionTreeRegressor and then experiment with adjusting some regularization parameters to see how they affect the fitted tree.
# plot trees when varying some regularization parameter(s)
Exercise 7#
Fit a regression tree to the housing price data and use graphviz to visualize the decision graph.
# your code here
Exercise 8#
Fit a random forest to the housing price data.
Compare the MSE on a testing set to that of lasso.
# Fit random forest and compute MSE
Produce a bar chart of feature importances for predicting house prices.
Exercise 9#
In the pseudocode below, fill in the blanks for the generic MLP.
Note that this is inside a markdown cell because the code is not valid Python.
ws = [w1, w2, ..., wend]
bs = [b1, b2, ..., bend]
def eval_mlp(X, ws, bs, f):
"""
    Evaluate an MLP given weights (ws), biases (bs), and an activation (f)
Assumes that the same activation is applied to all hidden layers
"""
N = len(ws) - 1
out = X
for i in range(N):
out = f(__) # replace the __
# For this step remember python starts counting at 0!
return out@__ + __ # replace the __
Exercise 10#
Scale all variables in X by subtracting their mean and dividing by the standard deviation.
Verify that the transformed data has mean 0 and standard deviation 1.
# your code here
Exercise 11#
Read the documentation for sklearn.neural_network.MLPRegressor and use the full housing data to experiment with how adjusting layer depth, width, and other regularization parameters affects prediction.
# your code here
- 1
Two facts about neural networks are relevant to their predictive success: automatic feature engineering, as mentioned above, and the ability of neural networks to approximate a broad (but not quite universal) class of functions with relatively few parameters. This gives neural networks their fast statistical convergence rate. Under appropriate assumptions, lasso, series regression, and kernel regression share this fast convergence rate property, but they lack automatic feature engineering. On the other hand, random forests have automatic feature engineering, but do not have a fast convergence rate. Neural networks are somewhat unique in combining both properties. See these notes and references therein for more information about convergence rates.