\(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\)

2.5. Application to regression analysis#

We return to our motivating example, the regression problem, and apply the least squares approach.

2.5.1. Linear and polynomial regression#

Linear regression We seek an affine function to fit the input data points \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\), where each \(\mathbf{x}_i = (x_{i1}, \ldots, x_{id})^T\) has \(d\) components. The common approach is to find coefficients \(\beta_j\) that minimize the criterion

\[ \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j x_{ij}\right\}\right)^2. \]

This is indeed a linear least squares problem.

Figure: A regression line

\(\bowtie\)

In matrix form, let

\[\begin{split} \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\quad A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix} \quad\text{and}\quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}. \end{split}\]

Then the problem is

\[ \min_{\boldsymbol{\beta}} \|\mathbf{y} - A \boldsymbol{\beta}\|^2. \]

We assume that the columns of \(A\) are linearly independent, which is typically the case with real data. The normal equations are then

\[ A^T A \boldsymbol{\beta} = A^T \mathbf{y}. \]
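Since the columns of \(A\) are linearly independent, \(A^T A\) is invertible and the normal equations have the unique solution \(\boldsymbol{\beta} = (A^T A)^{-1} A^T \mathbf{y}\). A minimal NumPy sketch of solving them directly (the helper name is ours; the QR-based solver used below is preferred numerically):

import numpy as np

def ls_by_normal_equations(A, y):
    # Solve A^T A beta = A^T y; valid when A has linearly independent columns.
    return np.linalg.solve(A.T @ A, A.T @ y)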

NUMERICAL CORNER: We test our least-squares method on simulated data. This has the advantage that we know the truth.
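The numerical cells in this section assume the usual setup from earlier in the notes, along these lines (the seed value is an arbitrary choice made for reproducibility):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mmids  # companion module from the notes' GitHub repository

rng = np.random.default_rng(535)  # any fixed seed makes the noise reproducible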

Suppose the truth is a linear function of one variable.

n, b0, b1 = 100, -1, 1
x = np.linspace(0,10,num=n)
y = b0 + b1*x
plt.scatter(x,y,alpha=0.5)
plt.show()

A perfect straight line is a little too easy, so let’s add some noise. That is, to each \(y_i\) we add an independent random variable \(\varepsilon_i\) with a standard Normal distribution (mean \(0\), variance \(1\)).

y += rng.normal(0,1,n)
plt.scatter(x,y,alpha=0.5)
plt.show()

We form the matrix \(A\) and use our least-squares code to solve for \(\boldsymbol{\hat\beta}\). The function ls_by_qr, which we implemented previously, is in mmids.py, available from the notes' GitHub repository.
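For reference, here is a sketch of how such a QR-based least-squares solver can be written (one possible implementation; the version in mmids.py may differ in its details):

import numpy as np
from scipy.linalg import solve_triangular

def ls_by_qr_sketch(A, b):
    # Reduced QR decomposition: A = QR with orthonormal columns in Q
    # and R square upper triangular.
    Q, R = np.linalg.qr(A)
    # The least-squares problem reduces to R beta = Q^T b,
    # solved by back substitution.
    return solve_triangular(R, Q.T @ b, lower=False)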

A = np.stack((np.ones(n),x),axis=-1)
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-1.03381171  1.01808039]
plt.scatter(x,y,alpha=0.5)
plt.plot(x,coeff[0]+coeff[1]*x,'r')
plt.show()

\(\unlhd\)

Beyond linearity The linear assumption is not as restrictive as it may first appear. The same approach extends straightforwardly to fitting polynomials or more complicated combinations of functions. For instance, suppose \(d=1\). To fit a second-degree polynomial to the data \(\{(x_i, y_i)\}_{i=1}^n\), we add a column to the \(A\) matrix with the squares of the \(x_i\)’s. That is, we let

\[\begin{split} A = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}. \end{split}\]

Then, we are indeed fitting a degree-two polynomial as follows

\[ (A \boldsymbol{\beta})_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2. \]

The solution otherwise remains the same.

This idea of adding columns can also be used to model interactions between predictors. Suppose \(d=2\). Then we can consider the following \(A\) matrix, where the last column combines both predictors into their product,

\[\begin{split} A = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{11} x_{12} \\ 1 & x_{21} & x_{22} & x_{21} x_{22} \\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_{n1} & x_{n2} & x_{n1} x_{n2} \end{pmatrix}. \end{split}\]
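For concreteness, such a design matrix is straightforward to assemble by column stacking in NumPy (a sketch; here x1 and x2 stand for the two predictor columns):

import numpy as np

def interaction_design_matrix(x1, x2):
    # Columns: intercept, each predictor, and their elementwise product.
    n = len(x1)
    return np.stack((np.ones(n), x1, x2, x1 * x2), axis=-1)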

NUMERICAL CORNER: Suppose the truth is in fact a degree-two polynomial of one variable with Gaussian noise.

n, b0, b1, b2 = 100, 0, 0, 1
x = np.linspace(0,10,num=n)
y = b0 + b1 * x + b2 * x**2 + 10*rng.normal(0,1,n)
plt.scatter(x,y,alpha=0.5)
plt.show()

We form the matrix \(A\) and use our least-squares code to solve for \(\boldsymbol{\hat\beta}\).

A = np.stack((np.ones(n), x, x**2), axis=-1)
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-2.76266982  1.01627798  0.93554204]
plt.scatter(x,y,alpha=0.5)
plt.plot(x, coeff[0] + coeff[1] * x + coeff[2] * x**2, 'r')
plt.show()

\(\unlhd\)

2.5.2. Overfitting in polynomial regression#

We return to the Advertising dataset from the [ISLP] textbook.

Figure: Pie chart (Credit: Made with Midjourney)

\(\bowtie\)

df = pd.read_csv('advertising.csv')
df.head()
      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9

We will focus for now on the TV budget.

TV = df['TV'].to_numpy()
sales = df['sales'].to_numpy()
plt.scatter(TV, sales)
plt.xlabel('TV')
plt.ylabel('sales')
plt.show()

We form the matrix \(A\) and use our least-squares code to solve for \(\boldsymbol{\beta}\).

n = np.size(TV)
A = np.stack((np.ones(n),TV),axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[7.03259355 0.04753664]
TVgrid = np.linspace(TV.min(), TV.max(), num=100)
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid,coeff[0]+coeff[1]*TVgrid,'r')
plt.show()

A degree-two polynomial might be a better fit.

A = np.stack((np.ones(n), TV, TV**2), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 6.11412013e+00  6.72659270e-02 -6.84693373e-05]
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid, coeff[0] + coeff[1] * TVgrid + coeff[2] * TVgrid**2,'r')
plt.show()

The fit looks slightly better than the linear one. This is not entirely surprising, though, since the linear model is a special case of the quadratic one, so adding the quadratic term can only decrease the least-squares criterion. But in adding more parameters, one must worry about overfitting. To quote Wikipedia:

In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.[3]

To illustrate, let’s see what happens with a degree-\(20\) polynomial fit.

deg = 20
A = np.stack([TV**i for i in range(deg+1)], axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 1.06538698e+00  6.72896471e-01 -1.53138969e-02 -2.74088516e-04
  1.83651714e-05 -3.40080020e-07  3.17915742e-09 -1.64042005e-11
  4.43633296e-14 -4.25654490e-17 -5.28727398e-20  1.11822932e-22
 -3.47096893e-27 -2.44665112e-30 -2.79435976e-33 -4.05263859e-36
 -6.83137511e-39 -1.27993830e-41 -2.59569760e-44 -5.59960687e-47
 -1.26949578e-49]
saleshat = np.sum([coeff[i] * TVgrid**i for i in range(deg+1)], axis=0)
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid, saleshat, 'r')
plt.show()

We could use cross-validation to choose a suitable degree: hold out part of the data, fit each candidate degree on the rest, and compare the errors on the held-out part.
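A rough sketch of such a \(k\)-fold scheme, assuming mmids.ls_by_qr is available as above (the helper name and the choice of \(k\) are ours):

import numpy as np
import mmids

def cv_error_for_degree(x, y, deg, k=5, seed=0):
    # k-fold cross-validation estimate of the out-of-sample squared error
    # for a degree-deg polynomial fit.
    rng_cv = np.random.default_rng(seed)
    n = len(x)
    folds = np.array_split(rng_cv.permutation(n), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        A_train = np.stack([x[train]**i for i in range(deg+1)], axis=-1)
        coeff = mmids.ls_by_qr(A_train, y[train])
        pred = np.sum([coeff[i] * x[fold]**i for i in range(deg+1)], axis=0)
        errors.append(np.mean((y[fold] - pred)**2))
    return np.mean(errors)

One would then compare, say, cv_error_for_degree(TV, sales, deg) over a range of degrees and keep the degree with the smallest estimated error.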

2.5.3. Bootstrapping for linear regression#

We return to the linear case, but with the full set of predictors.

radio = df['radio'].to_numpy()
newspaper = df['newspaper'].to_numpy()
f, (ax1, ax2) = plt.subplots(1, 2, sharex=False, sharey=True)
ax1.scatter(radio,sales,alpha=0.5)
ax1.set_title('radio')
ax2.scatter(newspaper,sales,alpha=0.5)
ax2.set_title('newspaper')
plt.show()
A = np.stack((np.ones(n), TV, radio, newspaper), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 2.93888937e+00  4.57646455e-02  1.88530017e-01 -1.03749304e-03]

Newspaper advertising (the last coefficient) seems to have a much weaker effect on sales per dollar spent. Next, we briefly sketch one way to assess the statistical significance of such a conclusion.

Our coefficients are estimated from a sample. There is intrinsic variability in our sampling procedure. We would like to understand how our estimated coefficients compare to the true coefficients. This is set up beautifully in [Data8, Section 13.2]:

A data scientist is using the data in a random sample to estimate an unknown parameter. She uses the sample to calculate the value of a statistic that she will use as her estimate. Once she has calculated the observed value of her statistic, she could just present it as her estimate and go on her merry way. But she’s a data scientist. She knows that her random sample is just one of numerous possible random samples, and thus her estimate is just one of numerous plausible estimates. By how much could those estimates vary? To answer this, it appears as though she needs to draw another sample from the population, and compute a new estimate based on the new sample. But she doesn’t have the resources to go back to the population and draw another sample. It looks as though the data scientist is stuck. Fortunately, a brilliant idea called the bootstrap can help her out. Since it is not feasible to generate new samples from the population, the bootstrap generates new random samples by a method called resampling: the new samples are drawn at random from the original sample.

Without going into full details (see [DS100, Section 17.3] for more), it works as follows. Let \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\) be our data. We assume that our sample is representative of the population and we simulate our sampling procedure by resampling from the sample. That is, we take a random sample with replacement \(\mathcal{X}_{\mathrm{boot},1} = \{(\mathbf{x}_i, y_i)\,:\,i \in I\}\) where \(I\) is a multi-set of elements from \([n]\) of size \(n\). We recompute our estimated coefficients on \(\mathcal{X}_{\mathrm{boot},1}\). Then we repeat independently for a desired number of replicates \(\mathcal{X}_{\mathrm{boot},1}, \ldots, \mathcal{X}_{\mathrm{boot},r}\). Plotting a histogram of the resulting coefficients gives some idea of the variability of our estimates.

We implement the bootstrap for linear regression in Python next.

def linregboot(A, b, replicates=10000):
    # Bootstrap the least-squares coefficients: resample the rows of (A, b)
    # with replacement, refit on each resample, and store the coefficients.
    n, m = A.shape
    coeff_boot = np.zeros((m, replicates))
    for i in range(replicates):
        resample = rng.integers(0, n, n)  # n row indices, drawn with replacement
        Aboot = A[resample,:]
        bboot = b[resample]
        coeff_boot[:,i] = mmids.ls_by_qr(Aboot, bboot)
    return coeff_boot

First, let’s revisit the simple simulated example from earlier in this section, where the ground truth is known.

n, b0, b1 = 100, -1, 1
x = np.linspace(0,10,num=n)
y = b0 + b1*x + rng.normal(0,1,n)
A = np.stack((np.ones(n),x),axis=-1)

The estimated coefficients are the following.

coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-1.48048802  1.08741565]

Now we apply the bootstrap and plot histograms of the two coefficients.

coeff_boot = linregboot(A,y)
plt.hist(coeff_boot[0,:])
plt.show()
plt.hist(coeff_boot[1,:])
plt.show()

We see in the histograms that the true coefficient values \(-1\) and \(1\) fall within the likely range.

We return to the Advertising dataset and apply the bootstrap.

n = np.size(TV)
A = np.stack((np.ones(n), TV, radio, newspaper), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 2.93888937e+00  4.57646455e-02  1.88530017e-01 -1.03749304e-03]
coeff_boot = linregboot(A,sales)

Plotting a histogram of the coefficient corresponding to newspaper advertising shows that \(0\) is a plausible value for it, while it is not for the TV advertising coefficient. Bootstrap percentile intervals, sketched after the histograms, make this comparison more quantitative.

plt.hist(coeff_boot[1,:])
plt.show()
plt.hist(coeff_boot[3,:])
plt.show()
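To quantify the spread, one can report bootstrap percentile intervals for each coefficient. A minimal sketch, using the coeff_boot array computed above (the 95% level is a common but arbitrary choice):

import numpy as np

# 95% bootstrap percentile interval for each coefficient:
# rows of coeff_boot index coefficients, columns index replicates.
lower = np.percentile(coeff_boot, 2.5, axis=1)
upper = np.percentile(coeff_boot, 97.5, axis=1)
for name, lo, hi in zip(['intercept', 'TV', 'radio', 'newspaper'], lower, upper):
    print(f"{name}: [{lo:.4f}, {hi:.4f}]")

An interval that contains \(0\), as we expect for the newspaper coefficient, is consistent with no detectable effect at that level.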