\(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\)
2.5. Application to regression analysis#
We return to our motivating example, the regression problem, and apply the least squares approach.
2.5.1. Linear and polynomial regression#
Linear regression We seek an affine function to fit input data points \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\), where \(\mathbf{x}_i \in \mathbb{R}^d\). The common approach involves finding coefficients \(\beta_j\)'s that minimize the criterion

$$
\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^d \beta_j x_{i,j}\right)^2,
$$

where \(x_{i,j}\) denotes the \(j\)-th component of \(\mathbf{x}_i\). This is indeed a linear least squares problem.
Figure: A regression line (Source)
\(\bowtie\)
In matrix form, let

$$
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\qquad
A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix},
\qquad
\bfbeta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_d \end{pmatrix}.
$$

Then the problem is

$$
\min_{\bfbeta \in \mathbb{R}^{d+1}} \|\mathbf{y} - A \bfbeta\|^2.
$$

We assume that the columns of \(A\) are linearly independent, which is typically the case with real data. The normal equations are then

$$
A^T A \,\bfbeta = A^T \mathbf{y}.
$$
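For reference, a QR-based least-squares solver can be sketched in a few lines (a minimal sketch using NumPy's `qr`; the notes' own implementation is `ls_by_qr` in `mmids.py`):

```python
import numpy as np

def ls_by_qr_sketch(A, b):
    # Reduced QR factorization: A = QR with Q having orthonormal columns
    Q, R = np.linalg.qr(A)
    # Solve the triangular system R beta = Q^T b
    return np.linalg.solve(R, Q.T @ b)
```

This avoids forming \(A^T A\) explicitly, which is numerically preferable when the columns of \(A\) are close to linearly dependent.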
NUMERICAL CORNER: We test our least-squares method on simulated data. This has the advantage that we know the truth.
Suppose the truth is a linear function of one variable.
import numpy as np
import matplotlib.pyplot as plt

n, b0, b1 = 100, -1, 1
x = np.linspace(0,10,num=n)
y = b0 + b1*x
Show code cell source
plt.scatter(x,y,alpha=0.5)
plt.show()
A perfect straight line is a little too easy. So let's add some noise. That is, to each \(y_i\) we add an independent random variable \(\varepsilon_i\) with a standard Normal distribution (mean \(0\), variance \(1\)).

rng = np.random.default_rng()  # random number generator (seeded earlier in the notes)
y += rng.normal(0,1,n)
Show code cell source
plt.scatter(x,y,alpha=0.5)
plt.show()
We form the matrix \(A\) and use our least-squares code to solve for \(\boldsymbol{\hat\beta}\). The function ls_by_qr, which we implemented previously, is in mmids.py, available on the GitHub repository of the notes.
import mmids

A = np.stack((np.ones(n),x),axis=-1)
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-1.03381171 1.01808039]
Show code cell source
plt.scatter(x,y,alpha=0.5)
plt.plot(x,coeff[0]+coeff[1]*x,'r')
plt.show()
\(\unlhd\)
Beyond linearity The linear assumption is not as restrictive as it may first appear. The same approach extends straightforwardly to fitting polynomials or more complicated combinations of functions. For instance, suppose \(d=1\). To fit a degree-two polynomial to the data \(\{(x_i, y_i)\}_{i=1}^n\), we add a column to the matrix \(A\) with the squares of the \(x_i\)'s. That is, we let

$$
A = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix},
\qquad
\bfbeta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}.
$$

Then we are indeed fitting a degree-two polynomial

$$
\hat{y}(x) = \beta_0 + \beta_1 x + \beta_2 x^2.
$$

The solution otherwise remains the same.
This idea of adding columns can also be used to model interactions between predictors. Suppose \(d=2\). Then we can consider the following matrix \(A\), where the last column combines both predictors into their product,

$$
A = \begin{pmatrix}
1 & x_{1,1} & x_{1,2} & x_{1,1}\,x_{1,2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n,1} & x_{n,2} & x_{n,1}\,x_{n,2}
\end{pmatrix}.
$$
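In code, the interaction column is just the entrywise product of the two predictor columns. A quick illustration (the arrays `x1` and `x2` below are made-up values):

```python
import numpy as np

# two hypothetical predictors, observed on n = 3 data points
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])
n = len(x1)

# columns: intercept, x1, x2, and the interaction x1 * x2
A = np.stack((np.ones(n), x1, x2, x1 * x2), axis=-1)
```

The resulting matrix has shape \((n, 4)\), and the same least-squares machinery applies unchanged.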
NUMERICAL CORNER: Suppose the truth is in fact a degree-two polynomial of one variable with Gaussian noise.
n, b0, b1, b2 = 100, 0, 0, 1
x = np.linspace(0,10,num=n)
y = b0 + b1 * x + b2 * x**2 + 10*rng.normal(0,1,n)
Show code cell source
plt.scatter(x,y,alpha=0.5)
plt.show()
We form the matrix \(A\) and use our least-squares code to solve for \(\boldsymbol{\hat\beta}\).
A = np.stack((np.ones(n), x, x**2), axis=-1)
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-2.76266982 1.01627798 0.93554204]
Show code cell source
plt.scatter(x,y,alpha=0.5)
plt.plot(x, coeff[0] + coeff[1] * x + coeff[2] * x**2, 'r')
plt.show()
\(\unlhd\)
2.5.2. Overfitting in polynomial regression#
We return to the Advertising dataset from the [ISLP] textbook.
Figure: Pie chart (Credit: Made with Midjourney)
\(\bowtie\)
import pandas as pd

df = pd.read_csv('advertising.csv')
df.head()
|  | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |
We will focus for now on the TV budget.
TV = df['TV'].to_numpy()
sales = df['sales'].to_numpy()
Show code cell source
plt.scatter(TV, sales)
plt.xlabel('TV')
plt.ylabel('sales')
plt.show()
We form the matrix \(A\) and use our least-squares code to solve for \(\boldsymbol{\hat\beta}\).
n = np.size(TV)
A = np.stack((np.ones(n),TV),axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[7.03259355 0.04753664]
Show code cell source
TVgrid = np.linspace(TV.min(), TV.max(), num=100)
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid,coeff[0]+coeff[1]*TVgrid,'r')
plt.show()
A degree-two polynomial might be a better fit.
A = np.stack((np.ones(n), TV, TV**2), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 6.11412013e+00 6.72659270e-02 -6.84693373e-05]
Show code cell source
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid, coeff[0] + coeff[1] * TVgrid + coeff[2] * TVgrid**2,'r')
plt.show()
The fit looks slightly better than the linear one. This is not entirely surprising, given that the linear model is a special case of the quadratic one. But in adding more parameters, one must worry about overfitting. To quote Wikipedia:
In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.[3]
To illustrate, let’s see what happens with a degree-\(20\) polynomial fit.
deg = 20
A = np.stack([TV**i for i in range(deg+1)], axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 1.06538698e+00 6.72896471e-01 -1.53138969e-02 -2.74088516e-04
1.83651714e-05 -3.40080020e-07 3.17915742e-09 -1.64042005e-11
4.43633296e-14 -4.25654490e-17 -5.28727398e-20 1.11822932e-22
-3.47096893e-27 -2.44665112e-30 -2.79435976e-33 -4.05263859e-36
-6.83137511e-39 -1.27993830e-41 -2.59569760e-44 -5.59960687e-47
-1.26949578e-49]
saleshat = np.sum([coeff[i] * TVgrid**i for i in range(deg+1)], axis=0)
Show code cell source
plt.scatter(TV,sales,alpha=0.5)
plt.plot(TVgrid, saleshat, 'r')
plt.show()
We could use cross-validation to choose a suitable degree.
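A minimal sketch of what such a cross-validation loop might look like, using `np.linalg.lstsq` as a stand-in for our solver (the fold count and candidate degrees are illustrative choices):

```python
import numpy as np

def cv_mse_by_degree(x, y, degrees, k=5, seed=0):
    """Estimate out-of-sample MSE for each polynomial degree by k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    mse = []
    for deg in degrees:
        errs = []
        for j in range(k):
            test = folds[j]
            train = np.concatenate([folds[l] for l in range(k) if l != j])
            # monomial design matrices on the train and test folds
            Atr = np.stack([x[train]**i for i in range(deg + 1)], axis=-1)
            Ate = np.stack([x[test]**i for i in range(deg + 1)], axis=-1)
            coeff, *_ = np.linalg.lstsq(Atr, y[train], rcond=None)
            errs.append(np.mean((Ate @ coeff - y[test])**2))
        mse.append(np.mean(errs))
    return mse
```

One would then pick the degree with the smallest estimated out-of-sample error; an overfit degree shows up as a small training error but a large cross-validated one.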
2.5.3. Bootstrapping for linear regression#
We return to the linear case, but with the full set of predictors.
radio = df['radio'].to_numpy()
newspaper = df['newspaper'].to_numpy()
Show code cell source
f, (ax1, ax2) = plt.subplots(1, 2, sharex=False, sharey=True)
ax1.scatter(radio,sales,alpha=0.5)
ax1.set_title('radio')
ax2.scatter(newspaper,sales,alpha=0.5)
ax2.set_title('newspaper')
plt.show()
A = np.stack((np.ones(n), TV, radio, newspaper), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 2.93888937e+00 4.57646455e-02 1.88530017e-01 -1.03749304e-03]
Newspaper advertising (the last coefficient) seems to have a much weaker effect on sales per dollar spent. Next, we briefly sketch one way to assess the statistical significance of such a conclusion.
Our coefficients are estimated from a sample. There is intrinsic variability in our sampling procedure. We would like to understand how our estimated coefficients compare to the true coefficients. This is set up beautifully in [Data8, Section 13.2]:
A data scientist is using the data in a random sample to estimate an unknown parameter. She uses the sample to calculate the value of a statistic that she will use as her estimate. Once she has calculated the observed value of her statistic, she could just present it as her estimate and go on her merry way. But she’s a data scientist. She knows that her random sample is just one of numerous possible random samples, and thus her estimate is just one of numerous plausible estimates. By how much could those estimates vary? To answer this, it appears as though she needs to draw another sample from the population, and compute a new estimate based on the new sample. But she doesn’t have the resources to go back to the population and draw another sample. It looks as though the data scientist is stuck. Fortunately, a brilliant idea called the bootstrap can help her out. Since it is not feasible to generate new samples from the population, the bootstrap generates new random samples by a method called resampling: the new samples are drawn at random from the original sample.
Without going into full details (see [DS100, Section 17.3] for more), it works as follows. Let \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\) be our data. We assume that our sample is representative of the population and we simulate our sampling procedure by resampling from the sample. That is, we take a random sample with replacement \(\mathcal{X}_{\mathrm{boot},1} = \{(\mathbf{x}_i, y_i)\,:\,i \in I\}\) where \(I\) is a multi-set of elements from \([n]\) of size \(n\). We recompute our estimated coefficients on \(\mathcal{X}_{\mathrm{boot},1}\). Then we repeat independently for a desired number of replicates \(\mathcal{X}_{\mathrm{boot},1}, \ldots, \mathcal{X}_{\mathrm{boot},r}\). Plotting a histogram of the resulting coefficients gives some idea of the variability of our estimates.
We implement the bootstrap for linear regression in Python next.
def linregboot(A, b, replicates=10000):
    # resample rows of (A, b) with replacement and re-estimate the coefficients
    n, m = A.shape
    coeff_boot = np.zeros((m, replicates))
    for i in range(replicates):
        resample = rng.integers(0, n, n)
        Aboot = A[resample, :]
        bboot = b[resample]
        coeff_boot[:, i] = mmids.ls_by_qr(Aboot, bboot)
    return coeff_boot
First, let’s use a simple example from the lecture with a known ground truth.
n, b0, b1 = 100, -1, 1
x = np.linspace(0,10,num=n)
y = b0 + b1*x + rng.normal(0,1,n)
A = np.stack((np.ones(n),x),axis=-1)
The estimated coefficients are the following.
coeff = mmids.ls_by_qr(A,y)
print(coeff)
[-1.48048802 1.08741565]
Now we apply the bootstrap and plot histograms of the two coefficients.
coeff_boot = linregboot(A,y)
Show code cell source
plt.hist(coeff_boot[0,:])
plt.show()
Show code cell source
plt.hist(coeff_boot[1,:])
plt.show()
We see in the histograms that the true coefficient values \(-1\) and \(1\) fall within the likely range.
We return to the Advertising
dataset and apply the bootstrap.
n = np.size(TV)
A = np.stack((np.ones(n), TV, radio, newspaper), axis=-1)
coeff = mmids.ls_by_qr(A,sales)
print(coeff)
[ 2.93888937e+00 4.57646455e-02 1.88530017e-01 -1.03749304e-03]
coeff_boot = linregboot(A,sales)
Plotting a histogram of the coefficients corresponding to newspaper advertising shows that \(0\) is a plausible value, while it is not for TV advertising.
Show code cell source
plt.hist(coeff_boot[1,:])
plt.show()
Show code cell source
plt.hist(coeff_boot[3,:])
plt.show()
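One way to summarize these histograms numerically is a percentile confidence interval over the bootstrap replicates (a minimal sketch; the 95% level is a conventional choice):

```python
import numpy as np

def percentile_ci(samples, level=0.95):
    """Percentile bootstrap confidence interval from replicate estimates."""
    alpha = (1 - level) / 2
    return np.quantile(samples, alpha), np.quantile(samples, 1 - alpha)
```

For instance, `percentile_ci(coeff_boot[3,:])` would give an interval for the newspaper coefficient; if the interval straddles \(0\), the data do not rule out a null effect at that level.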