
2.1. Motivating example: predicting sales

Figure: Helpful map of ML by scikit-learn (Source)

The following dataset is from the excellent textbook [ISLP]. Quoting [ISLP, Section 2.1]:

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. […] It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

This is a regression problem. That is, we want to estimate the relationship between an outcome variable and one or more predictors (or features). We load the data and show its first few lines together with some summary statistics.

df = pd.read_csv('advertising.csv')
df.head()
      TV  radio  newspaper  sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9
df.describe()
               TV       radio   newspaper       sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000

We will focus for now on the TV budget.

TV = df['TV'].to_numpy()
sales = df['sales'].to_numpy()

We make a scatter plot showing the relation between these two quantities.

plt.scatter(TV, sales)
plt.xlabel('TV')
plt.ylabel('sales')
plt.show()
Figure: scatter plot of sales against TV budget.

There does seem to be a relationship between the two: roughly speaking, a higher TV budget is linked to higher sales, although the correspondence is not perfect. To express the relationship more quantitatively, we seek a function \(f\) such that

\[ y \approx f(\mathbf{x}) \]

where \(\mathbf{x}\) denotes a sample TV budget and \(y\) is the corresponding observed sales. We might posit, for instance, that there exists a true \(f\) and that each observation is perturbed by some noise \(\varepsilon\)

\[ y = f(\mathbf{x}) + \varepsilon. \]
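
To make this model concrete, here is a small simulation. It is purely illustrative: the "true" \(f\) and the noise level below are made up rather than estimated from the data (and we assume numpy and matplotlib are imported as np and plt, as in the cells above).

# Illustration only: simulate y = f(x) + noise for a made-up "true" f
rng = np.random.default_rng(0)                   # reproducible random numbers
x_sim = np.sort(rng.uniform(0, 300, size=100))   # hypothetical budgets
f_sim = 7 + 0.05 * x_sim                         # a made-up "true" f
y_sim = f_sim + rng.normal(scale=2, size=100)    # add the noise epsilon

plt.scatter(x_sim, y_sim, alpha=0.5)             # noisy observations
plt.plot(x_sim, f_sim, 'r')                      # underlying trend
plt.show()

The simulated scatter mimics the pattern in the real data: a systematic trend obscured by noise.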

A natural way to estimate such an \(f\) from data is \(k\)-nearest-neighbors (\(k\)-NN) regression. Let the data be of the form \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\), where in our case \(\mathbf{x}_i\) is the TV budget of the \(i\)-th sample and \(y_i\) is the corresponding sales. For each \(\mathbf{x}\) (not necessarily in the data), we do the following:

  • find the \(k\) nearest \(\mathbf{x}_i\)’s to \(\mathbf{x}\)

  • take an average of the corresponding \(y_i\)’s.

We implement this method in Python. We use the function numpy.argsort to sort an array and the function numpy.absolute to compute the absolute deviation. Our quick implementation here assumes that the \(\mathbf{x}_i\)’s are scalars.

def knnregression(x, y, k, xnew):
    # distances from the query point xnew to each data point
    # (this quick implementation assumes the x_i's are scalars)
    dist = np.absolute(x - xnew)
    # indices of the k nearest neighbors
    closest = np.argsort(dist)[:k]
    # predict with the average of the corresponding y-values
    return np.mean(y[closest])

For \(k=3\) and a grid of \(1000\) points, we get the following approximation \(\hat{f}\). Here the function numpy.linspace creates an array of equally spaced points.

k = 3
xgrid = np.linspace(TV.min(), TV.max(), num=1000)
yhat = [knnregression(TV,sales,k,xnew) for xnew in xgrid]
plt.scatter(TV, sales, alpha=0.5)
plt.xlabel('TV')
plt.ylabel('sales')
plt.plot(xgrid, yhat, 'r')
plt.show()
Figure: \(k\)-NN fit with \(k=3\) (red curve) overlaid on the scatter plot of sales against TV budget.

A higher \(k\) gives a smoother fit.

k = 10
xgrid = np.linspace(TV.min(), TV.max(), num=1000)
yhat = [knnregression(TV,sales,k,xnew) for xnew in xgrid]
plt.scatter(TV, sales, alpha=0.5)
plt.xlabel('TV')
plt.ylabel('sales')
plt.plot(xgrid, yhat, 'r')
plt.show()
Figure: \(k\)-NN fit with \(k=10\) (red curve) overlaid on the scatter plot of sales against TV budget.

One downside of \(k\)-NN regression is that it does not give an easily interpretable relationship: if I increase my TV budget by \(\Delta\) dollars, how is it expected to affect the sales? Another issue arises in high dimension where the counter-intuitive phenomena we discussed in a previous section can have a significant impact. Recall in particular the High-dimensional Cube Theorem. If we have \(d\) predictors – where \(d\) is large – and our data is distributed uniformly in a bounded region, then any given \(\mathbf{x}\) will be far from any of our data points. In that case, the \(y\)-values of the closest \(\mathbf{x}_i\)’s may not be predictive. This is referred to as the Curse of Dimensionality.
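
To get a feel for this, here is a small simulation sketch (the sample size and dimensions below are arbitrary choices): we draw \(n\) points uniformly from the cube \([0,1]^d\) and look at the distance from a new query point to its nearest neighbor as \(d\) grows.

# Illustration: nearest-neighbor distances grow with the dimension d
rng = np.random.default_rng(0)
n = 200                                   # same order as our dataset

for d in [1, 2, 10, 100]:
    X = rng.uniform(size=(n, d))          # n points uniform in [0,1]^d
    xquery = rng.uniform(size=d)          # a new query point
    dist = np.linalg.norm(X - xquery, axis=1)
    print(f"d = {d:3d}, nearest-neighbor distance = {dist.min():.3f}")

Even with the same number of samples, the nearest data point drifts farther away as the dimension increases, so averaging its \(y\)-value becomes less and less informative.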

One way out is to make stronger assumptions on the function \(f\). For instance, we can assume that the true relationship is (approximately) affine, that is,

\[ y \approx \beta_0 + \beta_1 x. \]

Or if we have \(d\) predictors:

\[ y \approx \beta_0 + \sum_{j=1}^d \beta_j x_j. \]

How do we estimate an appropriate intercept and coefficients? The standard approach is to minimize the sum of the squared errors

\[ \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j (\mathbf{x}_{i})_j\right\}\right)^2, \]

where \((\mathbf{x}_{i})_j\) is the \(j\)-th entry of input vector \(\mathbf{x}_i\) and \(y_i\) is the corresponding \(y\)-value. This is called multiple linear regression.
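
As a sanity check on the notation, here is a direct implementation of this objective on the Advertising data, using all three media budgets as predictors. It is only a sketch, and the candidate coefficients plugged in at the end are arbitrary.

# predictors as an n x d matrix and outcomes as a vector
X = df[['TV', 'radio', 'newspaper']].to_numpy()
y = df['sales'].to_numpy()

def sum_squared_errors(beta0, beta, X, y):
    # sum_i ( y_i - ( beta0 + sum_j beta_j (x_i)_j ) )^2
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2)

# evaluate the objective at arbitrary candidate coefficients
print(sum_squared_errors(3., np.array([0.04, 0.2, 0.]), X, y))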

It is a least-squares problem. We re-write it in a more convenient matrix form and combine \(\beta_0\) with the other \(\beta_j\)’s by adding a dummy predictor to each sample. Let

\[\begin{split} \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\quad A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix} \quad\text{and}\quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}. \end{split}\]

Then observe that

\[\begin{align*} \|\mathbf{y} - A \boldsymbol{\beta}\|^2 &= \sum_{i=1}^n \left(y_i - (A \boldsymbol{\beta})_i\right)^2\\ &= \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j (\mathbf{x}_{i})_j\right\}\right)^2. \end{align*}\]
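
We can confirm this identity numerically, continuing from the previous cell (again just a sketch): build \(A\) by prepending a column of ones, pick an arbitrary \(\boldsymbol{\beta}\), and compare the two expressions.

A = np.column_stack((np.ones(len(y)), X))   # prepend the dummy predictor
beta = np.array([3., 0.04, 0.2, 0.])        # arbitrary (beta_0, ..., beta_d)

matrix_form = np.linalg.norm(y - A @ beta) ** 2
explicit_form = sum_squared_errors(beta[0], beta[1:], X, y)
print(matrix_form, explicit_form)           # the two values agree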

The linear least-squares problem is then formulated as

\[ \min_{\boldsymbol{\beta}} \|\mathbf{y} - A \boldsymbol{\beta}\|^2. \]

In words, we are looking for a linear combination of the columns of \(A\) that is closest to \(\mathbf{y}\) in Euclidean distance. Indeed, minimizing the squared Euclidean distance is equivalent to minimizing its square root, as the latter is an increasing function.
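
Here is a quick numerical preview of the solution on the Advertising data, obtained with numpy.linalg.lstsq used purely as a black box; how such a solution is actually computed, and why it works, is the subject of what follows.

# least-squares fit: minimize ||y - A beta||^2 over beta
betahat, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(betahat)   # intercept followed by the TV, radio, newspaper coefficients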

One could solve this optimization problem through calculus (and we will come back to this approach later in the course), but understanding the geometric and algebraic structure of the problem turns out to provide powerful insights into its solution – and that of many other problems in data science. It will also be an opportunity to review some basic linear-algebraic concepts along the way.

We will come back to the Advertising dataset later in the chapter.