2.3. The geometry of least squares: the orthogonal projection#
We consider the following problem: we are given an \(n\times m\) matrix \(A \in \mathbb{R}^{n\times m}\) and a vector \(\mathbf{b} \in \mathbb{R}^n\). We are looking to solve the system

$$
A \mathbf{x} = \mathbf{b}.
$$
In the special case where \(A\) is invertible, a unique exact solution exists. Indeed:
DEFINITION (Nonsingular Matrix) A square matrix \(A \in \mathbb{R}^{n \times n}\) is nonsingular if it has rank \(n\). In that case, we also say that \(A\) is of full rank. \(\natural\)
THEOREM (Inverting a Nonsingular System) Let \(A \in \mathbb{R}^{n \times n}\) be a nonsingular square matrix. Then for any \(\mathbf{b} \in \mathbb{R}^n\), there exists a unique \(\mathbf{x} \in \mathbb{R}^n\) such that \(A \mathbf{x} = \mathbf{b}\). Moreover \(\mathbf{x} = A^{-1} \mathbf{b}\). \(\sharp\)
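As a quick numerical illustration of the theorem, here is a sketch in NumPy with a made-up \(2 \times 2\) nonsingular system (the matrix and vector below are assumptions of this example, not from the text):

```python
import numpy as np

# A hypothetical nonsingular 2x2 system: A has rank 2, so A x = b
# has the unique solution x = A^{-1} b.
A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([1., 2.])

x = np.linalg.solve(A, b)         # solves A x = b directly
x_via_inv = np.linalg.inv(A) @ b  # same answer via the inverse

assert np.allclose(x, x_via_inv)
assert np.allclose(A @ x, b)      # x indeed solves the system
print(x)
```

In practice one calls `solve` rather than forming \(A^{-1}\) explicitly, which is both cheaper and numerically more stable.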
In general, however, a solution may not exist or may not be unique. We focus here on the over-determined case where the former situation generically occurs.
2.3.1. Orthogonal projection#
To solve the overdetermined case, i.e., when \(n > m\), we consider the following more general problem first. We have a linear subspace \(U \subseteq \mathbb{R}^n\) and a vector \(\mathbf{v} \notin U\). We want to find the vector \(\mathbf{p}\) in \(U\) that is closest to \(\mathbf{v}\) in Euclidean norm, that is, we want to solve

$$
\min_{\mathbf{p} \in U} \|\mathbf{v} - \mathbf{p}\|.
$$
EXAMPLE: Consider the two-dimensional case with a one-dimensional subspace, say \(U = \mathrm{span}(\mathbf{u}_1)\) with \(\|\mathbf{u}_1\|=1\). The geometrical intuition is illustrated in the following figure. The solution \(\mathbf{p} = \mathbf{v}^*\) has the property that the difference \(\mathbf{v} - \mathbf{v}^*\) makes a right angle with \(\mathbf{u}_1\), that is, it is orthogonal to it.
Figure: Orthogonal projection on a line (Source)
\(\bowtie\)
Letting \(\mathbf{v}^* = \alpha^* \,\mathbf{u}_1\), the geometrical condition above translates into

$$
\langle \mathbf{v} - \alpha^* \mathbf{u}_1, \mathbf{u}_1 \rangle = 0,
$$

so, using \(\|\mathbf{u}_1\| = 1\),

$$
\alpha^* = \langle \mathbf{v}, \mathbf{u}_1 \rangle.
$$
By Pythagoras, we then have for any \(\alpha \in \mathbb{R}\)

$$
\begin{align*}
\|\mathbf{v} - \alpha\, \mathbf{u}_1\|^2
&= \|(\mathbf{v} - \mathbf{v}^*) + (\alpha^* - \alpha)\, \mathbf{u}_1\|^2\\
&= \|\mathbf{v} - \mathbf{v}^*\|^2 + 2\,\langle \mathbf{v} - \mathbf{v}^*, (\alpha^* - \alpha)\, \mathbf{u}_1 \rangle + \|(\alpha^* - \alpha)\, \mathbf{u}_1\|^2\\
&= \|\mathbf{v} - \mathbf{v}^*\|^2 + (\alpha^* - \alpha)^2\\
&\geq \|\mathbf{v} - \mathbf{v}^*\|^2,
\end{align*}
$$
where we used that \(\mathbf{v} - \mathbf{v}^*\) is orthogonal to \(\mathbf{u}_1\) (and therefore \((\alpha^* - \alpha) \mathbf{u}_1\)) on the third line.
That confirms the optimality of \(\mathbf{v}^*\). The argument in this example carries through in higher dimension, as we show next. \(\lhd\)
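This one-dimensional picture is easy to check numerically. The sketch below uses a made-up unit vector \(\mathbf{u}_1\) and vector \(\mathbf{v}\) (both assumptions of this illustration):

```python
import numpy as np

# Made-up data: a unit vector u1 and a vector v not on the line.
u1 = np.array([3., 4.]) / 5.     # ||u1|| = 1
v = np.array([2., 1.])

alpha_star = np.dot(v, u1)       # alpha* = <v, u1>
v_star = alpha_star * u1         # projection of v on span(u1)

# The residual v - v* is orthogonal to u1 ...
assert abs(np.dot(v - v_star, u1)) < 1e-12
# ... and no other point on the line is closer to v.
for alpha in np.linspace(-5., 5., 101):
    assert np.linalg.norm(v - alpha * u1) >= np.linalg.norm(v - v_star) - 1e-12
print(alpha_star, v_star)
```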
DEFINITION (Orthogonal Projection on an Orthonormal List) Let \(\mathbf{q}_1,\ldots,\mathbf{q}_m\) be an orthonormal list. The orthogonal projection of \(\mathbf{v} \in \mathbb{R}^n\) on \(\{\mathbf{q}_i\}_{i=1}^m\) is defined as

$$
\mathrm{proj}_{\{\mathbf{q}_i\}_{i=1}^m} \mathbf{v} = \sum_{i=1}^m \langle \mathbf{v}, \mathbf{q}_i \rangle\, \mathbf{q}_i.
$$
\(\natural\)
THEOREM (Orthogonal Projection) Let \(U \subseteq \mathbb{R}^n\) be a linear subspace and let \(\mathbf{v} \in \mathbb{R}^n\). Then:
a) There exists a unique solution \(\mathbf{p}^*\) to

$$
\min_{\mathbf{p} \in U} \|\mathbf{v} - \mathbf{p}\|.
$$
We denote it by \(\mathbf{p}^* = \mathrm{proj}_U \mathbf{v}\) and refer to it as the orthogonal projection of \(\mathbf{v}\) onto \(U\).
b) The solution \(\mathbf{p}^*\) is characterized geometrically by

$$
\langle \mathbf{v} - \mathbf{p}^*, \mathbf{u} \rangle = 0, \quad \forall \mathbf{u} \in U. \tag{$*$}
$$
c) For any orthonormal basis \(\mathbf{q}_1,\ldots,\mathbf{q}_m\) of \(U\),

$$
\mathbf{p}^* = \sum_{i=1}^m \langle \mathbf{v}, \mathbf{q}_i \rangle\, \mathbf{q}_i.
$$
\(\sharp\)
Proof: Let \(\mathbf{p}^*\) be any vector in \(U\) satisfying \((*)\). We show first that it necessarily satisfies

$$
\|\mathbf{v} - \mathbf{p}^*\| \leq \|\mathbf{v} - \mathbf{p}\|, \quad \forall \mathbf{p} \in U.
$$
Note that for any \(\mathbf{p} \in U\) the vector \(\mathbf{u} = \mathbf{p} - \mathbf{p}^*\) is also in \(U\). Hence by \((*)\) and Pythagoras,

$$
\|\mathbf{v} - \mathbf{p}\|^2
= \|(\mathbf{v} - \mathbf{p}^*) - \mathbf{u}\|^2
= \|\mathbf{v} - \mathbf{p}^*\|^2 + \|\mathbf{u}\|^2
\geq \|\mathbf{v} - \mathbf{p}^*\|^2.
$$
Furthermore, equality holds only if \(\|\mathbf{p} - \mathbf{p}^*\|^2 = 0\) which holds only if \(\mathbf{p} = \mathbf{p}^*\) by the point-separating property of the Euclidean norm. Hence, if such a vector \(\mathbf{p}^*\) exists, it is unique.
It remains to show that there is at least one vector in \(U\) satisfying \((*)\). By Gram-Schmidt, the linear subspace \(U\) has an orthonormal basis \(\mathbf{q}_1,\ldots,\mathbf{q}_m\). By definition, \(\mathrm{proj}_{\{\mathbf{q}_i\}_{i=1}^m} \mathbf{v} \in \mathrm{span}(\{\mathbf{q}_i\}_{i=1}^m) = U\). We show that \(\mathrm{proj}_{\{\mathbf{q}_i\}_{i=1}^m} \mathbf{v}\) satisfies \((*)\). We can write any \(\mathbf{u} \in U\) as \(\sum_{i=1}^m \alpha_i \mathbf{q}_i\) with \(\alpha_i = \langle \mathbf{u}, \mathbf{q}_i \rangle\). So, using this representation, we get

$$
\begin{align*}
\left\langle \mathbf{v} - \mathrm{proj}_{\{\mathbf{q}_i\}_{i=1}^m} \mathbf{v},\, \mathbf{u} \right\rangle
&= \sum_{i=1}^m \alpha_i \langle \mathbf{v}, \mathbf{q}_i \rangle
- \sum_{i=1}^m \sum_{j=1}^m \alpha_i \langle \mathbf{v}, \mathbf{q}_j \rangle \langle \mathbf{q}_j, \mathbf{q}_i \rangle\\
&= \sum_{i=1}^m \alpha_i \langle \mathbf{v}, \mathbf{q}_i \rangle
- \sum_{i=1}^m \alpha_i \langle \mathbf{v}, \mathbf{q}_i \rangle = 0,
\end{align*}
$$
where we used the orthonormality of the \(\mathbf{q}_j\)’s on the second line. \(\square\)
EXAMPLE: (continued) Consider again the linear subspace \(W = \mathrm{span}(\mathbf{w}_1,\mathbf{w}_2,\mathbf{w}_3)\), where \(\mathbf{w}_1 = (1,0,1)^T\), \(\mathbf{w}_2 = (0,1,1)^T\), and \(\mathbf{w}_3 = (1,-1,0)^T\). We have shown that

$$
\mathbf{q}_1 = \frac{1}{\sqrt{2}} (1,0,1)^T, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{6}} (-1,2,1)^T
$$
is an orthonormal basis. Let \(\mathbf{w}_4 = (0,0,2)^T\). It is immediate that \(\mathbf{w}_4 \notin \mathrm{span}(\mathbf{w}_1,\mathbf{w}_2)\) since vectors in that span are of the form \((x,y,x+y)^T\) for some \(x,y \in \mathbb{R}\).
We can however compute the orthogonal projection of \(\mathbf{w}_4\) onto \(W\). The inner products are

$$
\langle \mathbf{w}_4, \mathbf{q}_1 \rangle = \frac{2}{\sqrt{2}},
\qquad
\langle \mathbf{w}_4, \mathbf{q}_2 \rangle = \frac{2}{\sqrt{6}}.
$$
So

$$
\mathrm{proj}_W \mathbf{w}_4
= \frac{2}{\sqrt{2}}\, \mathbf{q}_1 + \frac{2}{\sqrt{6}}\, \mathbf{q}_2
= \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} + \begin{bmatrix} -1/3 \\ 2/3 \\ 1/3 \end{bmatrix}
= \begin{bmatrix} 2/3 \\ 2/3 \\ 4/3 \end{bmatrix}.
$$
As a sanity check, note that \(\mathbf{w}_4 \in W\) since its third entry is equal to the sum of its first two entries. \(\lhd\)
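This computation can also be checked numerically. The sketch below rebuilds an orthonormal basis of \(W\) by Gram-Schmidt from \(\mathbf{w}_1, \mathbf{w}_2\) (recall that \(\mathbf{w}_3 = \mathbf{w}_1 - \mathbf{w}_2\) is redundant) and applies the projection formula:

```python
import numpy as np

w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
w4 = np.array([0., 0., 2.])

# Gram-Schmidt on w1, w2 gives an orthonormal basis q1, q2 of W.
q1 = w1 / np.linalg.norm(w1)
u = w2 - np.dot(w2, q1) * q1      # remove the q1 component of w2
q2 = u / np.linalg.norm(u)

# proj_W w4 = <w4, q1> q1 + <w4, q2> q2
proj = np.dot(w4, q1) * q1 + np.dot(w4, q2) * q2
print(proj)                       # approximately [2/3, 2/3, 4/3]
```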
The map \(\mathrm{proj}_U\) is linear, that is, \(\mathrm{proj}_U (\alpha \,\mathbf{x} + \mathbf{y}) = \alpha \,\mathrm{proj}_U \mathbf{x} + \mathrm{proj}_U\mathbf{y}\) for all \(\alpha \in \mathbb{R}\) and \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\). Indeed, for an orthonormal basis \(\mathbf{q}_1,\ldots,\mathbf{q}_m\) of \(U\),

$$
\mathrm{proj}_U (\alpha\, \mathbf{x} + \mathbf{y})
= \sum_{i=1}^m \langle \alpha\, \mathbf{x} + \mathbf{y}, \mathbf{q}_i \rangle\, \mathbf{q}_i
= \alpha \sum_{i=1}^m \langle \mathbf{x}, \mathbf{q}_i \rangle\, \mathbf{q}_i
+ \sum_{i=1}^m \langle \mathbf{y}, \mathbf{q}_i \rangle\, \mathbf{q}_i
= \alpha\, \mathrm{proj}_U \mathbf{x} + \mathrm{proj}_U \mathbf{y}.
$$
Any linear map from \(\mathbb{R}^n\) to \(\mathbb{R}^n\) can be encoded as an \(n \times n\) matrix \(P\).
Let

$$
Q = \begin{bmatrix} \mathbf{q}_1 & \cdots & \mathbf{q}_m \end{bmatrix}
$$

be the \(n \times m\) matrix whose columns are the basis vectors, and note that computing

$$
Q^T \mathbf{v} = \begin{bmatrix} \langle \mathbf{v}, \mathbf{q}_1 \rangle \\ \vdots \\ \langle \mathbf{v}, \mathbf{q}_m \rangle \end{bmatrix}
$$

lists the coefficients in the expansion of \(\mathrm{proj}_U \mathbf{v}\) over the basis \(\mathbf{q}_1,\ldots,\mathbf{q}_m\).
Hence we see that

$$
P = Q Q^T.
$$

Indeed, for any vector \(\mathbf{v}\),

$$
Q Q^T \mathbf{v}
= Q \begin{bmatrix} \langle \mathbf{v}, \mathbf{q}_1 \rangle \\ \vdots \\ \langle \mathbf{v}, \mathbf{q}_m \rangle \end{bmatrix}
= \sum_{i=1}^m \langle \mathbf{v}, \mathbf{q}_i \rangle\, \mathbf{q}_i
= \mathrm{proj}_U \mathbf{v}.
$$
So the output is a linear combination of the columns of \(Q\) (i.e., the \(\mathbf{q}_i\)’s) where the coefficients are the entries of the vector in square brackets \(Q^T \mathbf{v}\).
EXAMPLE: (continued) Consider again the linear subspace \(W = \mathrm{span}(\mathbf{w}_1,\mathbf{w}_2,\mathbf{w}_3)\), where \(\mathbf{w}_1 = (1,0,1)^T\), \(\mathbf{w}_2 = (0,1,1)^T\), and \(\mathbf{w}_3 = (1,-1,0)^T\), with orthonormal basis

$$
\mathbf{q}_1 = \frac{1}{\sqrt{2}} (1,0,1)^T, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{6}} (-1,2,1)^T.
$$
Then orthogonal projection onto \(W\) can be written in matrix form as follows. The matrix \(Q\) is

$$
Q = \begin{bmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{6}} \\ 0 & \frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \end{bmatrix}.
$$

Then

$$
P = Q Q^T = \frac{1}{3} \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}.
$$
So the projection of \(\mathbf{w}_4 = (0,0,2)^T\) is

$$
P \mathbf{w}_4
= \frac{1}{3} \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix}
= \begin{bmatrix} 2/3 \\ 2/3 \\ 4/3 \end{bmatrix},
$$
as previously computed. \(\lhd\)
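The same check can be carried out in matrix form: stack the orthonormal basis vectors (again obtained via Gram-Schmidt in this sketch) as the columns of \(Q\) and form \(P = Q Q^T\):

```python
import numpy as np

w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])

# Orthonormal basis of W via Gram-Schmidt.
q1 = w1 / np.linalg.norm(w1)
u = w2 - np.dot(w2, q1) * q1
q2 = u / np.linalg.norm(u)

Q = np.stack((q1, q2), axis=-1)  # 3 x 2 matrix with columns q1, q2
P = Q @ Q.T                      # 3 x 3 projection matrix onto W

w4 = np.array([0., 0., 2.])
print(P @ w4)                    # approximately [2/3, 2/3, 4/3]
assert np.allclose(P @ P, P)     # projecting twice changes nothing
```

The last assertion checks idempotence, \(P^2 = P\), which follows from \(Q^T Q = I_{2\times 2}\).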
The matrix \(P = Q Q^T\) is not to be confused with

$$
Q^T Q = I_{m \times m},
$$
where \(I_{m \times m}\) denotes the \(m \times m\) identity matrix.
EXAMPLE: Let \(\mathbf{q}_1,\ldots,\mathbf{q}_n\) be an orthonormal basis of \(\mathbb{R}^n\) and form the matrix

$$
Q = \begin{bmatrix} \mathbf{q}_1 & \cdots & \mathbf{q}_n \end{bmatrix}.
$$
We show that \(Q^{-1} = Q^T\).
By the definition of matrix multiplication, recall that

$$
Q^T Q = I_{n \times n},
$$
where \(I_{n \times n}\) denotes the \(n \times n\) identity matrix. This of course follows from the fact that the \(\mathbf{q}_i\)’s are orthonormal.
In the other direction, we claim that \(Q Q^T = I_{n \times n}\) as well. Indeed the matrix \(Q Q^T\) is the orthogonal projection on the span of the \(\mathbf{q}_i\)’s, that is, \(\mathbb{R}^n\). By the Orthogonal Projection Theorem, the orthogonal projection \(Q Q^T \mathbf{v}\) finds the closest vector to \(\mathbf{v}\) in the span of the \(\mathbf{q}_i\)’s. But that span contains all vectors, including \(\mathbf{v}\), so we must have \(Q Q^T \mathbf{v} = \mathbf{v}\). Since this holds for all \(\mathbf{v} \in \mathbb{R}^n\), the matrix \(Q Q^T\) is the identity map and we have proved the claim. \(\lhd\)
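These identities are easy to verify numerically. The sketch below builds a matrix with orthonormal columns as the \(Q\) factor of a QR factorization of a random square matrix (an implementation choice of this illustration, not part of the text):

```python
import numpy as np

# The Q factor of the QR factorization of a (generically full-rank)
# random square matrix has orthonormal columns.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

assert np.allclose(Q.T @ Q, np.eye(4))     # columns are orthonormal
assert np.allclose(Q @ Q.T, np.eye(4))     # Q Q^T is the identity too
assert np.allclose(np.linalg.inv(Q), Q.T)  # hence Q^{-1} = Q^T
print("Q is orthogonal")
```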
Matrices that satisfy

$$
Q^T Q = Q Q^T = I
$$
are called orthogonal matrices.
DEFINITION (Orthogonal Matrix) A square matrix \(Q \in \mathbb{R}^{m\times m}\) is orthogonal if \(Q^T Q = Q Q^T = I_{m \times m}\). \(\natural\)
THINK-PAIR-SHARE: Let \(\mathcal{Z}\) be a linear subspace of \(\mathbb{R}^n\) and let \(\mathbf{v} \in \mathbb{R}^n\). Show that
[Hint: Use the geometric characterization.] \(\ddagger\)
2.3.2. Orthogonal complement#
Before returning to overdetermined systems, we take a little detour to derive a consequence of the orthogonal projection that will be useful later. The Orthogonal Projection Theorem implies that any \(\mathbf{v} \in \mathbb{R}^n\) can be decomposed into its orthogonal projection onto \(U\) and a vector orthogonal to it.
DEFINITION (Orthogonal Complement) Let \(U \subseteq \mathbb{R}^n\) be a linear subspace. The orthogonal complement of \(U\), denoted \(U^\perp\), is defined as

$$
U^\perp = \{ \mathbf{w} \in \mathbb{R}^n \,:\, \langle \mathbf{w}, \mathbf{u} \rangle = 0,\ \forall \mathbf{u} \in U \}.
$$
\(\natural\)
EXAMPLE: Continuing a previous example, we compute the orthogonal complement of the linear subspace \(W = \mathrm{span}(\mathbf{w}_1,\mathbf{w}_2,\mathbf{w}_3)\), where \(\mathbf{w}_1 = (1,0,1)^T\), \(\mathbf{w}_2 = (0,1,1)^T\), and \(\mathbf{w}_3 = (1,-1,0)^T\). One way to proceed is to find all vectors that are orthogonal to the orthonormal basis

$$
\mathbf{q}_1 = \frac{1}{\sqrt{2}} (1,0,1)^T, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{6}} (-1,2,1)^T.
$$
We require, writing \(\mathbf{u} = (u_1, u_2, u_3)^T\),

$$
\langle \mathbf{u}, \mathbf{q}_1 \rangle = \frac{u_1 + u_3}{\sqrt{2}} = 0
\qquad\text{and}\qquad
\langle \mathbf{u}, \mathbf{q}_2 \rangle = \frac{-u_1 + 2 u_2 + u_3}{\sqrt{6}} = 0.
$$
The first equation implies \(u_3 = -u_1\), which after replacing into the second equation and rearranging gives \(u_2 = u_1\).
So all vectors of the form \((u_1,u_1,-u_1)^T\) for some \(u_1 \in \mathbb{R}\) are orthogonal to all of \(W\). This is a one-dimensional linear subspace. We can choose an orthonormal basis by finding a solution to

$$
u_1^2 + u_1^2 + (-u_1)^2 = 3 u_1^2 = 1.
$$
Take \(u_1 = 1/\sqrt{3}\), that is, let

$$
\mathbf{q}_3 = \frac{1}{\sqrt{3}} (1,1,-1)^T.
$$
Then we have

$$
W^\perp = \mathrm{span}(\mathbf{q}_3).
$$
\(\lhd\)
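A quick numerical check that the unit vector found in this example is indeed orthogonal to all of \(W\):

```python
import numpy as np

w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
w3 = w1 - w2                    # the redundant third spanning vector

# Candidate unit vector spanning the orthogonal complement of W.
q3 = np.array([1., 1., -1.]) / np.sqrt(3.)

assert np.isclose(np.linalg.norm(q3), 1.)  # unit norm
for w in (w1, w2, w3):
    assert abs(np.dot(q3, w)) < 1e-12      # orthogonal to all of W
print("q3 spans the orthogonal complement of W")
```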
LEMMA (Orthogonal Decomposition) Let \(U \subseteq \mathbb{R}^n\) be a linear subspace and let \(\mathbf{v} \in \mathbb{R}^n\). Then \(\mathbf{v}\) can be decomposed as \(\mathrm{proj}_U \mathbf{v} + (\mathbf{v} - \mathrm{proj}_U\mathbf{v})\) where \(\mathrm{proj}_U \mathbf{v} \in U\) and \((\mathbf{v} - \mathrm{proj}_U \mathbf{v}) \in U^\perp\). Moreover, this decomposition is unique in the following sense: if \(\mathbf{v} = \mathbf{u}_1 + \mathbf{u}_2\) with \(\mathbf{u}_1 \in U\) and \(\mathbf{u}_2 \in U^\perp\), then \(\mathbf{u}_1 = \mathrm{proj}_U \mathbf{v}\) and \(\mathbf{u}_2 = \mathbf{v} - \mathrm{proj}_U \mathbf{v}\). \(\flat\)
Proof: The first part is an immediate consequence of the Orthogonal Projection Theorem. For the second part, assume \(\mathbf{v} = \mathbf{u}_1 + \mathbf{u}_2\) with \(\mathbf{u}_1 \in U\) and \(\mathbf{u}_2 \in U^\perp\). Subtracting \(\mathbf{v} = \mathrm{proj}_U \mathbf{v} + (\mathbf{v} - \mathrm{proj}_U\mathbf{v})\), we see that

$$
\mathbf{w}_1 + \mathbf{w}_2 = \mathbf{0} \tag{$*$}
$$

with

$$
\mathbf{w}_1 = \mathbf{u}_1 - \mathrm{proj}_U \mathbf{v} \in U,
\qquad
\mathbf{w}_2 = \mathbf{u}_2 - (\mathbf{v} - \mathrm{proj}_U \mathbf{v}) \in U^\perp.
$$
If \(\mathbf{w}_1 = \mathbf{w}_2 = \mathbf{0}\), we are done. Otherwise, they must both be nonzero by \((*)\). Further, by the Properties of Orthonormal Lists, \(\mathbf{w}_1\) and \(\mathbf{w}_2\) must be linearly independent. But this is contradicted by the fact that \(\mathbf{w}_2 = - \mathbf{w}_1\) by \((*)\). \(\square\)
Figure: Orthogonal decomposition (Source)
\(\bowtie\)
Formally, the Orthogonal Decomposition Lemma states that \(\mathbb{R}^n\) is a direct sum of any linear subspace \(U\) and of its orthogonal complement \(U^\perp\): that is, any vector \(\mathbf{v} \in \mathbb{R}^n\) can be written uniquely as \(\mathbf{v} = \mathbf{u}_1 + \mathbf{u}_2\) with \(\mathbf{u}_1 \in U\) and \(\mathbf{u}_2 \in U^\perp\). This is denoted \(\mathbb{R}^n = U \oplus U^\perp\).
Let \(\mathbf{a}_1,\ldots,\mathbf{a}_\ell\) be an orthonormal basis of \(U\) and \(\mathbf{b}_1,\ldots,\mathbf{b}_k\) be an orthonormal basis of \(U^\perp\). By definition of the orthogonal complement, the list

$$
\mathcal{L} = \{ \mathbf{a}_1, \ldots, \mathbf{a}_\ell, \mathbf{b}_1, \ldots, \mathbf{b}_k \}
$$
is orthonormal, so it forms a basis of its span. Because any vector in \(\mathbb{R}^n\) can be written as a sum of a vector from \(U\) and a vector from \(U^\perp\), all of \(\mathbb{R}^n\) is in the span of \(\mathcal{L}\). It follows from the Dimension Theorem that \(n = \ell + k\), that is,

$$
\mathrm{dim}(U) + \mathrm{dim}(U^\perp) = n.
$$
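The orthogonal decomposition, and this dimension count, can be illustrated numerically for the running subspace \(W \subseteq \mathbb{R}^3\). The sketch below uses the QR factorization to get an orthonormal basis (an implementation choice of this sketch) and a made-up test vector \(\mathbf{v}\):

```python
import numpy as np

w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
Q = np.linalg.qr(np.stack((w1, w2), axis=-1))[0]  # orthonormal basis of W
P = Q @ Q.T                                       # projection onto W

v = np.array([3., 1., -2.])
u1 = P @ v          # component of v in W
u2 = v - u1         # component of v in W^perp

assert np.allclose(u1 + u2, v)    # v = u1 + u2
assert np.allclose(Q.T @ u2, 0.)  # u2 is orthogonal to W
# Here dim W + dim W^perp = 2 + 1 = 3 = n.
print(u1, u2)
```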
2.3.3. Overdetermined systems#
In this section, we discuss the least-squares problem. Let again \(A \in \mathbb{R}^{n\times m}\) be an \(n\times m\) matrix with linearly independent columns and let \(\mathbf{b} \in \mathbb{R}^n\) be a vector. We are looking to solve the system

$$
A \mathbf{x} = \mathbf{b}.
$$
If \(n=m\), we can use the matrix inverse to solve the system, as discussed in the previous subsection. But we are particularly interested in the overdetermined case, i.e. when \(n > m\): there are more equations than variables. We cannot use the matrix inverse then. Indeed, because the columns do not span all of \(\mathbb{R}^n\), there is a vector \(\mathbf{b} \in \mathbb{R}^n\) that is not in the column space of \(A\).
A natural way to make sense of the overdetermined problem is to cast it as the linear least squares problem

$$
\min_{\mathbf{x} \in \mathbb{R}^m} \|A \mathbf{x} - \mathbf{b}\|.
$$
In words, we look for the best-fitting solution under the Euclidean norm. Equivalently, writing

$$
A = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_m \end{bmatrix}
\qquad\text{and}\qquad
\mathbf{x} = (x_1, \ldots, x_m)^T,
$$

we seek a linear combination of the columns of \(A\) that minimizes the objective

$$
\left\| \sum_{j=1}^m x_j\, \mathbf{a}_j - \mathbf{b} \right\|.
$$
We have already solved a closely related problem when we introduced the orthogonal projection. We make the connection explicit next.
THEOREM (Normal Equations) Let \(A \in \mathbb{R}^{n\times m}\) be an \(n\times m\) matrix with \(n \geq m\) and let \(\mathbf{b} \in \mathbb{R}^n\) be a vector. A solution \(\mathbf{x}^*\) to the linear least squares problem

$$
\min_{\mathbf{x} \in \mathbb{R}^m} \|A \mathbf{x} - \mathbf{b}\|
$$

satisfies the normal equations

$$
A^T A\, \mathbf{x}^* = A^T \mathbf{b}.
$$
If further the columns of \(A\) are linearly independent, then there exists a unique solution \(\mathbf{x}^*\). \(\sharp\)
Proof idea: Apply our characterization of the orthogonal projection onto the column space of \(A\).
Proof: Let \(U = \mathrm{col}(A) = \mathrm{span}(\mathbf{a}_1,\ldots,\mathbf{a}_m)\). By the Orthogonal Projection Theorem, the orthogonal projection \(\mathbf{p}^* = \mathrm{proj}_{U} \mathbf{b}\) of \(\mathbf{b}\) onto \(U\) is the unique closest vector to \(\mathbf{b}\) in \(U\), that is,

$$
\|\mathbf{b} - \mathbf{p}^*\| \leq \|\mathbf{b} - \mathbf{p}\|, \qquad \forall \mathbf{p} \in U.
$$
Because \(\mathbf{p}^*\) is in \(U = \mathrm{col}(A)\), it must be of the form \(\mathbf{p}^* = A \mathbf{x}^*\). This establishes that \(\mathbf{x}^*\) is a solution to the linear least squares problem in the statement.
Important observation: while we have shown that \(\mathbf{p}^*\) is unique (by the Orthogonal Projection Theorem), it is not clear at all that \(\mathbf{x}^*\) (i.e., the linear combination of columns of \(A\) corresponding to \(\mathbf{p}^*\)) is unique. By the Orthogonal Projection Theorem, it must satisfy \(\langle \mathbf{b} - A \mathbf{x}^*, \mathbf{u}\rangle = 0\) for all \(\mathbf{u} \in U\). Because the columns \(\mathbf{a}_i\) are in \(U\), that implies that

$$
\langle \mathbf{b} - A \mathbf{x}^*, \mathbf{a}_i \rangle = 0, \qquad i = 1, \ldots, m.
$$
Stacking up these equations gives in matrix form

$$
A^T (\mathbf{b} - A \mathbf{x}^*) = \mathbf{0},
\quad\text{that is,}\quad
A^T A\, \mathbf{x}^* = A^T \mathbf{b},
$$
as claimed.
We have seen in a previous example that, when \(A\) has full column rank, the matrix \(A^T A\) is invertible. That implies the uniqueness claim. \(\square\)
NUMERICAL CORNER: To solve a linear system in NumPy, use numpy.linalg.solve. As an example, we consider the overdetermined system with

$$
A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix},
\qquad
\mathbf{b} = \begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix},
$$

and solve the normal equations \(A^T A \mathbf{x} = A^T \mathbf{b}\). We use numpy.ndarray.T for the transpose.
import numpy as np
from numpy import linalg as LA

w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
A = np.stack((w1, w2), axis=-1)
b = np.array([0., 0., 2.])
x = LA.solve(A.T @ A, A.T @ b)
print(x)
[0.66666667 0.66666667]
We can also use numpy.linalg.lstsq directly on the overdetermined system to compute the least-squares solution.
x = LA.lstsq(A, b, rcond=None)[0]
print(x)
[0.66666667 0.66666667]
\(\unlhd\)