3.2. Background: review of differentiable functions of several variables

We review the differential calculus of several variables. We highlight a few key results that will play an important role: the Chain Rule and the Mean Value Theorem.

3.2.1. Gradient

Recall the definition of the gradient.

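DEFINITION (Gradient) Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}^d$ and let $\mathbf{x}_0 \in D$ be an interior point of $D$. Assume $f$ is continuously differentiable at $\mathbf{x}_0$. The (column) vector

$$
\nabla f(\mathbf{x}_0) = \left(\frac{\partial f(\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x}_0)}{\partial x_d}\right)
$$

is called the gradient of $f$ at $\mathbf{x}_0$.

Note that the gradient is itself a function of $\mathbf{x}$. In fact, unlike $f$, it is a vector-valued function.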

Figure: Gradient as a function (with help from ChatGPT; adapted from (Source))

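EXAMPLE: Consider the affine function

$$
f(\mathbf{x}) = \mathbf{q}^T \mathbf{x} + r
$$

where $\mathbf{x} = (x_1, \ldots, x_d)^T$ and $\mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d$. The partial derivatives of the linear term are given by

$$
\frac{\partial}{\partial x_i}[\mathbf{q}^T \mathbf{x}] = \frac{\partial}{\partial x_i}\left[\sum_{j=1}^d q_j x_j\right] = \frac{\partial}{\partial x_i}[q_i x_i] = q_i.
$$

So the gradient of $f$ is

$$
\nabla f(\mathbf{x}) = \mathbf{q}.
$$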

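EXAMPLE: Consider the quadratic function

$$
f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r,
$$

where $\mathbf{x} = (x_1, \ldots, x_d)^T$, $\mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d$ and $P \in \mathbb{R}^{d \times d}$. The partial derivatives of the quadratic term are given by

$$
\frac{\partial}{\partial x_i}[\mathbf{x}^T P \mathbf{x}] = \frac{\partial}{\partial x_i}\left[\sum_{j,k=1}^d P_{jk} x_j x_k\right] = \frac{\partial}{\partial x_i}\left[P_{ii} x_i^2 + \sum_{j=1, j \neq i}^d P_{ji} x_j x_i + \sum_{k=1, k \neq i}^d P_{ik} x_i x_k\right],
$$

where we used that all terms not including $x_i$ have partial derivative $0$.

This last expression is

$$
= 2 P_{ii} x_i + \sum_{j=1, j \neq i}^d P_{ji} x_j + \sum_{k=1, k \neq i}^d P_{ik} x_k = \sum_{j=1}^d [P^T]_{ij} x_j + \sum_{k=1}^d [P]_{ik} x_k = ([P + P^T]\mathbf{x})_i.
$$

So the gradient of $f$ is

$$
\nabla f(\mathbf{x}) = \frac{1}{2}[P + P^T]\mathbf{x} + \mathbf{q}.
$$

If $P$ is symmetric, this further simplifies to $\nabla f(\mathbf{x}) = P\mathbf{x} + \mathbf{q}$.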
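The gradient formula for the quadratic function can be sanity-checked numerically. The following is a minimal sketch in plain Python; the helper names `quad`, `quad_grad` and `fd_grad`, as well as the specific values of `P`, `q`, `r` and `x`, are illustrative assumptions and not part of the text. It compares the closed-form gradient with a central finite-difference approximation for a deliberately non-symmetric $P$.

```python
def quad(P, q, r, x):
    """Evaluate f(x) = 0.5 * x^T P x + q^T x + r (P, q, x given as lists)."""
    d = len(x)
    quad_term = sum(P[j][k] * x[j] * x[k] for j in range(d) for k in range(d))
    return 0.5 * quad_term + sum(q[i] * x[i] for i in range(d)) + r


def quad_grad(P, q, x):
    """Closed-form gradient 0.5 * (P + P^T) x + q."""
    d = len(x)
    return [
        0.5 * sum((P[i][j] + P[j][i]) * x[j] for j in range(d)) + q[i]
        for i in range(d)
    ]


def fd_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


# Illustrative data: P is non-symmetric, so P x + q would be the wrong gradient.
P = [[2.0, 1.0], [-3.0, 4.0]]
q = [1.0, -1.0]
r = 0.5
x = [0.3, -0.7]

exact = quad_grad(P, q, x)
approx = fd_grad(lambda z: quad(P, q, r, z), x)
print("closed form:       ", exact)
print("finite differences:", approx)
```

The two printed vectors should agree up to rounding error, which illustrates why the symmetrized matrix $\frac{1}{2}[P + P^T]$, rather than $P$ itself, appears in the general formula.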

It will be useful to compute the derivative of a function $f(\mathbf{x})$ of several variables along a parametric curve $\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d$ for $t$ in some closed interval of $\mathbb{R}$. The result stated below for this purpose, a version of the Chain Rule, is a special case of an important fact. We will use the notation $\mathbf{g}'(t) = (g_1'(t), \ldots, g_d'(t))^T$, where $g_i'$ is the derivative of $g_i$. We say that $\mathbf{g}(t)$ is continuously differentiable at $t = t_0$ if each of its components is.

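EXAMPLE: (Parametric Line) The straight line between $\mathbf{x}_0 = (x_{0,1}, \ldots, x_{0,d})^T$ and $\mathbf{x}_1 = (x_{1,1}, \ldots, x_{1,d})^T$ in $\mathbb{R}^d$ can be parametrized as

$$
\mathbf{g}(t) = \mathbf{x}_0 + t(\mathbf{x}_1 - \mathbf{x}_0),
$$

where $t$ goes from $0$ (at which $\mathbf{g}(0) = \mathbf{x}_0$) to $1$ (at which $\mathbf{g}(1) = \mathbf{x}_1$).

Then

$$
g_i'(t) = \frac{\mathrm{d}}{\mathrm{d}t}[x_{0,i} + t(x_{1,i} - x_{0,i})] = x_{1,i} - x_{0,i},
$$

so that

$$
\mathbf{g}'(t) = \mathbf{x}_1 - \mathbf{x}_0.
$$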

Recall the Chain Rule in the single-variable case. Quoting Wikipedia:

The simplest form of the chain rule is for real-valued functions of one real variable. It states that if $g$ is a function that is differentiable at a point $c$ (i.e. the derivative $g'(c)$ exists) and $f$ is a function that is differentiable at $g(c)$, then the composite function $f \circ g$ is differentiable at $c$, and the derivative is $(f \circ g)'(c) = f'(g(c)) \cdot g'(c)$.

Here is a straightforward generalization of the Chain Rule.

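THEOREM (Chain Rule) Let $f : D_1 \to \mathbb{R}$, where $D_1 \subseteq \mathbb{R}$, and let $g : D_2 \to \mathbb{R}$, where $D_2 \subseteq \mathbb{R}^d$. Assume that $f$ is continuously differentiable at $g(\mathbf{x}_0)$, an interior point of $D_1$, and that $g$ is continuously differentiable at $\mathbf{x}_0$, an interior point of $D_2$. Then

$$
\nabla (f \circ g)(\mathbf{x}_0) = f'(g(\mathbf{x}_0)) \, \nabla g(\mathbf{x}_0).
$$

Proof: We apply the Chain Rule for functions of one variable to the partial derivatives. For all $i$,

$$
\frac{\partial}{\partial x_i} f(g(\mathbf{x}_0)) = f'(g(\mathbf{x}_0)) \, \frac{\partial}{\partial x_i} g(\mathbf{x}_0).
$$

Collecting the partial derivatives in a vector gives the claim.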

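Here is a different generalization of the Chain Rule. Again the composition $f \circ \mathbf{g}$ denotes the function $f \circ \mathbf{g}(t) = f(\mathbf{g}(t))$.

THEOREM (Chain Rule) Let $f : D_1 \to \mathbb{R}$, where $D_1 \subseteq \mathbb{R}^d$, and let $\mathbf{g} : D_2 \to \mathbb{R}^d$, where $D_2 \subseteq \mathbb{R}$. Assume that $f$ is continuously differentiable at $\mathbf{g}(t_0)$, an interior point of $D_1$, and that $\mathbf{g}$ is continuously differentiable at $t_0$, an interior point of $D_2$. Then

$$
(f \circ \mathbf{g})'(t_0) = \nabla f(\mathbf{g}(t_0))^T \mathbf{g}'(t_0).
$$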

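Proof: To simplify the notation, suppose that $f$ is a real-valued function of $\mathbf{x} = (x_1, \ldots, x_d)$ whose components are themselves functions of $t \in \mathbb{R}$. Assume $f$ is continuously differentiable at $\mathbf{x}(t)$. To compute the total derivative $\frac{\mathrm{d} f(t)}{\mathrm{d} t}$, let $\Delta x_k = x_k(t + \Delta t) - x_k(t)$, $x_k = x_k(t)$ and

$$
\Delta f = f(x_1 + \Delta x_1, \ldots, x_d + \Delta x_d) - f(x_1, \ldots, x_d).
$$

We seek to compute the limit $\lim_{\Delta t \to 0} \frac{\Delta f}{\Delta t}$. To relate this limit to partial derivatives of $f$, we re-write $\Delta f$ as a telescoping sum where each term involves variation of a single variable $x_k$. That is,

$$
\begin{aligned}
\Delta f
&= [f(x_1 + \Delta x_1, \ldots, x_d + \Delta x_d) - f(x_1, x_2 + \Delta x_2, \ldots, x_d + \Delta x_d)]\\
&\quad + [f(x_1, x_2 + \Delta x_2, \ldots, x_d + \Delta x_d) - f(x_1, x_2, x_3 + \Delta x_3, \ldots, x_d + \Delta x_d)]\\
&\quad + \cdots + [f(x_1, \ldots, x_{d-1}, x_d + \Delta x_d) - f(x_1, \ldots, x_d)].
\end{aligned}
$$

Applying the Mean Value Theorem to each term gives

$$
\begin{aligned}
\Delta f
&= \Delta x_1 \frac{\partial f(x_1 + \theta_1 \Delta x_1, x_2 + \Delta x_2, \ldots, x_d + \Delta x_d)}{\partial x_1}\\
&\quad + \Delta x_2 \frac{\partial f(x_1, x_2 + \theta_2 \Delta x_2, x_3 + \Delta x_3, \ldots, x_d + \Delta x_d)}{\partial x_2}\\
&\quad + \cdots + \Delta x_d \frac{\partial f(x_1, \ldots, x_{d-1}, x_d + \theta_d \Delta x_d)}{\partial x_d}
\end{aligned}
$$

where $0 < \theta_k < 1$ for $k = 1, \ldots, d$. Dividing by $\Delta t$, taking the limit $\Delta t \to 0$ and using the fact that $f$ is continuously differentiable, we get

$$
\frac{\mathrm{d} f(t)}{\mathrm{d} t} = \sum_{k=1}^d \frac{\partial f(\mathbf{x}(t))}{\partial x_k} \frac{\mathrm{d} x_k(t)}{\mathrm{d} t}.
$$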
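As an illustration of this version of the Chain Rule, here is a small numerical check in plain Python. The function `f`, the curve `g` and the point `t0` are assumptions chosen purely for the example. The sketch compares $\nabla f(\mathbf{g}(t_0))^T \mathbf{g}'(t_0)$, computed from hand-derived formulas, with a central finite-difference approximation of the derivative of $t \mapsto f(\mathbf{g}(t))$.

```python
import math


def f(x):
    """Illustrative function f(x1, x2) = x1^2 * x2 + sin(x2)."""
    return x[0] ** 2 * x[1] + math.sin(x[1])


def grad_f(x):
    """Gradient of f, derived by hand."""
    return [2 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]


def g(t):
    """Illustrative parametric curve g(t) = (cos t, t^2)."""
    return [math.cos(t), t ** 2]


def g_prime(t):
    """Componentwise derivative of the curve g."""
    return [-math.sin(t), 2 * t]


def chain_rule_derivative(t):
    """Right-hand side of the Chain Rule: grad f(g(t))^T g'(t)."""
    return sum(a * b for a, b in zip(grad_f(g(t)), g_prime(t)))


def fd_derivative(t, h=1e-6):
    """Central finite-difference approximation of d/dt f(g(t))."""
    return (f(g(t + h)) - f(g(t - h))) / (2 * h)


t0 = 0.8
chain_value = chain_rule_derivative(t0)
fd_value = fd_derivative(t0)
print("chain rule:        ", chain_value)
print("finite differences:", fd_value)
```

The two values should match to several digits; any remaining discrepancy comes from the finite-difference approximation, not from the Chain Rule.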

As a first application of the Chain Rule, we generalize the Mean Value Theorem to the case of several variables. We will use this result later to prove a multivariable Taylor expansion result that will play a central role in this chapter.

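THEOREM (Mean Value) Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}^d$. Let $\mathbf{x}_0 \in D$ and $\delta > 0$ be such that $B_\delta(\mathbf{x}_0) \subseteq D$. If $f$ is continuously differentiable on $B_\delta(\mathbf{x}_0)$, then for any $\mathbf{x} \in B_\delta(\mathbf{x}_0)$

$$
f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p}
$$

for some $\xi \in (0,1)$, where $\mathbf{p} = \mathbf{x} - \mathbf{x}_0$.

One way to think of the Mean Value Theorem is as a $0$-th order Taylor expansion. It says that, when $\mathbf{x}$ is close to $\mathbf{x}_0$, the value $f(\mathbf{x})$ is close to $f(\mathbf{x}_0)$ in a way that can be controlled in terms of the gradient in the neighborhood of $\mathbf{x}_0$. From this point of view, the term $\nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p}$ is called the Lagrange remainder.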

Proof idea: We apply the single-variable result and the Chain Rule.

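Proof: Let $\phi(t) = f(\boldsymbol{\alpha}(t))$ where $\boldsymbol{\alpha}(t) = \mathbf{x}_0 + t \mathbf{p}$. Observe that $\phi(0) = f(\mathbf{x}_0)$ and $\phi(1) = f(\mathbf{x})$. By the Chain Rule and the parametric line example,

$$
\phi'(t) = \nabla f(\boldsymbol{\alpha}(t))^T \boldsymbol{\alpha}'(t) = \nabla f(\boldsymbol{\alpha}(t))^T \mathbf{p} = \nabla f(\mathbf{x}_0 + t \mathbf{p})^T \mathbf{p}.
$$

In particular, $\phi$ has a continuous first derivative on $[0,1]$. By the Mean Value Theorem in the single-variable case

$$
\phi(t) = \phi(0) + t \phi'(\xi)
$$

for some $\xi \in (0,t)$. Plugging in the expressions for $\phi(0)$ and $\phi'(\xi)$ and taking $t = 1$ gives the claim.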
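The theorem only asserts that a suitable $\xi$ exists. For a concrete function one can locate such a $\xi$ numerically. The sketch below, in plain Python, uses an assumed example $f(x_1, x_2) = e^{x_1} + x_2^2$ with $\mathbf{x}_0 = (0,0)$ and $\mathbf{x} = (1,1)$, and runs bisection on the hypothetical helper `gap`, which measures how far $\nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p}$ is from $f(\mathbf{x}) - f(\mathbf{x}_0)$. Bisection applies here because `gap` changes sign between $\xi = 0$ and $\xi = 1$ for this particular choice; that is a property of the example, not of the theorem.

```python
import math


def f(x):
    """Illustrative function f(x1, x2) = exp(x1) + x2^2."""
    return math.exp(x[0]) + x[1] ** 2


def grad_f(x):
    """Gradient of f, derived by hand."""
    return [math.exp(x[0]), 2 * x[1]]


x0 = [0.0, 0.0]
x = [1.0, 1.0]
p = [b - a for a, b in zip(x0, x)]  # p = x - x0


def gap(xi):
    """grad f(x0 + xi * p)^T p - (f(x) - f(x0)); a root is a valid xi."""
    point = [a + xi * step for a, step in zip(x0, p)]
    slope = sum(gc * step for gc, step in zip(grad_f(point), p))
    return slope - (f(x) - f(x0))


# For this example gap(0) = 1 - e < 0 and gap(1) = 2 > 0, so bisection applies.
left, right = 0.0, 1.0
for _ in range(60):
    middle = 0.5 * (left + right)
    if gap(middle) < 0:
        left = middle
    else:
        right = middle

xi = 0.5 * (left + right)
print("xi =", xi)
print("f(x)                          =", f(x))
print("f(x0) + grad f(x0 + xi p)^T p =", f(x0) + gap(xi) + (f(x) - f(x0)))
```

The computed `xi` lies strictly between $0$ and $1$, and the two printed function values agree up to rounding, as the theorem predicts.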

3.2.2. Second-order derivatives

One can also define higher-order derivatives. We start with the single-variable case, where $f : D \to \mathbb{R}$ with $D \subseteq \mathbb{R}$ and $x_0 \in D$ is an interior point of $D$. Note that, if $f'$ exists in $D$, then it is itself a function of $x$. Then the second derivative at $x_0$ is

$$
f''(x_0) = \frac{\mathrm{d}^2 f(x_0)}{\mathrm{d} x^2} = \lim_{h \to 0} \frac{f'(x_0 + h) - f'(x_0)}{h}
$$

provided the limit exists.

In the several variable case, we have the following:

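DEFINITION (Second Partial Derivatives and Hessian) Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}^d$ and let $\mathbf{x}_0 \in D$ be an interior point of $D$. Assume that $f$ is continuously differentiable in an open ball around $\mathbf{x}_0$. Then $\partial f(\mathbf{x})/\partial x_i$ is itself a function of $\mathbf{x}$ and its partial derivative with respect to $x_j$, if it exists, is denoted by

$$
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} = \lim_{h \to 0} \frac{\frac{\partial f}{\partial x_i}(\mathbf{x}_0 + h \mathbf{e}_j) - \frac{\partial f}{\partial x_i}(\mathbf{x}_0)}{h}.
$$

To simplify the notation, we write this as $\partial^2 f(\mathbf{x}_0)/\partial x_i^2$ when $j = i$. If $\partial^2 f(\mathbf{x})/\partial x_j \partial x_i$ and $\partial^2 f(\mathbf{x})/\partial x_i^2$ exist and are continuous in an open ball around $\mathbf{x}_0$ for all $i, j$, we say that $f$ is twice continuously differentiable at $\mathbf{x}_0$.

The matrix of second derivatives is called the Hessian and is denoted by

$$
\mathbf{H}_f(\mathbf{x}_0) = \begin{pmatrix}
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d \partial x_1}\\
\vdots & \ddots & \vdots\\
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1 \partial x_d} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d^2}
\end{pmatrix}.
$$

Like $f$ and the gradient $\nabla f$, the Hessian $\mathbf{H}_f$ is a function of $\mathbf{x}$. It is a matrix-valued function however.

When $f$ is twice continuously differentiable at $\mathbf{x}_0$, its Hessian is a symmetric matrix.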

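THEOREM (Symmetry of the Hessian) Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}^d$ and let $\mathbf{x}_0 \in D$ be an interior point of $D$. Assume that $f$ is twice continuously differentiable at $\mathbf{x}_0$. Then for all $i \neq j$

$$
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} = \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_i \partial x_j}.
$$

Proof idea: Two applications of the Mean Value Theorem show that the limits can be interchanged.

Proof: By definition of the partial derivative,

$$
\begin{aligned}
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i}
&= \lim_{h_j \to 0} \frac{\frac{\partial f}{\partial x_i}(\mathbf{x}_0 + h_j \mathbf{e}_j) - \frac{\partial f}{\partial x_i}(\mathbf{x}_0)}{h_j}\\
&= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_j h_i} \left\{[f(\mathbf{x}_0 + h_j \mathbf{e}_j + h_i \mathbf{e}_i) - f(\mathbf{x}_0 + h_j \mathbf{e}_j)] - [f(\mathbf{x}_0 + h_i \mathbf{e}_i) - f(\mathbf{x}_0)]\right\}\\
&= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{[f(\mathbf{x}_0 + h_i \mathbf{e}_i + h_j \mathbf{e}_j) - f(\mathbf{x}_0 + h_i \mathbf{e}_i)] - [f(\mathbf{x}_0 + h_j \mathbf{e}_j) - f(\mathbf{x}_0)]}{h_j}\right\}\\
&= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{\partial}{\partial x_j}\left[f(\mathbf{x}_0 + h_i \mathbf{e}_i + \theta_j h_j \mathbf{e}_j) - f(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j)\right]\right\}\\
&= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{\partial f}{\partial x_j}(\mathbf{x}_0 + h_i \mathbf{e}_i + \theta_j h_j \mathbf{e}_j) - \frac{\partial f}{\partial x_j}(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j)\right\}
\end{aligned}
$$

for some $\theta_j \in (0,1)$. Note that, on the third line, we rearranged the terms and, on the fourth line, we applied the Mean Value Theorem to $f(\mathbf{x}_0 + h_i \mathbf{e}_i + h_j \mathbf{e}_j) - f(\mathbf{x}_0 + h_j \mathbf{e}_j)$ as a continuously differentiable function of $h_j$.

Because $\partial f/\partial x_j$ is continuously differentiable in an open ball around $\mathbf{x}_0$, a second application of the Mean Value Theorem gives for some $\theta_i \in (0,1)$

$$
\begin{aligned}
&\lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{\partial f}{\partial x_j}(\mathbf{x}_0 + h_i \mathbf{e}_i + \theta_j h_j \mathbf{e}_j) - \frac{\partial f}{\partial x_j}(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j)\right\}\\
&\quad = \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{\partial}{\partial x_i}\left[\frac{\partial f}{\partial x_j}(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j + \theta_i h_i \mathbf{e}_i)\right]\\
&\quad = \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{\partial^2 f(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j + \theta_i h_i \mathbf{e}_i)}{\partial x_i \partial x_j}.
\end{aligned}
$$

The claim then follows from the continuity of $\partial^2 f/\partial x_i \partial x_j$.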
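The symmetry of mixed partial derivatives can be observed numerically. The following plain-Python sketch uses an assumed smooth function $f(x_1, x_2) = x_1^3 x_2^2 + \sin(x_1 x_2)$ whose first partial derivatives are derived by hand; all names (`df_dx1`, `df_dx2`, `partial`) are illustrative. It differentiates $\partial f/\partial x_1$ with respect to $x_2$ and $\partial f/\partial x_2$ with respect to $x_1$ by central finite differences and compares the two results.

```python
import math


def df_dx1(x):
    """First partial of f(x1, x2) = x1^3 x2^2 + sin(x1 x2) with respect to x1."""
    return 3 * x[0] ** 2 * x[1] ** 2 + x[1] * math.cos(x[0] * x[1])


def df_dx2(x):
    """First partial of the same f with respect to x2."""
    return 2 * x[0] ** 3 * x[1] + x[0] * math.cos(x[0] * x[1])


def partial(func, x, j, h=1e-5):
    """Central finite-difference partial derivative of func in coordinate j."""
    x_plus, x_minus = list(x), list(x)
    x_plus[j] += h
    x_minus[j] -= h
    return (func(x_plus) - func(x_minus)) / (2 * h)


x0 = [0.7, -1.2]
d2_x2_x1 = partial(df_dx1, x0, 1)  # differentiate df/dx1 with respect to x2
d2_x1_x2 = partial(df_dx2, x0, 0)  # differentiate df/dx2 with respect to x1
print("d/dx2 of df/dx1:", d2_x2_x1)
print("d/dx1 of df/dx2:", d2_x1_x2)
```

Both orders of differentiation give the same value up to finite-difference error, consistent with the theorem for this twice continuously differentiable example.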

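EXAMPLE: Consider the quadratic function

$$
f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r.
$$

Recall that the gradient of $f$ is

$$
\nabla f(\mathbf{x}) = \frac{1}{2}[P + P^T]\mathbf{x} + \mathbf{q}.
$$

To simplify the calculation, let $B = \frac{1}{2}[P + P^T]$ and denote the rows of $B$ by $\mathbf{b}_1^T, \ldots, \mathbf{b}_d^T$.

Each component of $\nabla f$ is an affine function of $\mathbf{x}$, specifically,

$$
\frac{\partial f(\mathbf{x})}{\partial x_i} = \mathbf{b}_i^T \mathbf{x} + q_i.
$$

Row $i$ of the Hessian is simply the transposed gradient of $\frac{\partial f(\mathbf{x})}{\partial x_i}$ which, by our previous results, is

$$
\left(\nabla \frac{\partial f(\mathbf{x})}{\partial x_i}\right)^T = \mathbf{b}_i^T.
$$

Putting this together we get

$$
\mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T].
$$

Observe that this is indeed a symmetric matrix.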
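This Hessian can be recovered numerically by differentiating the gradient. The plain-Python sketch below is only an illustration: the helper names `quad_grad` and `fd_hessian` and the values of `P`, `q` and `x` are assumptions. Row $i$ of the approximate Hessian is filled with finite-difference partial derivatives of $\partial f/\partial x_i$, following the definition above.

```python
def quad_grad(P, q, x):
    """Gradient 0.5 * (P + P^T) x + q of f(x) = 0.5 x^T P x + q^T x + r."""
    d = len(x)
    return [
        0.5 * sum((P[i][j] + P[j][i]) * x[j] for j in range(d)) + q[i]
        for i in range(d)
    ]


def fd_hessian(grad, x, h=1e-5):
    """Finite-difference Hessian: entry (i, j) approximates d/dx_j of df/dx_i."""
    d = len(x)
    H = [[0.0] * d for _ in range(d)]
    for j in range(d):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        g_plus, g_minus = grad(x_plus), grad(x_minus)
        for i in range(d):
            H[i][j] = (g_plus[i] - g_minus[i]) / (2 * h)
    return H


# Illustrative data with a non-symmetric P.
P = [[2.0, 1.0], [-3.0, 4.0]]
q = [1.0, -1.0]
x = [0.3, -0.7]

H_numeric = fd_hessian(lambda z: quad_grad(P, q, z), x)
H_formula = [[0.5 * (P[i][j] + P[j][i]) for j in range(2)] for i in range(2)]
print("finite differences:", H_numeric)
print("0.5 * (P + P^T):   ", H_formula)
```

Even though `P` itself is not symmetric, both matrices are symmetric and agree up to rounding, and the result does not depend on the point `x` at which the Hessian is evaluated.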

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 What does it mean for a function $f$ to be continuously differentiable at $\mathbf{x}_0$?

a) $f$ is continuous at $\mathbf{x}_0$.

b) All partial derivatives of $f$ exist at $\mathbf{x}_0$.

c) All partial derivatives of $f$ exist and are continuous in an open ball around $\mathbf{x}_0$.

d) The gradient of $f$ is zero at $\mathbf{x}_0$.

2 What is the gradient of a function $f : D \to \mathbb{R}$, where $D \subseteq \mathbb{R}^d$, at a point $\mathbf{x}_0 \in D$?

a) The rate of change of $f$ with respect to $\mathbf{x}$ at $\mathbf{x}_0$

b) The vector of all second partial derivatives of $f$ at $\mathbf{x}_0$

c) The vector of all first partial derivatives of $f$ at $\mathbf{x}_0$

d) The matrix of all second partial derivatives of $f$ at $\mathbf{x}_0$

3 Which of the following statements is true about the Hessian matrix of a twice continuously differentiable function?

a) It is always a diagonal matrix.

b) It is always a symmetric matrix.

c) It is always an invertible matrix.

d) It is always a positive definite matrix.

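4 Let $f(x, y, z) = x^2 + y^2 - z^2$. What is the Hessian matrix of $f$?

a) $\begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & -2 \end{pmatrix}$

b) $\begin{pmatrix} 2x & 2y & -2z \end{pmatrix}$

c) $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$

d) $\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$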

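5 What is the Hessian matrix of the quadratic function $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r$, where $P \in \mathbb{R}^{d \times d}$ and $\mathbf{q} \in \mathbb{R}^d$?

a) $\mathbf{H}_f(\mathbf{x}) = P$

b) $\mathbf{H}_f(\mathbf{x}) = P^T$

c) $\mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T]$

d) $\mathbf{H}_f(\mathbf{x}) = [P + P^T]$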

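Answer for 1: c. Justification: The text states, “If $\frac{\partial f}{\partial x_i}$ exists and is continuous in an open ball around $\mathbf{x}_0$ for all $i$, then we say that $f$ is continuously differentiable at $\mathbf{x}_0$.”

Answer for 2: c. Justification: From the text: “The (column) vector $\nabla f(\mathbf{x}_0) = \left(\frac{\partial f(\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x}_0)}{\partial x_d}\right)$ is called the gradient of $f$ at $\mathbf{x}_0$.”

Answer for 3: b. Justification: The text states: “When $f$ is twice continuously differentiable at $\mathbf{x}_0$, its Hessian is a symmetric matrix.”

Answer for 4: a. Justification: The Hessian is the matrix of second partial derivatives:

$$
\begin{pmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial x \partial z}\\
\frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} & \frac{\partial^2 f}{\partial y \partial z}\\
\frac{\partial^2 f}{\partial z \partial x} & \frac{\partial^2 f}{\partial z \partial y} & \frac{\partial^2 f}{\partial z^2}
\end{pmatrix}
=
\begin{pmatrix}
2 & 0 & 0\\
0 & 2 & 0\\
0 & 0 & -2
\end{pmatrix}
$$

Answer for 5: c. Justification: The text shows that the Hessian of the quadratic function is $\mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T]$.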