\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

3.7. Exercises#

3.7.1. Warm-up worksheets#

(with help from Claude, Gemini, and ChatGPT)

Section 3.2

E3.2.1 Calculate the gradient of the function \(f(x_1, x_2) = 3x_1^2 - 2x_1x_2 + 4x_2^2 - 5x_1 + 2x_2\) at the point \((1, -1)\).
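
A minimal NumPy sketch for checking a hand-computed gradient numerically via central finite differences; the helper `num_grad` is an ad hoc assumption, not part of the text, and the same check can be reused for the other gradient exercises in this section.

```python
import numpy as np

# E3.2.1: finite-difference check of a hand-computed gradient.
def f(x):
    x1, x2 = x
    return 3*x1**2 - 2*x1*x2 + 4*x2**2 - 5*x1 + 2*x2

def num_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

print(num_grad(f, [1.0, -1.0]))  # compare with your analytic gradient at (1, -1)
```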

E3.2.2 Find the Hessian matrix of the function \(f(x_1, x_2, x_3) = 2x_1x_2 + 3x_2x_3 - x_1^2 - 4x_3^2\).

E3.2.3 Compute the partial derivatives \(\frac{\partial f}{\partial x_1}\) and \(\frac{\partial f}{\partial x_2}\) for the function \(f(x_1, x_2) = \sin(x_1) \cos(x_2)\) at the point \((\frac{\pi}{4}, \frac{\pi}{3})\).

E3.2.4 Let \(f(x_1, x_2) = e^{x_1} + x_1x_2 - x_2^2\) and \(g(t) = (t^2, t)\). Calculate \((f \circ g)'(1)\) using the Chain Rule.

E3.2.5 Verify the Symmetry of the Hessian for the function \(f(x_1, x_2) = x_1^3 + 3x_1x_2^2 - 2x_2^3\) at the point \((1, 2)\).

E3.2.6 Find the gradient of the function \(f(x_1, x_2, x_3) = 2x_1 - 3x_2 + 4x_3\) at the point \((1, -2, 3)\).

E3.2.7 Calculate the second-order partial derivatives of the function \(f(x_1, x_2) = x_1^2 \sin(x_2)\).

E3.2.8 Compute the gradient of the quadratic function \(f(x_1, x_2) = 3x_1^2 + 2x_1x_2 - x_2^2 + 4x_1 - 2x_2\) at the point \((1, -1)\).

E3.2.9 Find the Hessian matrix of the function \(f(x_1, x_2, x_3) = x_1^2 + 2x_2^2 + 3x_3^2 - 2x_1x_2 + 4x_1x_3 - 6x_2x_3\).
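
If helpful, here is a similar sketch for spot-checking a Hessian numerically (the `num_hess` helper is again an assumption for illustration); for a quadratic like E3.2.9 the result is a constant matrix.

```python
import numpy as np

# E3.2.9: finite-difference approximation of the Hessian.
def f(x):
    x1, x2, x3 = x
    return x1**2 + 2*x2**2 + 3*x3**2 - 2*x1*x2 + 4*x1*x3 - 6*x2*x3

def num_hess(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            H[i, j] = (f(x+ei+ej) - f(x+ei-ej) - f(x-ei+ej) + f(x-ei-ej)) / (4*h**2)
    return H

print(np.round(num_hess(f, np.zeros(3)), 3))  # constant Hessian for a quadratic
```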

E3.2.10 Apply the multivariable Mean Value Theorem to the function \(f(x_1, x_2) = x_1^2 + x_2^2\) on the line segment joining the points \((0, 0)\) and \((1, 2)\).

E3.2.11 Find the partial derivatives \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) of the function \(f(x, y) = x^3y^2 - 2xy^3 + y^4\).

E3.2.12 Compute the gradient \(\nabla f(x, y)\) of the function \(f(x, y) = \ln(x^2 + 2y^2)\) at the point \((1, 2)\).

E3.2.13 Find the Hessian matrix of the function \(g(x, y) = \sin(x) \cos(y)\).

E3.2.14 If \(p(x, y, z) = x^2yz + 3xyz^2\), find all second partial derivatives of \(p\).

E3.2.15 Verify that the function \(q(x, y) = x^3 - 3xy^2\) satisfies Laplace’s equation: \(\frac{\partial^2 q}{\partial x^2} + \frac{\partial^2 q}{\partial y^2} = 0\).

E3.2.16 Consider the function \(s(x, y) = x^2 + 4xy + 5y^2\). Use the Mean Value Theorem to approximate \(s(1.1, 0.9)\) using the point \((1, 1)\).

E3.2.17 A particle moves along a path given by \(c(t) = (t^2, t^3)\). The temperature at a point \((x, y)\) is given by \(u(x, y) = e^{-x^2 - y^2}\). Find the rate of change of the temperature experienced by the particle at time \(t = 1\).

E3.2.18 Apply the Chain Rule to find \(\frac{d}{dt} f(g(t))\) for \(f(x, y) = x^2 + y^2\) and \(g(t) = (t, \sin t)\).

E3.2.19 Use the Chain Rule to find \(\frac{d}{dt} f(g(t))\) for \(f(x, y) = xy\) and \(g(t) = (t^2, \cos t)\).

Section 3.3

E3.3.1 Consider the function \(f(x_1, x_2) = x_1^2 + 2x_2^2\). Find all points \((x_1, x_2)\) where \(\nabla f(x_1, x_2) = 0\).

E3.3.2 Let \(f(x_1, x_2) = x_1^2 - x_2^2\). Find the directional derivative of \(f\) at the point \((1, 1)\) in the direction \(v = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\).

E3.3.3 Consider the function \(f(x_1, x_2) = x_1^2 + 2x_1x_2 + x_2^2\). Find the second directional derivative of \(f\) at the point \((1, 1)\) in the direction \(v = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\).

E3.3.4 Consider the optimization problem \(\min_{x_1, x_2} x_1^2 + x_2^2\) subject to \(x_1 + x_2 = 1\). Write down the Lagrangian function \(L(x_1, x_2, \lambda)\).

E3.3.5 For the optimization problem in E3.3.4, find all points \((x_1, x_2, \lambda)\) satisfying the first-order necessary conditions.
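
For E3.3.4-E3.3.5, the stationary point found from the Lagrangian can be cross-checked numerically, assuming SciPy is available (a sanity check, not a substitute for working out the first-order conditions):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize x1^2 + x2^2 subject to x1 + x2 = 1 and compare with the Lagrangian solution.
res = minimize(lambda x: x[0]**2 + x[1]**2,
               x0=np.zeros(2),
               constraints={'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1})
print(res.x)
```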

E3.3.6 For the optimization problem in E3.3.4, check the second-order sufficient conditions at the point \((\frac{1}{2}, \frac{1}{2}, -1)\).

E3.3.7 Consider the function \(f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2\). Find all points \((x_1, x_2, x_3)\) satisfying the first-order necessary conditions for the optimization problem \(\min_{x_1, x_2, x_3} f(x_1, x_2, x_3)\) subject to \(x_1 + 2x_2 + 3x_3 = 6\).

E3.3.8 For the optimization problem in E3.3.7, check the second-order sufficient conditions at the point \((\frac{3}{7}, \frac{6}{7}, \frac{9}{7}, -\frac{6}{7})\).

E3.3.9 Let \(f: \mathbb{R}^2 \to \mathbb{R}\) be defined by \(f(x_1, x_2) = x_1^3 - 3x_1x_2^2\). Determine whether the vector \(v = (1, 1)\) is a descent direction for \(f\) at the point \((1, 0)\).

E3.3.10 Let \(f: \mathbb{R}^2 \to \mathbb{R}\) be defined by \(f(x_1, x_2) = x_1^2 + x_2^2 - 2x_1 - 4x_2 + 5\). Find the Hessian matrix of \(f\).

E3.3.11 Let \(f: \mathbb{R}^2 \to \mathbb{R}\) be defined by \(f(x_1, x_2) = -x_1^2 - x_2^2\). Compute the second directional derivative of \(f\) at the point \((0, 0)\) in the direction \(v = (1, 0)\).

E3.3.12 Calculate the directional derivative of \(f(x,y) = x^2 + y^2\) at the point \((1,1)\) in the direction of \(\mathbf{v} = (1,1)\).

E3.3.13 Determine the Hessian matrix of \(f(x,y) = x^3 + y^3 - 3xy\) at the point \((1,1)\).

Section 3.4

E3.4.1 Find the convex combination of points \((2, 3)\) and \((4, 5)\) with \(\alpha = 0.3\).

E3.4.2 Determine whether the following set is convex:

\[ S = \{(x_1, x_2) \in \mathbb{R}^2 : x_1^2 + x_2^2 \leq 1\}. \]

E3.4.3 Let \(S_1 = \{x \in \mathbb{R}^2 : x_1 + x_2 \leq 1\}\) and \(S_2 = \{x \in \mathbb{R}^2 : x_1 - x_2 \leq 1\}\). Show that \(S_1 \cap S_2\) is a convex set.

E3.4.4 Verify that the function \(f(x) = \log(e^{x_1} + e^{x_2})\) is convex using the Hessian matrix.
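
A numerical spot-check (not a proof) of E3.4.4, assuming NumPy: the Hessian of the log-sum-exp function can be written in terms of softmax probabilities, and its smallest eigenvalue should be nonnegative at any sampled point.

```python
import numpy as np

# E3.4.4: check that the Hessian of f(x) = log(exp(x1) + exp(x2)) is PSD at random points.
# The closed form H = diag(p) - p p^T, with p the softmax of x, is for you to verify.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    p = np.exp(x) / np.exp(x).sum()
    H = np.diag(p) - np.outer(p, p)
    print(np.linalg.eigvalsh(H).min() >= -1e-12)
```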

E3.4.5 Find the global minimizer of the function \(f(x) = x^2 + 2x + 1\).

E3.4.6 Consider the function \(f(x) = |x|\). Show that \(f\) is convex but not differentiable at \(x = 0\).

E3.4.7 Verify that the function \(f(x, y) = x^2 + 2y^2\) is strongly convex with \(m = 2\).

E3.4.8 Find the projection of the point \(x = (1, 2)\) onto the set \(D = \{(x_1, x_2) \in \mathbb{R}^2 : x_1 + x_2 = 1\}\).
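
For E3.4.8, the projection onto a hyperplane \(\{\mathbf{x} : \mathbf{a}^T \mathbf{x} = b\}\) has a closed form; a minimal sketch of the plug-in computation (the formula itself is for you to derive or recall):

```python
import numpy as np

# Projection of x onto {z : a^T z = b}: x - (a^T x - b) a / ||a||^2.
a, b = np.array([1.0, 1.0]), 1.0
x = np.array([1.0, 2.0])
print(x - (a @ x - b) / (a @ a) * a)
```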

E3.4.9 Let \(f(x) = x^4 - 2x^2 + 1\). Determine whether \(f\) is convex.

E3.4.10 Let \(f(x) = e^x\). Determine whether \(f\) is log-concave.

E3.4.11 Let \(f(x) = x^2 - 2x + 2\). Determine whether \(f\) is strongly convex.

E3.4.12 Let \(A = \begin{pmatrix} 1 & 2 \\ 2 & 5 \end{pmatrix}\) and \(b = \begin{pmatrix} 1 \\ 0 \end{pmatrix}\). Find the vector \(x\) that minimizes \(\|Ax - b\|_2^2\).
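
E3.4.12 can be checked numerically with a least-squares solver, assuming NumPy; compare with the solution of the normal equations \(A^T A \mathbf{x} = A^T \mathbf{b}\) done by hand.

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 5.0]])
b = np.array([1.0, 0.0])
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)
```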

E3.4.13 Given the set \(D = \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 < 4\}\), determine if \(D\) is convex.

E3.4.14 Determine if the set \(D = \{(x, y) \in \mathbb{R}^2 : x + y \leq 1\}\) is convex.

E3.4.15 Determine if the set \(D = \{(x, y) \in \mathbb{R}^2 : x^2 - y^2 \leq 1\}\) is convex.

Section 3.5

E3.5.1 Given the function \(f(x, y) = x^2 + 4y^2\), find the direction of steepest descent at the point \((1, 1)\).

E3.5.2 Consider the function \(f(x) = x^2 + 4x + 5\). Compute the gradient \(\nabla f(x)\) and find the direction of steepest descent at \(x_0 = 1\).

E3.5.3 Given the function \(f(x) = x^3 - 6x^2 + 9x + 2\), perform two iterations of gradient descent starting from \(x_0 = 0\) with a step size of \(\alpha = 0.1\).
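
A minimal loop for reproducing the two iterates in E3.5.3 (the derivative `fprime` below is a hand computation you should confirm):

```python
# Gradient descent on f(x) = x^3 - 6x^2 + 9x + 2 with step size 0.1 from x0 = 0.
fprime = lambda x: 3 * x**2 - 12 * x + 9
x, alpha = 0.0, 0.1
for t in range(2):
    x = x - alpha * fprime(x)
    print(t + 1, x)
```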

E3.5.4 Let \(f(x) = x^2\). Show that \(f\) is \(L\)-smooth for some \(L > 0\).

E3.5.5 Let \(f(x) = x^2\). Perform one step of gradient descent starting from \(x_0 = 2\) with step size \(\alpha = 0.1\).

E3.5.6 Consider the \(2\)-smooth function \(f(x) = x^2\). Starting from \(x_0 = 3\), compute the point \(x_1\) obtained after one iteration of gradient descent with step size \(\alpha = \frac{1}{2}\).

E3.5.7 Verify that the function \(f(x) = 2x^2 + 1\) is \(4\)-strongly convex.

E3.5.8 For the \(4\)-strongly convex function \(f(x) = 2x^2 + 1\) with global minimizer \(x^* = 0\), compute \(\frac{1}{2m}\|\nabla f(1)\|^2\) and compare it with \(f(1) - f(x^*)\).

E3.5.9 Let \(f(x) = x^4\). Is \(f\) \(L\)-smooth for some \(L > 0\)? Justify your answer.

E3.5.10 Let \(f(x) = x^3\). Is \(f\) \(m\)-strongly convex for some \(m > 0\)? Justify your answer.

E3.5.11 Let \(f(x) = x^2 + 2x + 1\). Show that \(f\) is \(m\)-strongly convex for some \(m > 0\).

E3.5.12 Let \(f(x) = x^2 + 2x + 1\). Perform one step of gradient descent starting from \(x_0 = -3\) with step size \(\alpha = 0.2\).

E3.5.13 Given the function \(f(x, y) = x^2 + y^2\), perform two steps of gradient descent starting at \((1, 1)\) with a step size \(\alpha = 0.1\).
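
The same kind of loop works in two dimensions for E3.5.13; a minimal sketch with NumPy:

```python
import numpy as np

# Gradient descent on f(x, y) = x^2 + y^2 with step size 0.1 from (1, 1).
grad = lambda v: 2 * v
v, alpha = np.array([1.0, 1.0]), 0.1
for t in range(2):
    v = v - alpha * grad(v)
    print(t + 1, v)
```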

E3.5.14 For a quadratic function \(f(x) = ax^2 + bx + c\) with \(a > 0\), derive the value of \(\alpha\) for which gradient descent converges to the minimum in one step.

Section 3.6

E3.6.1 Compute the log-odds of an event with probability \(p = 0.25\).

E3.6.2 Given the features \(\mathbf{x} = (2, -1)\) and the coefficients \(\boldsymbol{\alpha} = (0.5, -0.3)\), compute the linear combination \(\boldsymbol{\alpha}^T \mathbf{x}\).

E3.6.3 For the data point \(\mathbf{x} = (1, 3)\) with the label \(b = 1\) and the coefficients \(\boldsymbol{\alpha} = (-0.2, 0.4)\), calculate the probability \(p(\mathbf{x}; \boldsymbol{\alpha})\) using the sigmoid function.
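
A one-line numerical check for E3.6.3, assuming the model \(p(\mathbf{x}; \boldsymbol{\alpha}) = \sigma(\boldsymbol{\alpha}^T \mathbf{x})\) as in the text:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([1.0, 3.0])
alpha = np.array([-0.2, 0.4])
print(sigmoid(alpha @ x))
```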

E3.6.4 Compute the cross-entropy loss for a single data point \(\mathbf{x} = (1, 2)\) with label \(b = 0\), given \(\boldsymbol{\alpha} = (0.3, -0.4)\).

E3.6.5 Prove that the logistic function \(\sigma(z) = \frac{1}{1 + e^{-z}}\) satisfies \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\).

E3.6.6 Given the feature vectors \(\alpha_1 = (1, 2)\), \(\alpha_2 = (-1, 1)\), \(\alpha_3 = (0, -1)\), and the corresponding labels \(b_1 = 1\), \(b_2 = 0\), \(b_3 = 1\), compute the logistic regression objective function \(\ell(x; A, b)\) for \(x = (1, 1)\).

E3.6.7 For the same data as in E3.6.6, compute the gradient of the logistic regression objective function \(\nabla \ell(x; A, b)\) at \(x = (1, 1)\).

E3.6.8 For the same data as in E3.6.6, compute the Hessian \(H_\ell(x; A, b)\) of the logistic regression objective function at \(x = (1, 1)\).

E3.6.9 For the same data as in E3.6.6, perform one step of gradient descent with step size \(\beta = 0.1\) starting from \(x^0 = (0, 0)\).

E3.6.10 For the same data as in E3.6.6, compute the smoothness parameter \(L\) of the logistic regression objective function.
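
For E3.6.6 through E3.6.10, the following minimal NumPy sketch evaluates the averaged objective, its gradient and Hessian, one gradient step, and a candidate smoothness bound; the formulas and the bound \(L = \lambda_{\max}(A^T A)/(4n)\) are assumptions to check against the definitions in Section 3.6 (see also Problem 3.54).

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
A = np.array([[1.0, 2.0], [-1.0, 1.0], [0.0, -1.0]])  # rows alpha_i^T
b = np.array([1.0, 0.0, 1.0])
n = len(b)

def loss(x):
    p = sigmoid(A @ x)
    return -np.mean(b * np.log(p) + (1 - b) * np.log(1 - p))

def grad(x):
    return -(A.T @ (b - sigmoid(A @ x))) / n

def hess(x):
    p = sigmoid(A @ x)
    return (A.T * (p * (1 - p))) @ A / n

x = np.array([1.0, 1.0])
print(loss(x), grad(x), hess(x), sep="\n")           # E3.6.6-E3.6.8
print(np.zeros(2) - 0.1 * grad(np.zeros(2)))         # E3.6.9: one step from (0, 0)
print(np.linalg.eigvalsh(A.T @ A / (4 * n)).max())   # E3.6.10: candidate bound
```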

3.7.2. Problems#

3.1 Let \(f : \mathbb{R} \to\mathbb{R}\) be twice continuously differentiable, let \(\mathbf{a}_1, \mathbf{a}_2\) be vectors in \(\mathbb{R}^d\), and let \(b_1, b_2 \in \mathbb{R}\). Consider the following real-valued function of \(\mathbf{x} \in \mathbb{R}^d\):

\[ g(\mathbf{x}) = \frac{1}{2} f(\mathbf{a}_1^T \mathbf{x} + b_1) + \frac{1}{2} f(\mathbf{a}_2^T \mathbf{x} + b_2). \]

a) Compute the gradient of \(g\) in vector form in terms of the derivative \(f'\) of \(f\). (By vector form, we mean that it is not enough to write down each element of \(\nabla g(\mathbf{x})\) separately.)

b) Compute the Hessian of \(g\) in matrix form in terms of the first derivative \(f'\) and second derivative \(f''\) of \(f\). (By matrix form, we mean that it is not enough to write down each element of \(H_g (\mathbf{x})\) or each of its columns separately.)

3.2 Give an alternative proof of the Descent Direction and Directional Derivative Lemma using the definition of the directional derivative. [Hint: Mimic the proof of the single-variable case.] \(\lhd\)

3.3 (Adapted from [CVX]) Show that if \(S_1\) and \(S_2\) are convex sets in \(\mathbb{R}^{m+n}\), then so is their partial sum

\[ S = \{(\mathbf{x}, \mathbf{y}_1+\mathbf{y}_2) \,|\, \mathbf{x} \in \mathbb{R}^m, \mathbf{y}_1, \mathbf{y}_2 \in \mathbb{R}^n, (\mathbf{x},\mathbf{y}_1) \in S_1, (\mathbf{x},\mathbf{y}_2)\in S_2\}. \]

\(\lhd\)

3.4 A convex combination of \(\mathbf{z}_1, \ldots, \mathbf{z}_m \in \mathbb{R}^d\) is a linear combination of the form

\[ \mathbf{w} = \sum_{i=1}^m \alpha_i \mathbf{z}_i \]

where \(\alpha_i \geq 0\) for all \(i\) and \(\sum_{i=1}^m \alpha_i = 1\). Show that a set is convex if and only if it contains all convex combinations of its elements. [Hint: Use induction on \(m\).] \(\lhd\)

3.5 A set \(C \subseteq \mathbb{R}^d\) is a cone if, for all \(\mathbf{x} \in C\) and all \(\alpha \geq 0\), \(\alpha \mathbf{x} \in C\). A conic combination of \(\mathbf{z}_1, \ldots, \mathbf{z}_m \in \mathbb{R}^d\) is a linear combination of the form

\[ \mathbf{w} = \sum_{i=1}^m \alpha_i \mathbf{z}_i \]

where \(\alpha_i \geq 0\) for all \(i\). Show that a set \(C\) is a convex cone if and only if it contains all conic combinations of its elements. \(\lhd\)

3.6 Show that any linear subspace of \(\mathbb{R}^d\) is convex. \(\lhd\)

3.7 Show that the set

\[ \mathbb{R}^d_+ = \{\mathbf{x} \in \mathbb{R}^d\,:\, \mathbf{x} \geq \mathbf{0}\} \]

is convex. \(\lhd\)

3.8 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a strictly convex function. Assume further that there is a global minimizer \(\mathbf{x}^*\). Show that it is unique. \(\lhd\)

3.9 Show that \(\mathbf{S}_+^n\) is a convex cone. \(\lhd\)

3.10 Prove (a)-(e) from the Operations that Preserve Convexity Lemma. \(\lhd\)

3.11 Let \(f \,:\, \mathbb{R}^d \to \mathbb{R}\) be a function. The epigraph of \(f\) is the set

\[ \mathbf{epi} f = \{(\mathbf{x}, y)\,:\, \mathbf{x} \in \mathbb{R}^d, y \geq f(\mathbf{x})\}. \]

Show that \(f\) is convex if and only if \(\mathbf{epi} f\) is a convex set. \(\lhd\)

3.12 Let \(f_1, \ldots, f_m : \mathbb{R}^d \to \mathbb{R}\) be convex functions and let \(\beta_1, \ldots, \beta_m \geq 0\). Show that

\[ f(\mathbf{x}) = \sum_{i=1}^m \beta_i f_i(\mathbf{x}) \]

is convex. \(\lhd\)

3.13 Let \(f_1, f_2 : \mathbb{R}^d \to \mathbb{R}\) be convex functions. Show that the pointwise maximum function

\[ f(\mathbf{x}) = \max\{f_1(\mathbf{x}), f_2(\mathbf{x})\} \]

is convex. [Hint: First show that \(\max\{\alpha + \beta, \eta + \phi\} \leq \max\{\alpha, \eta\} + \max\{\beta, \phi\}\).]\(\lhd\)

3.14 Prove the following composition theorem: if \(f : \mathbb{R}^d \to \mathbb{R}\) is convex and \(g : \mathbb{R} \to \mathbb{R}\) is convex and nondecreasing, then the composition \(h = g \circ f\) is convex. \(\lhd\)

3.15 Show that \(f(x) = e^{\beta x}\), where \(x, \beta \in \mathbb{R}\), is convex. \(\lhd\)

3.16 Let \(f : \mathbb{R} \to \mathbb{R}\) be twice continuously differentiable. Show that \(f\) is \(L\)-smooth if \(|f''(x)| \leq L\) for all \(x \in \mathbb{R}\). \(\lhd\)

3.17 For \(a \in [-1,1]\) and \(b \in \{0,1\}\), let \(\hat{f}(x, a) = \sigma(x a)\) where

\[ \sigma(t) = \frac{1}{1 + e^{-t}} \]

be a classifier parametrized by \(x \in \mathbb{R}\). For a dataset \(a_i \in [-1,1]\) and \(b_i \in \{0,1\}\), \(i=1,\ldots,n\), let the cross-entropy loss be

\[ \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = \frac{1}{n} \sum_{i=1}^n \ell(x, a_i, b_i) \]

where

\[ \ell(x, a, b) = - b \log(\hat{f}(x, a)) - (1-b) \log(1 - \hat{f}(x, a)). \]

a) Show that \(\sigma'(t) = \sigma(t)(1- \sigma(t))\) for all \(t \in \mathbb{R}\).

b) Use (a) to show that

\[ \frac{\partial}{\partial x} \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = - \frac{1}{n} \sum_{i=1}^n (b_i - \hat{f}(x,a_i)) \,a_i. \]

c) Use (b) to show that

\[ \frac{\partial^2}{\partial x^2} \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = \frac{1}{n} \sum_{i=1}^n \hat{f}(x,a_i) \,(1 - \hat{f}(x,a_i)) \,a_i^2. \]

d) Use (c) to show that, for any dataset \(\{(a_i, b_i)\}_{i=1}^n\), \(\mathcal{L}\) is convex and \(1\)-smooth as a function of \(x\). \(\lhd\)
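
The derivative formulas in (b) and (c) lend themselves to a quick finite-difference spot-check on a random dataset; a minimal sketch (not part of the proof), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(-1, 1, size=10)
b = rng.integers(0, 2, size=10).astype(float)
sigma = lambda t: 1 / (1 + np.exp(-t))

def L(x):
    p = sigma(x * a)
    return np.mean(-b * np.log(p) - (1 - b) * np.log(1 - p))

x0, h = 0.3, 1e-5
analytic = -np.mean((b - sigma(x0 * a)) * a)   # formula from part (b)
numeric = (L(x0 + h) - L(x0 - h)) / (2 * h)
print(analytic, numeric)
```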

3.18 Prove the Quadratic Bound for Strongly Convex Functions. [Hint: Adapt the proof of the Quadratic Bound for Smooth Functions.] \(\lhd\)

3.19 Show that \(\log x \leq x - 1\) for all \(x > 0\). [Hint: Compute the derivative of \(s(x) = x - 1 - \log x\) and the value \(s(1)\).] \(\lhd\)

3.20 Consider the affine function

\[ f(\mathbf{x}) = \mathbf{q}^T \mathbf{x} + r \]

where \(\mathbf{x} = (x_1, \ldots, x_d)^T, \mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d\) and \(r \in \mathbb{R}\). Fix a value \(c \in \mathbb{R}\). Assume that

\[ f(\mathbf{x}_0) = f(\mathbf{x}_1) = c. \]

Show that

\[ \nabla f(\mathbf{x}_0)^T(\mathbf{x}_1 - \mathbf{x}_0) = 0. \]

\(\lhd\)

3.21 Let \(f : D_1 \to \mathbb{R}\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^d\), where \(D_2 \subseteq \mathbb{R}\). Assume that \(f\) is continuously differentiable at \(\mathbf{g}(t_0)\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(t_0\), an interior point of \(D_2\). Assume further that there is a value \(c \in \mathbb{R}\) such that

\[ f \circ \mathbf{g}(t) = c, \qquad \forall t \in D_2. \]

Show that \(\mathbf{g}'(t_0)\) is orthogonal to \(\nabla f(\mathbf{g}(t_0))\). \(\lhd\)

3.22 Fix a partition \(C_1,\ldots,C_k\) of \([n]\). Under the \(k\)-means objective, its cost is

\[ \mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) \]

where

\[\begin{align*} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) &= \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 = \sum_{j=1}^n \|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|^2, \end{align*}\]

with \(\boldsymbol{\mu}_i \in \mathbb{R}^d\) the center of cluster \(C_i\) and \(c(j)\) the index of the cluster containing \(j\).

a) Suppose \(\boldsymbol{\mu}_i^*\) is a global minimizer of

\[ F_i(\boldsymbol{\mu}_i) = \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2, \]

for each \(i\). Show that \(\boldsymbol{\mu}_1^*, \ldots, \boldsymbol{\mu}_k^*\) is a global minimizer of \(G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)\).

b) Compute the gradient of \(F_i(\boldsymbol{\mu}_i)\).

c) Find the stationary points of \(F_i(\boldsymbol{\mu}_i)\).

d) Compute the Hessian of \(F_i(\boldsymbol{\mu}_i)\).

e) Show that \(F_i(\boldsymbol{\mu}_i)\) is convex.

f) Compute the global minimizer of \(G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)\). Justify your answer. \(\lhd\)

3.23 Consider the quadratic function

\[ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r \]

where \(P\) is symmetric. Show that, if \(P\) is positive definite, then \(f\) is strongly convex. \(\lhd\)

3.24 A closed halfspace is a set of the form

\[ K = \{\mathbf{x} \in \mathbb{R}^d\,:\,\mathbf{a}^T \mathbf{x} \leq b\}, \]

for some \(\mathbf{a} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).

a) Show that \(K\) is convex.

b) Show that there exists \(\mathbf{x}_0 \in \mathbb{R}^d\) such that

\[ K = \{\mathbf{x} \in \mathbb{R}^d\,:\,\mathbf{a}^T (\mathbf{x}-\mathbf{x}_0) \leq 0\}. \]

\(\lhd\)

3.25 A hyperplane is a set of the form

\[ H = \{\mathbf{x} \in \mathbb{R}^d\,:\,\mathbf{a}^T \mathbf{x} = b\}, \]

for some \(\mathbf{a} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).

a) Show that \(H\) is convex.

b) Assume \(b = 0\). Compute \(H^\perp\) and justify your answer. \(\lhd\)

3.26 Recall that the \(\ell^1\)-norm is defined as

\[ \|\mathbf{x}\|_1 = \sum_{i=1}^d |x_i|. \]

a) Show that the \(\ell^1\)-norm satisfies the triangle inequality.

b) Show that the closed \(\ell^1\) ball

\[ \{\mathbf{x} \in \mathbb{R}^d\,:\,\|\mathbf{x}\|_1 \leq r\}, \]

for some \(r > 0\), is convex. \(\lhd\)

3.27 The Lorentz cone is the set

\[ C = \left\{ (\mathbf{x},t) \in \mathbb{R}^d \times \mathbb{R}\,:\, \|\mathbf{x}\|_2 \leq t \right\}. \]

Show that \(C\) is convex. \(\lhd\)

3.28 A polyhedron is a set of the form

\[ K = \left\{ \mathbf{x} \in \mathbb{R}^d\,:\, \mathbf{a}_i^T \mathbf{x} \leq b_i, i=1,\ldots,m, \text{ and } \mathbf{c}_j^T \mathbf{x} = d_j, j=1,\ldots,n \right\}, \]

for some \(\mathbf{a}_i \in \mathbb{R}^d\), \(b_i \in \mathbb{R}\), \(i=1,\ldots,m\) and \(\mathbf{c}_j \in \mathbb{R}^d\), \(d_j \in \mathbb{R}\), \(j=1,\ldots,n\). Show that \(K\) is convex. \(\lhd\)

3.29 The probability simplex in \(\mathbb{R}^d\) is the set

\[ \Delta = \left\{ \mathbf{x} \in \mathbb{R}^d\,:\, \mathbf{x} \geq \mathbf{0} \text{ and } \mathbf{x}^T \mathbf{1} = 1 \right\}. \]

Show that \(\Delta\) is convex. \(\lhd\)

3.30 Consider the function

\[ f^*(\mathbf{y}) = \sup_{\mathbf{x} \in \mathbb{R}^d} \left\{ \mathbf{y}^T\mathbf{x} - \|\mathbf{x}\|_2 \right\}. \]

a) Show that if \(\|\mathbf{y}\|_2 \leq 1\) then \(f^*(\mathbf{y}) = 0\).

b) Show that if \(\|\mathbf{y}\|_2 > 1\) then \(f^*(\mathbf{y}) = +\infty\). \(\lhd\)

3.31 Let \(f, g : D \to \mathbb{R}\) for some \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0\) be an interior point of \(D\). Assume \(f\) and \(g\) are continuously differentiable at \(\mathbf{x}_0\). Let \(h(\mathbf{x}) = f(\mathbf{x}) g(\mathbf{x})\) for all \(\mathbf{x}\) in \(D\). Compute \(\nabla h(\mathbf{x}_0)\). [Hint: You can make use of the corresponding single variable result.] \(\lhd\)

3.32 Let \(f : D \to \mathbb{R}\) for some \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0\) be an interior point of \(D\) where \(f(\mathbf{x}_0) \neq 0\). Assume \(f\) is continuously differentiable at \(\mathbf{x}_0\). Compute \(\nabla (1/f)\). [Hint: You can make use of the corresponding single variable result without proof.] \(\lhd\)

3.33 Let \(f, g : D \to \mathbb{R}\) for some \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0\) be an interior point of \(D\). Assume \(f\) and \(g\) are twice continuously differentiable at \(\mathbf{x}_0\). Let \(h(\mathbf{x}) = f(\mathbf{x}) g(\mathbf{x})\) for all \(\mathbf{x}\) in \(D\). Compute \(\mathbf{H}_h (\mathbf{x}_0)\). \(\lhd\)

3.34 (Adapted from [Rud]) Let \(f : D \to \mathbb{R}\) for some convex open set \(D \subseteq \mathbb{R}^d\). Assume that \(f\) is continuously differentiable everywhere on \(D\) and that further \(\frac{\partial f}{\partial x_1}(\mathbf{x}_0) = 0\) for all \(\mathbf{x}_0 \in D\). Show that \(f\) depends only on \(x_2,\ldots,x_d\). \(\lhd\)

3.35 (Adapted from [Rud]) Let \(\mathbf{g} : D \to \mathbb{R}^d\) for some open set \(D \subseteq \mathbb{R}\). Assume that \(\mathbf{g}\) is continuously differentiable everywhere on \(D\) and that further

\[ \|\mathbf{g}(t)\|^2 = 1, \qquad \forall t \in D. \]

Show that \(\mathbf{g}'(t)^T \mathbf{g}(t) = 0\) for all \(t \in D\). [Hint: Use composition.] \(\lhd\)

3.36 (Adapted from [Khu]) Let \(f : D \to \mathbb{R}\) for some open set \(D \subseteq \mathbb{R}^d\). Assume that \(f\) is continuously differentiable everywhere on \(D\) and that further

\[ f(t \mathbf{x}) = t^n f(\mathbf{x}), \]

for any \(\mathbf{x} \in D\) and any scalar \(t\) such that \(t \mathbf{x} \in D\). In that case, \(f\) is said to be homogeneous of degree \(n\). Prove that for all \(\mathbf{x} \in D\) we have

\[ \mathbf{x}^T \nabla f(\mathbf{x}) = n f(\mathbf{x}). \]

\(\lhd\)

3.37 (Adapted from [Khu]) Apply Taylor’s Theorem in the neighborhood of \(\mathbf{0}\) to the function

\[ f(x_1, x_2) = \exp(x_2 \sin x_1). \]

[Hint: You can use standard formulas for the derivatives of single-variable functions without proof.] \(\lhd\)
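
Once you have the expansion, it can be compared with \(f\) numerically near the origin; in the sketch below the candidate quadratic \(1 + x_1 x_2\) is an assumption for you to confirm by hand, and the error should vanish faster than \(s^2\) as \(s \to 0\).

```python
import numpy as np

f = lambda x1, x2: np.exp(x2 * np.sin(x1))
p = lambda x1, x2: 1 + x1 * x2   # candidate second-order Taylor polynomial at 0
for s in [0.1, 0.01, 0.001]:
    print(s, abs(f(s, -s) - p(s, -s)))
```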

3.38 (Adapted from [Khu]) Apply Taylor’s Theorem in the neighborhood of \(\mathbf{0}\) to the function

\[ f(x_1, x_2) = \cos(x_1 x_2). \]

[Hint: You can use standard formulas for the derivatives of single-variable functions without proof.] \(\lhd\)

3.39 (Adapted from [Khu]) Apply Taylor’s Theorem in the neighborhood of \(\mathbf{0}\) to the function

\[ f(x_1, x_2, x_3) = \sin(e^{x_1} + x_2^2 + x_3^3). \]

[Hint: You can use standard formulas for the derivatives of single-variable functions without proof.] \(\lhd\)

3.40 (Adapted from [Khu]) Consider the function

\[ f(x_1, x_2) = (x_2 - x_1^2)(x_2 - 2 x_1^2). \]

a) Compute the gradient and Hessian of \(f\) at \((0,0)\).

b) Show that \((0,0)\) is not a strict local minimizer of \(f\). [Hint: Consider a parametric curve of the form \(x_1 = t^\alpha\) and \(x_2 = t^\beta\).]

c) Let \(\mathbf{g}(t) = \mathbf{a} t\) for some nonzero vector \(\mathbf{a} = (a_1,a_2)^T \in \mathbb{R}^2\). Show that \(t=0\) is a strict local minimizer of \(h(t) = f(\mathbf{g}(t))\). \(\lhd\)

3.41 Suppose \(f : \mathbb{R}^d \to \mathbb{R}\) is convex. Let \(A \in \mathbb{R}^{d \times m}\) and \(\mathbf{b} \in \mathbb{R}^d\). Show that

\[ g(\mathbf{x}) = f(A \mathbf{x} + \mathbf{b}) \]

is convex over \(\mathbb{R}^m\). \(\lhd\)

3.42 (Adapted from [CVX]) For \(\mathbf{x} = (x_1,\ldots,x_n)\in\mathbb{R}^n\), let

\[ x_{[1]} \geq x_{[2]} \geq \cdots \geq x_{[n]}, \]

be the coordinates of \(\mathbf{x}\) in non-increasing order. Let \(1 \leq r \leq n\) be an integer and \(w_1 \geq w_2 \geq \cdots \geq w_r \geq 0\). Show that

\[ f(\mathbf{x}) = \sum_{i=1}^r w_i x_{[i]}, \]

is convex. \(\lhd\)

3.43 For a fixed \(\mathbf{z} \in \mathbb{R}^d\), let

\[ f(\mathbf{x}) = \|\mathbf{x} - \mathbf{z}\|_2^2. \]

a) Compute the gradient and Hessian of \(f\).

b) Show that \(f\) is strongly convex.

c) For a finite set \(Z \subseteq \mathbb{R}^d\), let

\[ g(\mathbf{x}) = \max_{\mathbf{z} \in Z} \|\mathbf{x} - \mathbf{z}\|_2^2. \]

Use Problem 3.13 to show that \(g\) is convex. \(\lhd\)

3.44 Let \(S_i\), \(i \in I\), be a collection of convex subsets of \(\mathbb{R}^d\). Here \(I\) can be infinite. Show that

\[ \bigcap_{i \in I} S_i, \]

is convex. \(\lhd\)

3.45 a) Let \(f_i\), \(i \in I\), be a collection of convex real-valued functions over \(\mathbb{R}^d\). Use Problems 3.11, 3.44 to show that

\[ g(\mathbf{x}) = \sup_{i \in I} f_i(\mathbf{x}), \]

is convex.

b) Let \(f : \mathbb{R}^{d+m} \to \mathbb{R}\) be a convex function and let \(S \subseteq \mathbb{R}^{m}\) be a (not necessarily convex) set. Use (a) to show that the function

\[ g(\mathbf{x}) = \sup_{\mathbf{y} \in S} f(\mathbf{x},\mathbf{y}), \]

is convex. You can assume that \(g(\mathbf{x}) < +\infty\) for all \(\mathbf{x} \in \mathbb{R}^d\). \(\lhd\)

3.46 Use Problem 3.45 to show that the largest eigenvalue of a matrix is a convex function over the set of all symmetric matrices. [Hint: First show that \(\langle \mathbf{x}, A\mathbf{x}\rangle\) is a linear function of \(A\).] \(\lhd\)

3.47 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a convex function. Show that the set of global minimizers is convex. \(\lhd\)

3.48 Consider the set of symmetric \(n \times n\) matrices

\[ \mathbf{S}^n = \left\{ X \in \mathbb{R}^{n \times n}\,:\, X = X^T \right\}. \]

a) Show that \(\mathbf{S}^n\) is a linear subspace of the vector space of \(n \times n\) matrices in the sense that for all \(X_1, X_2 \in \mathbf{S}^n\) and \(\alpha \in \mathbb{R}\)

\[ X_1 + X_2 \in \mathbf{S}^n, \]

and

\[ \alpha X_1 \in \mathbf{S}^n. \]

b) Show that there exists a basis of \(\mathbf{S}^n\) of size \(d = {n \choose 2} + n\), that is, a collection of symmetric matrices \(X_1,\ldots,X_d \in \mathbf{S}^n\) such that any matrix \(Y \in \mathbf{S}^n\) can be written as a linear combination

\[ Y = \sum_{i=1}^d \alpha_i X_i, \]

for some \(\alpha_1,\ldots,\alpha_d \in \mathbb{R}\).

c) Show that the matrices \(X_1,\ldots,X_d \in \mathbf{S}^n\) you constructed in (b) are linearly independent in the sense that

\[ \sum_{i=1}^d \alpha_i X_i = \mathbf{0}_{n \times n} \quad \implies \quad \alpha_1 = \cdots = \alpha_d = 0. \]

\(\lhd\)

3.49 Let \(f : D \to \mathbb{R}\) be a convex function over the set

\[ D = \{\mathbf{x} = (x_1,\ldots,x_d) \in \mathbb{R}^d: x_i \geq 0, \forall i\}. \]

a) Show that \(D\) is convex.

b) Show that the First-Order Optimality Conditions for Convex Functions on Convex Sets reduce to

\[ \frac{\partial f(\mathbf{x}_0)}{\partial x_i} \geq 0, \qquad \forall i \in [d] \]

and

\[ \frac{\partial f(\mathbf{x}_0)}{\partial x_i} = 0, \qquad \text{if $x_i^0 > 0$}. \]

[Hint: Choose the right \(\mathbf{y}\) in the condition of the theorem.] \(\lhd\)

3.50 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be twice continuously differentiable and \(m\)-strongly convex with \(m>0\).

a) Show that

\[ f(\mathbf{y}_n) \to +\infty, \]

along any sequence \((\mathbf{y}_n)\) such that \(\|\mathbf{y}_n\| \to +\infty\) as \(n \to +\infty.\)

b) Show that, for any \(\mathbf{x}^0\), the set

\[ C = \{\mathbf{x} \in \mathbb{R}^d\,:\,f(\mathbf{x}) \leq f(\mathbf{x}^0)\}, \]

is closed and bounded.

c) Show that there exists a unique global minimizer \(\mathbf{x}^*\) of \(f\). \(\lhd\)

3.51 Suppose that \(f : \mathbb{R}^d \to \mathbb{R}\) is \(L\)-smooth and \(m\)-strongly convex with a global minimizer at \(\mathbf{x}^*\). Apply gradient descent with step size \(\alpha = 1/L\) started from an arbitrary point \(\mathbf{x}^0\).

a) Show that \(\mathbf{x}^t \to \mathbf{x}^*\) as \(t \to +\infty\). [Hint: Use the Quadratic Bound for Strongly Convex Functions at \(\mathbf{x}^*\).]

b) Give a bound on \(\|\mathbf{x}^t - \mathbf{x}^*\|\) depending on \(t, m, L, f(\mathbf{x}^0)\) and \(f(\mathbf{x}^*)\).

\(\lhd\)

3.52 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be twice continuously differentiable.

a) Show that

\[ \nabla f(\mathbf{y}) = \nabla f(\mathbf{x}) + \int_{0}^1 \mathbf{H}_f(\mathbf{x} + \xi \mathbf{p}) \,\mathbf{p} \,\mathrm{d}\xi, \]

where \(\mathbf{p} = \mathbf{y} - \mathbf{x}\). Here the integral of a vector-valued function is performed entry-wise. [Hint: Let \(g_i(\xi) = \frac{\partial}{\partial x_i} f(\mathbf{x} + \xi \mathbf{p})\). The fundamental theorem of calculus implies that \(g_i(1) - g_i(0) = \int_{0}^1 g_i'(\xi) \,\mathrm{d}\xi.\)]

b) Assume \(f\) is \(L\)-smooth. Use (a) to show that the gradient of \(f\) is \(L\)-Lipschitz in the sense that

\[ \|\nabla f(\mathbf{y}) - \nabla f(\mathbf{x})\|_2 \leq L \|\mathbf{y} - \mathbf{x}\|_2, \qquad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d. \]

\(\lhd\)

3.53 Consider the quadratic function

\[ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r, \]

where \(P\) is a symmetric matrix. Use the First-Order Convexity Condition to show that \(f\) is convex if and only if \(P\) is positive semidefinite. \(\lhd\)

3.54 Let

\[ \ell(\mathbf{x}; A, \mathbf{b}) = \frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) - (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\right\}. \]

be the loss function in logistic regression, where \(A \in \mathbb{R}^{n \times d}\) has rows \(\boldsymbol{\alpha}_i^T\). Show that

\[ \mathbf{H}_{\ell}(\mathbf{x}; A, \mathbf{b}) \preceq \frac{1}{4n} A^T A. \]

[Hint: Read the proof of the Smoothness of logistic regression lemma.] \(\lhd\)
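
A random spot-check (not a proof) of the claimed matrix inequality, assuming NumPy; it verifies that \(\frac{1}{4n} A^T A - \mathbf{H}_{\ell}\) has nonnegative eigenvalues at a random point.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 3
A = rng.normal(size=(n, d))
b = rng.integers(0, 2, size=n).astype(float)
x = rng.normal(size=d)
sigma = lambda t: 1 / (1 + np.exp(-t))
p = sigma(A @ x)
H = (A.T * (p * (1 - p))) @ A / n        # Hessian of the logistic loss at x
print(np.linalg.eigvalsh(A.T @ A / (4 * n) - H).min() >= -1e-10)
```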