\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

3.7. Exercises#

3.7.1. Warm-up worksheets#

(with help from Claude, Gemini, and ChatGPT)

Section 3.2

E3.2.1 Calculate the gradient of the function \(f(x_1, x_2) = 3x_1^2 - 2x_1x_2 + 4x_2^2 - 5x_1 + 2x_2\) at the point \((1, -1)\).
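
A minimal NumPy sketch for checking a hand-computed gradient numerically via central finite differences; the helper `num_grad` is an ad hoc assumption, not part of the text, and the same check can be reused for the other gradient exercises in this section.

```python
import numpy as np

# E3.2.1: finite-difference check of a hand-computed gradient.
def f(x):
    x1, x2 = x
    return 3*x1**2 - 2*x1*x2 + 4*x2**2 - 5*x1 + 2*x2

def num_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

print(num_grad(f, [1.0, -1.0]))  # compare with your analytic gradient at (1, -1)
```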

E3.2.2 Find the Hessian matrix of the function \(f(x_1, x_2, x_3) = 2x_1x_2 + 3x_2x_3 - x_1^2 - 4x_3^2\).

E3.2.3 Compute the partial derivatives \(\frac{\partial f}{\partial x_1}\) and \(\frac{\partial f}{\partial x_2}\) for the function \(f(x_1, x_2) = \sin(x_1) \cos(x_2)\) at the point \((\frac{\pi}{4}, \frac{\pi}{3})\).

E3.2.4 Let \(f(x_1, x_2) = e^{x_1} + x_1x_2 - x_2^2\) and \(g(t) = (t^2, t)\). Calculate \((f \circ g)'(1)\) using the Chain Rule.

E3.2.5 Verify the Symmetry of the Hessian for the function \(f(x_1, x_2) = x_1^3 + 3x_1x_2^2 - 2x_2^3\) at the point \((1, 2)\).

E3.2.6 Find the gradient of the function \(f(x_1, x_2, x_3) = 2x_1 - 3x_2 + 4x_3\) at the point \((1, -2, 3)\).

E3.2.7 Calculate the second-order partial derivatives of the function \(f(x_1, x_2) = x_1^2 \sin(x_2)\).

E3.2.8 Compute the gradient of the quadratic function \(f(x_1, x_2) = 3x_1^2 + 2x_1x_2 - x_2^2 + 4x_1 - 2x_2\) at the point \((1, -1)\).

E3.2.9 Find the Hessian matrix of the function \(f(x_1, x_2, x_3) = x_1^2 + 2x_2^2 + 3x_3^2 - 2x_1x_2 + 4x_1x_3 - 6x_2x_3\).
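
If helpful, here is a similar sketch for spot-checking a Hessian numerically (the `num_hess` helper is again an assumption for illustration); for a quadratic like E3.2.9 the result is a constant matrix.

```python
import numpy as np

# E3.2.9: finite-difference approximation of the Hessian.
def f(x):
    x1, x2, x3 = x
    return x1**2 + 2*x2**2 + 3*x3**2 - 2*x1*x2 + 4*x1*x3 - 6*x2*x3

def num_hess(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            H[i, j] = (f(x+ei+ej) - f(x+ei-ej) - f(x-ei+ej) + f(x-ei-ej)) / (4*h**2)
    return H

print(np.round(num_hess(f, np.zeros(3)), 3))  # constant Hessian for a quadratic
```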

E3.2.10 Apply the multivariable Mean Value Theorem to the function \(f(x_1, x_2) = x_1^2 + x_2^2\) on the line segment joining the points \((0, 0)\) and \((1, 2)\).

E3.2.11 Find the partial derivatives \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) of the function \(f(x, y) = x^3y^2 - 2xy^3 + y^4\).

E3.2.12 Compute the gradient \(\nabla f(x, y)\) of the function \(f(x, y) = \ln(x^2 + 2y^2)\) at the point \((1, 2)\).

E3.2.13 Find the Hessian matrix of the function \(g(x, y) = \sin(x) \cos(y)\).

E3.2.14 If \(p(x, y, z) = x^2yz + 3xyz^2\), find all second partial derivatives of \(p\).

E3.2.15 Verify that the function \(q(x, y) = x^3 - 3xy^2\) satisfies Laplace’s equation: \(\frac{\partial^2 q}{\partial x^2} + \frac{\partial^2 q}{\partial y^2} = 0\).

E3.2.16 Consider the function \(s(x, y) = x^2 + 4xy + 5y^2\). Use the Mean Value Theorem to approximate \(s(1.1, 0.9)\) using the point \((1, 1)\).

E3.2.17 A particle moves along a path given by \(c(t) = (t^2, t^3)\). The temperature at a point \((x, y)\) is given by \(u(x, y) = e^{-x^2 - y^2}\). Find the rate of change of the temperature experienced by the particle at time \(t = 1\).

E3.2.18 Apply the Chain Rule to find \(\frac{d}{dt} f(g(t))\) for \(f(x, y) = x^2 + y^2\) and \(g(t) = (t, \sin t)\).

E3.2.19 Use the Chain Rule to find \(\frac{d}{dt} f(g(t))\) for \(f(x, y) = xy\) and \(g(t) = (t^2, \cos t)\).

Section 3.3

E3.3.1 Consider the function \(f(x_1, x_2) = x_1^2 + 2x_2^2\). Find all points \((x_1, x_2)\) where \(\nabla f(x_1, x_2) = 0\).

E3.3.2 Let \(f(x_1, x_2) = x_1^2 - x_2^2\). Find the directional derivative of \(f\) at the point \((1, 1)\) in the direction \(v = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\).

E3.3.3 Consider the function \(f(x_1, x_2) = x_1^2 + 2x_1x_2 + x_2^2\). Find the second directional derivative of \(f\) at the point \((1, 1)\) in the direction \(v = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\).

E3.3.4 Consider the optimization problem \(\min_{x_1, x_2} x_1^2 + x_2^2\) subject to \(x_1 + x_2 = 1\). Write down the Lagrangian function \(L(x_1, x_2, \lambda)\).

E3.3.5 For the optimization problem in E3.3.4, find all points \((x_1, x_2, \lambda)\) satisfying the first-order necessary conditions.
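
For E3.3.4-E3.3.5, the stationary point found from the Lagrangian can be cross-checked numerically, assuming SciPy is available (a sanity check, not a substitute for working out the first-order conditions):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize x1^2 + x2^2 subject to x1 + x2 = 1 and compare with the Lagrangian solution.
res = minimize(lambda x: x[0]**2 + x[1]**2,
               x0=np.zeros(2),
               constraints={'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1})
print(res.x)
```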

E3.3.6 For the optimization problem in E3.3.4, check the second-order sufficient conditions at the point \((\frac{1}{2}, \frac{1}{2}, -1)\).

E3.3.7 Consider the function \(f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2\). Find all points \((x_1, x_2, x_3)\) satisfying the first-order necessary conditions for the optimization problem \(\min_{x_1, x_2, x_3} f(x_1, x_2, x_3)\) subject to \(x_1 + 2x_2 + 3x_3 = 6\).

E3.3.8 For the optimization problem in E3.3.7, check the second-order sufficient conditions at the point \((\frac{3}{7}, \frac{6}{7}, \frac{9}{7}, -\frac{6}{7})\).

E3.3.9 Let \(f: \mathbb{R}^2 \to \mathbb{R}\) be defined by \(f(x_1, x_2) = x_1^3 - 3x_1x_2^2\). Determine whether the vector \(v = (1, 1)\) is a descent direction for \(f\) at the point \((1, 0)\).

E3.3.10 Let \(f: \mathbb{R}^2 \to \mathbb{R}\) be defined by \(f(x_1, x_2) = x_1^2 + x_2^2 - 2x_1 - 4x_2 + 5\). Find the Hessian matrix of \(f\).

E3.3.11 Let \(f: \mathbb{R}^2 \to \mathbb{R}\) be defined by \(f(x_1, x_2) = -x_1^2 - x_2^2\). Compute the second directional derivative of \(f\) at the point \((0, 0)\) in the direction \(v = (1, 0)\).

E3.3.12 Calculate the directional derivative of \(f(x,y) = x^2 + y^2\) at the point \((1,1)\) in the direction of \(\mathbf{v} = (1,1)\).

E3.3.13 Determine the Hessian matrix of \(f(x,y) = x^3 + y^3 - 3xy\) at the point \((1,1)\).

Section 3.4

E3.4.1 Find the convex combination of points \((2, 3)\) and \((4, 5)\) with \(\alpha = 0.3\).

E3.4.2 Determine whether the following set is convex:

\[ S = \{(x_1, x_2) \in \mathbb{R}^2 : x_1^2 + x_2^2 \leq 1\}. \]

E3.4.3 Let \(S_1 = \{x \in \mathbb{R}^2 : x_1 + x_2 \leq 1\}\) and \(S_2 = \{x \in \mathbb{R}^2 : x_1 - x_2 \leq 1\}\). Show that \(S_1 \cap S_2\) is a convex set.

E3.4.4 Verify that the function \(f(x) = \log(e^{x_1} + e^{x_2})\) is convex using the Hessian matrix.
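
A numerical spot-check (not a proof) of E3.4.4, assuming NumPy: the Hessian of the log-sum-exp function can be written in terms of softmax probabilities, and its smallest eigenvalue should be nonnegative at any sampled point.

```python
import numpy as np

# E3.4.4: check that the Hessian of f(x) = log(exp(x1) + exp(x2)) is PSD at random points.
# The closed form H = diag(p) - p p^T, with p the softmax of x, is for you to verify.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    p = np.exp(x) / np.exp(x).sum()
    H = np.diag(p) - np.outer(p, p)
    print(np.linalg.eigvalsh(H).min() >= -1e-12)
```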

E3.4.5 Find the global minimizer of the function \(f(x) = x^2 + 2x + 1\).

E3.4.6 Consider the function \(f(x) = |x|\). Show that \(f\) is convex but not differentiable at \(x = 0\).

E3.4.7 Verify that the function \(f(x, y) = x^2 + 2y^2\) is strongly convex with \(m = 2\).

E3.4.8 Find the projection of the point \(x = (1, 2)\) onto the set \(D = \{(x_1, x_2) \in \mathbb{R}^2 : x_1 + x_2 = 1\}\).
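
For E3.4.8, the projection onto a hyperplane \(\{\mathbf{x} : \mathbf{a}^T \mathbf{x} = b\}\) has a closed form; a minimal sketch of the plug-in computation (the formula itself is for you to derive or recall):

```python
import numpy as np

# Projection of x onto {z : a^T z = b}: x - (a^T x - b) a / ||a||^2.
a, b = np.array([1.0, 1.0]), 1.0
x = np.array([1.0, 2.0])
print(x - (a @ x - b) / (a @ a) * a)
```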

E3.4.9 Let \(f(x) = x^4 - 2x^2 + 1\). Determine whether \(f\) is convex.

E3.4.10 Let \(f(x) = e^x\). Determine whether \(f\) is log-concave.

E3.4.11 Let \(f(x) = x^2 - 2x + 2\). Determine whether \(f\) is strongly convex.

E3.4.12 Let \(A = \begin{pmatrix} 1 & 2 \\ 2 & 5 \end{pmatrix}\) and \(b = \begin{pmatrix} 1 \\ 0 \end{pmatrix}\). Find the vector \(x\) that minimizes \(\|Ax - b\|_2^2\).
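
E3.4.12 can be checked numerically with a least-squares solver, assuming NumPy; compare with the solution of the normal equations \(A^T A \mathbf{x} = A^T \mathbf{b}\) done by hand.

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 5.0]])
b = np.array([1.0, 0.0])
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)
```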

E3.4.13 Given the set \(D = \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 < 4\}\), determine if \(D\) is convex.

E3.4.14 Determine if the set \(D = \{(x, y) \in \mathbb{R}^2 : x + y \leq 1\}\) is convex.

E3.4.15 Determine if the set \(D = \{(x, y) \in \mathbb{R}^2 : x^2 - y^2 \leq 1\}\) is convex.

Section 3.5

E3.5.1 Given the function \(f(x, y) = x^2 + 4y^2\), find the direction of steepest descent at the point \((1, 1)\).

E3.5.2 Consider the function \(f(x) = x^2 + 4x + 5\). Compute the gradient \(\nabla f(x)\) and find the direction of steepest descent at \(x_0 = 1\).

E3.5.3 Given the function \(f(x) = x^3 - 6x^2 + 9x + 2\), perform two iterations of gradient descent starting from \(x_0 = 0\) with a step size of \(\alpha = 0.1\).
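
A minimal loop for reproducing the two iterates in E3.5.3 (the derivative `fprime` below is a hand computation you should confirm):

```python
# Gradient descent on f(x) = x^3 - 6x^2 + 9x + 2 with step size 0.1 from x0 = 0.
fprime = lambda x: 3 * x**2 - 12 * x + 9
x, alpha = 0.0, 0.1
for t in range(2):
    x = x - alpha * fprime(x)
    print(t + 1, x)
```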

E3.5.4 Let \(f(x) = x^2\). Show that \(f\) is \(L\)-smooth for some \(L > 0\).

E3.5.5 Let \(f(x) = x^2\). Perform one step of gradient descent starting from \(x_0 = 2\) with step size \(\alpha = 0.1\).

E3.5.6 Consider the \(2\)-smooth function \(f(x) = x^2\). Starting from \(x_0 = 3\), compute the point \(x_1\) obtained after one iteration of gradient descent with step size \(\alpha = \frac{1}{2}\).

E3.5.7 Verify that the function \(f(x) = 2x^2 + 1\) is \(4\)-strongly convex.

E3.5.8 For the \(4\)-strongly convex function \(f(x) = 2x^2 + 1\) with global minimizer \(x^* = 0\), compute \(\frac{1}{2m}\|\nabla f(1)\|^2\) and compare it with \(f(1) - f(x^*)\).

E3.5.9 Let \(f(x) = x^4\). Is \(f\) \(L\)-smooth for some \(L > 0\)? Justify your answer.

E3.5.10 Let \(f(x) = x^3\). Is \(f\) \(m\)-strongly convex for some \(m > 0\)? Justify your answer.

E3.5.11 Let \(f(x) = x^2 + 2x + 1\). Show that \(f\) is \(m\)-strongly convex for some \(m > 0\).

E3.5.12 Let \(f(x) = x^2 + 2x + 1\). Perform one step of gradient descent starting from \(x_0 = -3\) with step size \(\alpha = 0.2\).

E3.5.13 Given the function \(f(x, y) = x^2 + y^2\), perform two steps of gradient descent starting at \((1, 1)\) with a step size \(\alpha = 0.1\).
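
The same kind of loop works in two dimensions for E3.5.13; a minimal sketch with NumPy:

```python
import numpy as np

# Gradient descent on f(x, y) = x^2 + y^2 with step size 0.1 from (1, 1).
grad = lambda v: 2 * v
v, alpha = np.array([1.0, 1.0]), 0.1
for t in range(2):
    v = v - alpha * grad(v)
    print(t + 1, v)
```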

E3.5.14 For a quadratic function \(f(x) = ax^2 + bx + c\) with \(a > 0\), derive the value of \(\alpha\) for which gradient descent converges to the minimum in one step.

Section 3.6

E3.6.1 Compute the log-odds of an event with probability \(p = 0.25\).

E3.6.2 Given the features \(\mathbf{x} = (2, -1)\) and the coefficients \(\boldsymbol{\alpha} = (0.5, -0.3)\), compute the linear combination \(\boldsymbol{\alpha}^T \mathbf{x}\).

E3.6.3 For the data point \(\mathbf{x} = (1, 3)\) with the label \(b = 1\) and the coefficients \(\boldsymbol{\alpha} = (-0.2, 0.4)\), calculate the probability \(p(\mathbf{x}; \boldsymbol{\alpha})\) using the sigmoid function.
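
A one-line numerical check for E3.6.3, assuming the model \(p(\mathbf{x}; \boldsymbol{\alpha}) = \sigma(\boldsymbol{\alpha}^T \mathbf{x})\) as in the text:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([1.0, 3.0])
alpha = np.array([-0.2, 0.4])
print(sigmoid(alpha @ x))
```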

E3.6.4 Compute the cross-entropy loss for a single data point \(\mathbf{x} = (1, 2)\) with label \(b = 0\), given \(\boldsymbol{\alpha} = (0.3, -0.4)\).

E3.6.5 Prove that the logistic function \(\sigma(z) = \frac{1}{1 + e^{-z}}\) satisfies \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\).

E3.6.6 Given the feature vectors \(\alpha_1 = (1, 2)\), \(\alpha_2 = (-1, 1)\), \(\alpha_3 = (0, -1)\), and the corresponding labels \(b_1 = 1\), \(b_2 = 0\), \(b_3 = 1\), compute the logistic regression objective function \(\ell(x; A, b)\) for \(x = (1, 1)\).

E3.6.7 For the same data as in E3.6.6, compute the gradient of the logistic regression objective function \(\nabla \ell(x; A, b)\) at \(x = (1, 1)\).

E3.6.8 For the same data as in E3.6.6, compute the Hessian \(H_\ell(x; A, b)\) of the logistic regression objective function at \(x = (1, 1)\).

E3.6.9 For the same data as in E3.6.6, perform one step of gradient descent with step size \(\beta = 0.1\) starting from \(x^0 = (0, 0)\).

E3.6.10 For the same data as in E3.6.6, compute the smoothness parameter \(L\) of the logistic regression objective function.
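
For E3.6.6 through E3.6.10, the following minimal NumPy sketch evaluates the averaged objective, its gradient and Hessian, one gradient step, and a candidate smoothness bound; the formulas and the bound \(L = \lambda_{\max}(A^T A)/(4n)\) are assumptions to check against the definitions in Section 3.6 (see also Problem 3.54).

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
A = np.array([[1.0, 2.0], [-1.0, 1.0], [0.0, -1.0]])  # rows alpha_i^T
b = np.array([1.0, 0.0, 1.0])
n = len(b)

def loss(x):
    p = sigmoid(A @ x)
    return -np.mean(b * np.log(p) + (1 - b) * np.log(1 - p))

def grad(x):
    return -(A.T @ (b - sigmoid(A @ x))) / n

def hess(x):
    p = sigmoid(A @ x)
    return (A.T * (p * (1 - p))) @ A / n

x = np.array([1.0, 1.0])
print(loss(x), grad(x), hess(x), sep="\n")           # E3.6.6-E3.6.8
print(np.zeros(2) - 0.1 * grad(np.zeros(2)))         # E3.6.9: one step from (0, 0)
print(np.linalg.eigvalsh(A.T @ A / (4 * n)).max())   # E3.6.10: candidate bound
```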

3.7.2. Problems#

3.1 Let \(f : \mathbb{R} \to\mathbb{R}\) be twice continuously differentiable, let \(\mathbf{a}_1, \mathbf{a}_2\) be vectors in \(\mathbb{R}^d\), and let \(b_1, b_2 \in \mathbb{R}\). Consider the following real-valued function of \(\mathbf{x} \in \mathbb{R}^d\):

\[ g(\mathbf{x}) = \frac{1}{2} f(\mathbf{a}_1^T \mathbf{x} + b_1) + \frac{1}{2} f(\mathbf{a}_2^T \mathbf{x} + b_2). \]

a) Compute the gradient of \(g\) in vector form in terms of the derivative \(f'\) of \(f\). (By vector form, we mean that it is not enough to write down each element of \(\nabla g(\mathbf{x})\) separately.)

b) Compute the Hessian of \(g\) in matrix form in terms of the first derivative \(f'\) and second derivative \(f''\) of \(f\). (By matrix form, we mean that it is not enough to write down each element of \(H_g (\mathbf{x})\) or each of its columns separately.)

3.2 Give an alternative proof of the Descent Direction and Directional Derivative Lemma using the definition of the directional derivative. [Hint: Mimic the proof of the single-variable case.] \(\lhd\)

3.3 (Adapted from [CVX]) Show that if \(S_1\) and \(S_2\) are convex sets in \(\mathbb{R}^{m+n}\), then so is their partial sum

\[ S = \{(\mathbf{x}, \mathbf{y}_1+\mathbf{y}_2) \,|\, \mathbf{x} \in \mathbb{R}^m, \mathbf{y}_1, \mathbf{y}_2 \in \mathbb{R}^n, (\mathbf{x},\mathbf{y}_1) \in S_1, (\mathbf{x},\mathbf{y}_2)\in S_2\}. \]

\(\lhd\)

3.4 A convex combination of \(\mathbf{z}_1, \ldots, \mathbf{z}_m \in \mathbb{R}^d\) is a linear combination of the form

\[ \mathbf{w} = \sum_{i=1}^m \alpha_i \mathbf{z}_i \]

where \(\alpha_i \geq 0\) for all \(i\) and \(\sum_{i=1}^m \alpha_i = 1\). Show that a set is convex if and only if it contains all convex combinations of its elements. [Hint: Use induction on \(m\).] \(\lhd\)

3.5 A set \(C \subseteq \mathbb{R}^d\) is a cone if, for all \(\mathbf{x} \in C\) and all \(\alpha \geq 0\), \(\alpha \mathbf{x} \in C\). A conic combination of \(\mathbf{z}_1, \ldots, \mathbf{z}_m \in \mathbb{R}^d\) is a linear combination of the form

\[ \mathbf{w} = \sum_{i=1}^m \alpha_i \mathbf{z}_i \]

where \(\alpha_i \geq 0\) for all \(i\). Show that a set \(C\) is a convex cone if and only if it contains all conic combinations of its elements. \(\lhd\)

3.6 Show that any linear subspace of \(\mathbb{R}^d\) is convex. \(\lhd\)

3.7 Show that the set

\[ \mathbb{R}^d_+ = \{\mathbf{x} \in \mathbb{R}^d\,:\, \mathbf{x} \geq \mathbf{0}\} \]

is convex. \(\lhd\)

3.8 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a strictly convex function. Assume further that there is a global minimizer \(\mathbf{x}^*\). Show that it is unique. \(\lhd\)

3.9 Show that \(\mathbf{S}_+^n\) is a convex cone. \(\lhd\)

3.10 Prove (a)-(e) from the Operations that Preserve Convexity Lemma. \(\lhd\)

3.11 Let \(f \,:\, \mathbb{R}^d \to \mathbb{R}\) be a function. The epigraph of \(f\) is the set

\[ \mathbf{epi} f = \{(\mathbf{x}, y)\,:\, \mathbf{x} \in \mathbb{R}^d, y \geq f(\mathbf{x})\}. \]

Show that \(f\) is convex if and only if \(\mathbf{epi} f\) is a convex set. \(\lhd\)

3.12 Let \(f_1, \ldots, f_m : \mathbb{R}^d \to \mathbb{R}\) be convex functions and let \(\beta_1, \ldots, \beta_m \geq 0\). Show that

\[ f(\mathbf{x}) = \sum_{i=1}^m \beta_i f_i(\mathbf{x}) \]

is convex. \(\lhd\)

3.13 Let \(f_1, f_2 : \mathbb{R}^d \to \mathbb{R}\) be convex functions. Show that the pointwise maximum function

\[ f(\mathbf{x}) = \max\{f_1(\mathbf{x}), f_2(\mathbf{x})\} \]

is convex. [Hint: First show that \(\max\{\alpha + \beta, \eta + \phi\} \leq \max\{\alpha, \eta\} + \max\{\beta, \phi\}\).]\(\lhd\)

3.14 Prove the following composition theorem: if \(f : \mathbb{R}^d \to \mathbb{R}\) is convex and \(g : \mathbb{R} \to \mathbb{R}\) is convex and nondecreasing, then the composition \(h = g \circ f\) is convex. \(\lhd\)

3.15 Show that \(f(x) = e^{\beta x}\), where \(x, \beta \in \mathbb{R}\), is convex. \(\lhd\)

3.16 Let \(f : \mathbb{R} \to \mathbb{R}\) be twice continuously differentiable. Show that \(f\) is \(L\)-smooth if \(|f''(x)| \leq L\) for all \(x \in \mathbb{R}\). \(\lhd\)

3.17 For \(a \in [-1,1]\) and \(b \in \{0,1\}\), let \(\hat{f}(x, a) = \sigma(x a)\) where

\[ \sigma(t) = \frac{1}{1 + e^{-t}} \]

be a classifier parametrized by \(x \in \mathbb{R}\). For a dataset \(a_i \in [-1,1]\) and \(b_i \in \{0,1\}\), \(i=1,\ldots,n\), let the cross-entropy loss be

\[ \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = \frac{1}{n} \sum_{i=1}^n \ell(x, a_i, b_i) \]

where

\[ \ell(x, a, b) = - b \log(\hat{f}(x, a)) - (1-b) \log(1 - \hat{f}(x, a)). \]

a) Show that \(\sigma'(t) = \sigma(t)(1- \sigma(t))\) for all \(t \in \mathbb{R}\).

b) Use (a) to show that

\[ \frac{\partial}{\partial x} \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = - \frac{1}{n} \sum_{i=1}^n (b_i - \hat{f}(x,a_i)) \,a_i. \]

c) Use (b) to show that

\[ \frac{\partial^2}{\partial x^2} \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = \frac{1}{n} \sum_{i=1}^n \hat{f}(x,a_i) \,(1 - \hat{f}(x,a_i)) \,a_i^2. \]

d) Use (c) to show that, for any dataset \(\{(a_i, b_i)\}_{i=1}^n\), \(\mathcal{L}\) is convex and \(1\)-smooth as a function of \(x\). \(\lhd\)
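
The derivative formulas in (b) and (c) lend themselves to a quick finite-difference spot-check on a random dataset; a minimal sketch (not part of the proof), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(-1, 1, size=10)
b = rng.integers(0, 2, size=10).astype(float)
sigma = lambda t: 1 / (1 + np.exp(-t))

def L(x):
    p = sigma(x * a)
    return np.mean(-b * np.log(p) - (1 - b) * np.log(1 - p))

x0, h = 0.3, 1e-5
analytic = -np.mean((b - sigma(x0 * a)) * a)   # formula from part (b)
numeric = (L(x0 + h) - L(x0 - h)) / (2 * h)
print(analytic, numeric)
```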

3.18 Prove the Quadratic Bound for Strongly Convex Functions. [Hint: Adapt the proof of the Quadratic Bound for Smooth Functions.] \(\lhd\)

3.19 Show that \(\log x \leq x - 1\) for all \(x > 0\). [Hint: Compute the derivative of \(s(x) = x - 1 - \log x\) and the value \(s(1)\).] \(\lhd\)

3.20 Consider the affine function

\[ f(\mathbf{x}) = \mathbf{q}^T \mathbf{x} + r \]

where \(\mathbf{x} = (x_1, \ldots, x_d)^T, \mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d\) and \(r \in \mathbb{R}\). Fix a value \(c \in \mathbb{R}\). Assume that

\[ f(\mathbf{x}_0) = f(\mathbf{x}_1) = c. \]

Show that

\[ \nabla f(\mathbf{x}_0)^T(\mathbf{x}_1 - \mathbf{x}_0) = 0. \]

\(\lhd\)

3.21 Let \(f : D_1 \to \mathbb{R}\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^d\), where \(D_2 \subseteq \mathbb{R}\). Assume that \(f\) is continuously differentiable at \(\mathbf{g}(t_0)\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(t_0\), an interior point of \(D_2\). Assume further that there is a value \(c \in \mathbb{R}\) such that

\[ f \circ \mathbf{g}(t) = c, \qquad \forall t \in D_2. \]

Show that \(\mathbf{g}'(t_0)\) is orthogonal to \(\nabla f(\mathbf{g}(t_0))\). \(\lhd\)

3.22 Fix a partition \(C_1,\ldots,C_k\) of \([n]\). Under the \(k\)-means objective, its cost is

\[ \mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) \]

where

\[\begin{align*} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) &= \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 = \sum_{j=1}^n \|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|^2, \end{align*}\]

with \(\boldsymbol{\mu}_i \in \mathbb{R}^d\) the center of cluster \(C_i\) and \(c(j)\) the index of the cluster containing \(j\).

a) Suppose \(\boldsymbol{\mu}_i^*\) is a global minimizer of

\[ F_i(\boldsymbol{\mu}_i) = \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2, \]

for each \(i\). Show that \(\boldsymbol{\mu}_1^*, \ldots, \boldsymbol{\mu}_k^*\) is a global minimizer of \(G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)\).

b) Compute the gradient of \(F_i(\boldsymbol{\mu}_i)\).

c) Find the stationary points of \(F_i(\boldsymbol{\mu}_i)\).

d) Compute the Hessian of \(F_i(\boldsymbol{\mu}_i)\).

e) Show that \(F_i(\boldsymbol{\mu}_i)\) is convex.

f) Compute the global minimizer of \(G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)\). Justify your answer. \(\lhd\)

3.23 Consider the quadratic function

\[ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r \]

where \(P\) is symmetric. Show that, if \(P\) is positive definite, then \(f\) is strongly convex. \(\lhd\)

3.24 A closed halfspace is a set of the form

\[ K = \{\mathbf{x} \in \mathbb{R}^d\,:\,\mathbf{a}^T \mathbf{x} \leq b\}, \]

for some \(\mathbf{a} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).

a) Show that \(K\) is convex.

b) Show that there exists \(\mathbf{x}_0 \in \mathbb{R}^d\) such that

\[ K = \{\mathbf{x} \in \mathbb{R}^d\,:\,\mathbf{a}^T (\mathbf{x}-\mathbf{x}_0) \leq 0\}. \]

\(\lhd\)

3.25 A hyperplane is a set of the form

\[ H = \{\mathbf{x} \in \mathbb{R}^d\,:\,\mathbf{a}^T \mathbf{x} = b\}, \]

for some \(\mathbf{a} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).

a) Show that \(H\) is convex.

b) Assume \(b = 0\). Compute \(H^\perp\) and justify your answer. \(\lhd\)

3.26 Recall that the \(\ell^1\)-norm is defined as

\[ \|\mathbf{x}\|_1 = \sum_{i=1}^d |x_i|. \]

a) Show that the \(\ell^1\)-norm satisfies the triangle inequality.

b) Show that the closed \(\ell^1\) ball

\[ \{\mathbf{x} \in \mathbb{R}^d\,:\,\|\mathbf{x}\|_1 \leq r\}, \]

for some \(r > 0\), is convex. \(\lhd\)

3.27 The Lorentz cone is the set

\[ C = \left\{ (\mathbf{x},t) \in \mathbb{R}^d \times \mathbb{R}\,:\, \|\mathbf{x}\|_2 \leq t \right\}. \]

Show that \(C\) is convex. \(\lhd\)

3.28 A polyhedron is a set of the form

\[ K = \left\{ \mathbf{x} \in \mathbb{R}^d\,:\, \mathbf{a}_i^T \mathbf{x} \leq b_i, i=1,\ldots,m, \text{ and } \mathbf{c}_j^T \mathbf{x} = d_j, j=1,\ldots,n \right\}, \]

for some \(\mathbf{a}_i \in \mathbb{R}^d\), \(b_i \in \mathbb{R}\), \(i=1,\ldots,m\) and \(\mathbf{c}_j \in \mathbb{R}^d\), \(d_j \in \mathbb{R}\), \(j=1,\ldots,n\). Show that \(K\) is convex. \(\lhd\)

3.29 The probability simplex in \(\mathbb{R}^d\) is the set

\[ \Delta = \left\{ \mathbf{x} \in \mathbb{R}^d\,:\, \mathbf{x} \geq \mathbf{0} \text{ and } \mathbf{x}^T \mathbf{1} = 1 \right\}. \]

Show that \(\Delta\) is convex. \(\lhd\)

3.30 Consider the function

\[ f^*(\mathbf{y}) = \sup_{\mathbf{x} \in \mathbb{R}^d} \left\{ \mathbf{y}^T\mathbf{x} - \|\mathbf{x}\|_2 \right\}. \]

a) Show that if \(\|\mathbf{y}\|_2 \leq 1\) then \(f^*(\mathbf{y}) = 0\).

b) Show that if \(\|\mathbf{y}\|_2 > 1\) then \(f^*(\mathbf{y}) = +\infty\). \(\lhd\)

3.31 Let \(f, g : D \to \mathbb{R}\) for some \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0\) be an interior point of \(D\). Assume \(f\) and \(g\) are continuously differentiable at \(\mathbf{x}_0\). Let \(h(\mathbf{x}) = f(\mathbf{x}) g(\mathbf{x})\) for all \(\mathbf{x}\) in \(D\). Compute \(\nabla h(\mathbf{x}_0)\). [Hint: You can make use of the corresponding single variable result.] \(\lhd\)

3.32 Let \(f : D \to \mathbb{R}\) for some \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0\) be an interior point of \(D\) where \(f(\mathbf{x}_0) \neq 0\). Assume \(f\) is continuously differentiable at \(\mathbf{x}_0\). Compute \(\nabla (1/f)\). [Hint: You can make use of the corresponding single variable result without proof.] \(\lhd\)

3.33 Let \(f, g : D \to \mathbb{R}\) for some \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0\) be an interior point of \(D\). Assume \(f\) and \(g\) are twice continuously differentiable at \(\mathbf{x}_0\). Let \(h(\mathbf{x}) = f(\mathbf{x}) g(\mathbf{x})\) for all \(\mathbf{x}\) in \(D\). Compute \(\mathbf{H}_h (\mathbf{x}_0)\). \(\lhd\)

3.34 (Adapted from [Rud]) Let \(f : D \to \mathbb{R}\) for some convex open set \(D \subseteq \mathbb{R}^d\). Assume that \(f\) is continuously differentiable everywhere on \(D\) and that further \(\frac{\partial f}{\partial x_1}(\mathbf{x}_0) = 0\) for all \(\mathbf{x}_0 \in D\). Show that \(f\) depends only on \(x_2,\ldots,x_d\). \(\lhd\)

3.35 (Adapted from [Rud]) Let \(\mathbf{g} : D \to \mathbb{R}^d\) for some open set \(D \subseteq \mathbb{R}\). Assume that \(\mathbf{g}\) is continuously differentiable everywhere on \(D\) and that further

\[ \|\mathbf{g}(t)\|^2 = 1, \qquad \forall t \in D. \]

Show that \(\mathbf{g}'(t)^T \mathbf{g}(t) = 0\) for all \(t \in D\). [Hint: Use composition.] \(\lhd\)

3.36 (Adapted from [Khu]) Let \(f : D \to \mathbb{R}\) for some open set \(D \subseteq \mathbb{R}^d\). Assume that \(f\) is continuously differentiable everywhere on \(D\) and that further

\[ f(t \mathbf{x}) = t^n f(\mathbf{x}), \]

for any \(\mathbf{x} \in D\) and any scalar \(t\) such that \(t \mathbf{x} \in D\). In that case, \(f\) is said to be homogeneous of degree \(n\). Prove that for all \(\mathbf{x} \in D\) we have

\[ \mathbf{x}^T \nabla f(\mathbf{x}) = n f(\mathbf{x}). \]

\(\lhd\)

3.37 (Adapted from [Khu]) Apply Taylor’s Theorem in the neighborhood of \(\mathbf{0}\) to the function

\[ f(x_1, x_2) = \exp(x_2 \sin x_1). \]

[Hint: You can use standard formulas for the derivatives of single-variable functions without proof.] \(\lhd\)
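
Once you have the expansion, it can be compared with \(f\) numerically near the origin; in the sketch below the candidate quadratic \(1 + x_1 x_2\) is an assumption for you to confirm by hand, and the error should vanish faster than \(s^2\) as \(s \to 0\).

```python
import numpy as np

f = lambda x1, x2: np.exp(x2 * np.sin(x1))
p = lambda x1, x2: 1 + x1 * x2   # candidate second-order Taylor polynomial at 0
for s in [0.1, 0.01, 0.001]:
    print(s, abs(f(s, -s) - p(s, -s)))
```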

3.38 (Adapted from [Khu]) Apply Taylor’s Theorem in the neighborhood of \(\mathbf{0}\) to the function

\[ f(x_1, x_2) = \cos(x_1 x_2). \]

[Hint: You can use standard formulas for the derivatives of single-variable functions without proof.] \(\lhd\)

3.39 (Adapted from [Khu]) Apply Taylor’s Theorem in the neighborhood of \(\mathbf{0}\) to the function

\[ f(x_1, x_2, x_3) = \sin(e^{x_1} + x_2^2 + x_3^3). \]

[Hint: You can use standard formulas for the derivatives of single-variable functions without proof.] \(\lhd\)

3.40 (Adapted from [Khu]) Consider the function

\[ f(x_1, x_2) = (x_2 - x_1^2)(x_2 - 2 x_1^2). \]

a) Compute the gradient and Hessian of \(f\) at \((0,0)\).

b) Show that \((0,0)\) is not a strict local minimizer of \(f\). [Hint: Consider a parametric curve of the form \(x_1 = t^\alpha\) and \(x_2 = t^\beta\).]

c) Let \(\mathbf{g}(t) = \mathbf{a} t\) for some nonzero vector \(\mathbf{a} = (a_1,a_2)^T \in \mathbb{R}^2\). Show that \(t=0\) is a strict local minimizer of \(h(t) = f(\mathbf{g}(t))\). \(\lhd\)

3.41 Suppose \(f : \mathbb{R}^d \to \mathbb{R}\) is convex. Let \(A \in \mathbb{R}^{d \times m}\) and \(\mathbf{b} \in \mathbb{R}^d\). Show that

\[ g(\mathbf{x}) = f(A \mathbf{x} + \mathbf{b}) \]

is convex over \(\mathbb{R}^m\). \(\lhd\)

3.42 (Adapted from [CVX]) For \(\mathbf{x} = (x_1,\ldots,x_n)\in\mathbb{R}^n\), let

\[ x_{[1]} \geq x_{[2]} \geq \cdots \geq x_{[n]}, \]

be the coordinates of \(\mathbf{x}\) in non-increasing order. Let \(1 \leq r \leq n\) be an integer and \(w_1 \geq w_2 \geq \cdots \geq w_r \geq 0\). Show that

\[ f(\mathbf{x}) = \sum_{i=1}^r w_i x_{[i]}, \]

is convex. \(\lhd\)

3.43 For a fixed \(\mathbf{z} \in \mathbb{R}^d\), let

\[ f(\mathbf{x}) = \|\mathbf{x} - \mathbf{z}\|_2^2. \]

a) Compute the gradient and Hessian of \(f\).

b) Show that \(f\) is strongly convex.

c) For a finite set \(Z \subseteq \mathbb{R}^d\), let

\[ g(\mathbf{x}) = \max_{\mathbf{z} \in Z} \|\mathbf{x} - \mathbf{z}\|_2^2. \]

Use Problem 3.13 to show that \(g\) is convex. \(\lhd\)

3.44 Let \(S_i\), \(i \in I\), be a collection of convex subsets of \(\mathbb{R}^d\). Here \(I\) can be infinite. Show that

\[ \bigcap_{i \in I} S_i, \]

is convex. \(\lhd\)

3.45 a) Let \(f_i\), \(i \in I\), be a collection of convex real-valued functions over \(\mathbb{R}^d\). Use Problems 3.11, 3.44 to show that

\[ g(\mathbf{x}) = \sup_{i \in I} f_i(\mathbf{x}), \]

is convex.

b) Let \(f : \mathbb{R}^{d+m} \to \mathbb{R}\) be a convex function and let \(S \subseteq \mathbb{R}^{m}\) be a (not necessarily convex) set. Use (a) to show that the function

\[ g(\mathbf{x}) = \sup_{\mathbf{y} \in S} f(\mathbf{x},\mathbf{y}), \]

is convex. You can assume that \(g(\mathbf{x}) < +\infty\) for all \(\mathbf{x} \in \mathbb{R}^d\). \(\lhd\)

3.46 Use Problem 3.45 to show that the largest eigenvalue of a matrix is a convex function over the set of all symmetric matrices. [Hint: First show that \(\langle \mathbf{x}, A\mathbf{x}\rangle\) is a linear function of \(A\).] \(\lhd\)

3.47 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a convex function. Show that the set of global minimizers is convex. \(\lhd\)

3.48 Consider the set of symmetric \(n \times n\) matrices

\[ \mathbf{S}^n = \left\{ X \in \mathbb{R}^{n \times n}\,:\, X = X^T \right\}. \]

a) Show that \(\mathbf{S}^n\) is a linear subspace of the vector space of \(n \times n\) matrices in the sense that for all \(X_1, X_2 \in \mathbf{S}^n\) and \(\alpha \in \mathbb{R}\)

\[ X_1 + X_2 \in \mathbf{S}^n, \]

and

\[ \alpha X_1 \in \mathbf{S}^n. \]

b) Show that there exists a basis of \(\mathbf{S}^n\) of size \(d = {n \choose 2} + n\), that is, a collection of symmetric matrices \(X_1,\ldots,X_d \in \mathbf{S}^n\) such that any matrix \(Y \in \mathbf{S}^n\) can be written as a linear combination

\[ Y = \sum_{i=1}^d \alpha_i X_i, \]

for some \(\alpha_1,\ldots,\alpha_d \in \mathbb{R}\).

c) Show that the matrices \(X_1,\ldots,X_d \in \mathbf{S}^n\) you constructed in (b) are linearly independent in the sense that

\[ \sum_{i=1}^d \alpha_i X_i = \mathbf{0}_{n \times n} \quad \implies \quad \alpha_1 = \cdots = \alpha_d = 0. \]

\(\lhd\)

3.49 Let \(f : D \to \mathbb{R}\) be a convex function over the set

\[ D = \{\mathbf{x} = (x_1,\ldots,x_d) \in \mathbb{R}^d: x_i \geq 0, \forall i\}. \]

a) Show that \(D\) is convex.

b) Show that the First-Order Optimality Conditions for Convex Functions on Convex Sets reduce to

\[ \frac{\partial f(\mathbf{x}_0)}{\partial x_i} \geq 0, \qquad \forall i \in [d] \]

and

\[ \frac{\partial f(\mathbf{x}_0)}{\partial x_i} = 0, \qquad \text{if $x_i^0 > 0$}. \]

[Hint: Choose the right \(\mathbf{y}\) in the condition of the theorem.] \(\lhd\)

3.50 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be twice continuously differentiable and \(m\)-strongly convex with \(m>0\).

a) Show that

\[ f(\mathbf{y}_n) \to +\infty, \]

along any sequence \((\mathbf{y}_n)\) such that \(\|\mathbf{y}_n\| \to +\infty\) as \(n \to +\infty.\)

b) Show that, for any \(\mathbf{x}^0\), the set

\[ C = \{\mathbf{x} \in \mathbb{R}^d\,:\,f(\mathbf{x}) \leq f(\mathbf{x}^0)\}, \]

is closed and bounded.

c) Show that there exists a unique global minimizer \(\mathbf{x}^*\) of \(f\). \(\lhd\)

3.51 Suppose that \(f : \mathbb{R}^d \to \mathbb{R}\) is \(L\)-smooth and \(m\)-strongly convex with a global minimizer at \(\mathbf{x}^*\). Apply gradient descent with step size \(\alpha = 1/L\) started from an arbitrary point \(\mathbf{x}^0\).

a) Show that \(\mathbf{x}^t \to \mathbf{x}^*\) as \(t \to +\infty\). [Hint: Use the Quadratic Bound for Strongly Convex Functions at \(\mathbf{x}^*\).]

b) Give a bound on \(\|\mathbf{x}^t - \mathbf{x}^*\|\) depending on \(t, m, L, f(\mathbf{x}^0)\) and \(f(\mathbf{x}^*)\).

\(\lhd\)

3.52 Let \(f : \mathbb{R}^d \to \mathbb{R}\) be twice continuously differentiable.

a) Show that

\[ \nabla f(\mathbf{y}) = \nabla f(\mathbf{x}) + \int_{0}^1 \mathbf{H}_f(\mathbf{x} + \xi \mathbf{p}) \,\mathbf{p} \,\mathrm{d}\xi, \]

where \(\mathbf{p} = \mathbf{y} - \mathbf{x}\). Here the integral of a vector-valued function is performed entry-wise. [Hint: Let \(g_i(\xi) = \frac{\partial}{\partial x_i} f(\mathbf{x} + \xi \mathbf{p})\). The fundamental theorem of calculus implies that \(g_i(1) - g_i(0) = \int_{0}^1 g_i'(\xi) \,\mathrm{d}\xi.\)]

b) Assume \(f\) is \(L\)-smooth. Use (a) to show that the gradient of \(f\) is \(L\)-Lipschitz in the sense that

\[ \|\nabla f(\mathbf{y}) - \nabla f(\mathbf{x})\|_2 \leq L \|\mathbf{y} - \mathbf{x}\|_2, \qquad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^d. \]

\(\lhd\)

3.53 Consider the quadratic function

\[ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r, \]

where \(P\) is a symmetric matrix. Use the First-Order Convexity Condition to show that \(f\) is convex if and only if \(P\) is positive semidefinite. \(\lhd\)

3.54 Let

\[ \ell(\mathbf{x}; A, \mathbf{b}) = \frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) - (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\right\}. \]

be the loss function in logistic regression, where \(A \in \mathbb{R}^{n \times d}\) has rows \(\boldsymbol{\alpha}_i^T\). Show that

\[ \mathbf{H}_{\ell}(\mathbf{x}; A, \mathbf{b}) \preceq \frac{1}{4n} A^T A. \]

[Hint: Read the proof of the Smoothness of logistic regression lemma.] \(\lhd\)
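
A random spot-check (not a proof) of the claimed matrix inequality, assuming NumPy; it verifies that \(\frac{1}{4n} A^T A - \mathbf{H}_{\ell}\) has nonnegative eigenvalues at a random point.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 3
A = rng.normal(size=(n, d))
b = rng.integers(0, 2, size=n).astype(float)
x = rng.normal(size=d)
sigma = lambda t: 1 / (1 + np.exp(-t))
p = sigma(A @ x)
H = (A.T * (p * (1 - p))) @ A / n        # Hessian of the logistic loss at x
print(np.linalg.eigvalsh(A.T @ A / (4 * n) - H).min() >= -1e-10)
```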