
(Figure: network illustration. Image credit: Made with Midjourney)

3. Optimization theory and algorithms

In this chapter, we turn to optimization theory and algorithms, which lie at the core of modern data science and AI. We derive basic optimality conditions, both in general and in the presence of convexity. We then introduce a fundamental optimization algorithm, the gradient descent method, and detail a theoretical analysis of its convergence under different assumptions. We illustrate these concepts on an important supervised learning problem. Here is a more detailed overview of the main sections of the chapter.

“Background: review of differentiable functions of several variables” This section reviews the fundamental concepts of differentiable functions of several variables. The gradient is defined as a vector of partial derivatives, generalizing the concept of the derivative for functions of a single variable. The section also introduces the Hessian matrix, which contains second-order partial derivatives. The Chain Rule for functions of several variables is presented and used to prove a multivariable version of the Mean Value Theorem. Finally, the section provides examples of calculating gradients and Hessians for various functions, including affine and quadratic functions.
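As a preview of these calculations, here is a minimal sketch (not code from the chapter; the function and variable names are illustrative) that checks the analytical gradient of a quadratic function against central finite differences:

```python
import numpy as np

# Quadratic f(x) = (1/2) x^T P x + q^T x with symmetric P.
# Analytically, grad f(x) = P x + q and Hess f(x) = P.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric
q = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ P @ x + q @ x

def grad_f(x):
    return P @ x + q

# Check the gradient against central finite differences at a test point.
x0 = np.array([0.3, -0.7])
h = 1e-6
fd_grad = np.array([
    (f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(np.allclose(fd_grad, grad_f(x0), atol=1e-6))  # True
```

Such finite-difference checks are a standard way to catch mistakes in hand-derived gradients.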

“Optimality conditions” This section derives optimality conditions for unconstrained continuous optimization problems. It introduces the concept of a global minimizer, a point where a function attains its minimum value over its entire domain. Because global minimizers are generally hard to find, the section also defines local minimizers, points where a function attains its minimum value within a neighborhood. It then presents first-order conditions based on the gradient and second-order conditions based on the Hessian matrix. While the first-order necessary condition, that the gradient vanishes at the point, is not sufficient for a local minimizer, the second-order sufficient condition, that the Hessian is positive definite at a point where the gradient vanishes, guarantees a strict local minimizer. The section concludes by extending the discussion to optimization problems with equality constraints, presenting the method of Lagrange multipliers.
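The gap between the necessary and sufficient conditions can be seen on two simple quadratics, both with a stationary point but only one with a positive definite Hessian there. A small illustrative sketch (not from the chapter):

```python
import numpy as np

# f(x, y) = x^2 + 2 y^2 - 2 x has a stationary point at (1, 0).
# Its (constant) Hessian has all positive eigenvalues,
# so (1, 0) is a strict local minimizer.
H1 = np.array([[2.0, 0.0],
               [0.0, 4.0]])
print(np.linalg.eigvalsh(H1))   # [2. 4.] -> positive definite

# g(x, y) = x^2 - y^2 also has a stationary point, at (0, 0),
# but its Hessian is indefinite: (0, 0) is a saddle point,
# not a minimizer, even though the gradient vanishes there.
H2 = np.array([[2.0, 0.0],
               [0.0, -2.0]])
print(np.linalg.eigvalsh(H2))   # [-2. 2.] -> indefinite
```

Checking the sign of the Hessian's eigenvalues is exactly how the second-order condition is applied in practice.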

“Convexity” This section introduces the concept of convexity, which plays a crucial role in optimization. It defines convex sets and convex functions, and provides examples and properties of each. A set is convex if the line segment connecting any two points in the set lies entirely within the set. A function is convex if its value at any convex combination of two points is less than or equal to the corresponding weighted average of its values at those two points. The section then establishes the connection between convexity and optimization, showing that for convex functions, any local minimizer is also a global minimizer. It also presents conditions for convexity based on the gradient and the Hessian. The section concludes by introducing the notion of strong convexity, which guarantees the existence and uniqueness of a global minimizer, and provides examples of strongly convex functions, such as least-squares objectives.
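Both the defining inequality and strong convexity can be spot-checked numerically on the least-squares example mentioned above. A hedged sketch (illustrative, not the chapter's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Least-squares objective f(x) = ||A x - b||^2; its Hessian is 2 A^T A.
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

def f(x):
    return np.sum((A @ x - b) ** 2)

# Spot-check the convexity inequality
# f((1-a) x + a y) <= (1-a) f(x) + a f(y) on random points.
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    a = rng.uniform()
    assert f((1 - a) * x + a * y) <= (1 - a) * f(x) + a * f(y) + 1e-9

# When A has full column rank, the smallest eigenvalue of 2 A^T A
# is positive, so f is strongly convex and its global minimizer is unique.
print(np.linalg.eigvalsh(2 * A.T @ A).min() > 0)  # True for this A
```

A random check like this cannot prove convexity, of course; the chapter's gradient and Hessian conditions provide the actual proof.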

“Gradient descent and its convergence analysis” This section discusses gradient descent, a numerical optimization method for finding the minimum of a continuously differentiable function. The method iteratively takes steps in the direction of the negative gradient, which is proven to be the steepest descent direction. The convergence of gradient descent is analyzed under different assumptions on the function being minimized. For smooth functions, gradient descent with an appropriate step size is shown to produce a sequence of points whose objective values decrease and whose gradients vanish in the limit. When the function is both smooth and strongly convex, a faster convergence rate is obtained, with the function values converging exponentially to the global minimum. The section also includes examples and Python implementations to illustrate the concepts.
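The section includes its own Python implementations; as a minimal sketch of the idea (names and step size are illustrative assumptions, not the chapter's code), gradient descent with a constant step size looks like this:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, n_iter=100):
    """Plain gradient descent: repeatedly step along the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - alpha * grad(x)
    return x

# Minimize the smooth, strongly convex f(x) = (x1 - 1)^2 + 2 x2^2,
# whose unique global minimizer is (1, 0).
grad = lambda x: np.array([2 * (x[0] - 1), 4 * x[1]])
x_star = gradient_descent(grad, x0=[5.0, 5.0], alpha=0.2, n_iter=100)
print(x_star)  # close to [1. 0.]
```

On this strongly convex example, the error contracts by a constant factor per iteration, matching the exponential convergence rate analyzed in the section.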

“Application: logistic regression” This section applies gradient descent to logistic regression, a method for modeling the probability of a binary outcome from input features. It derives the gradient and Hessian of the logistic regression objective (i.e., the cross-entropy loss), proves that the objective is convex and smooth, and provides Python code implementing gradient descent for logistic regression. The method is then applied to example datasets.
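A compact sketch of the combination (not the chapter's implementation; the dataset and step size are made up for illustration), using the standard cross-entropy gradient X^T (sigmoid(X w) - y) / n:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gd(X, y, alpha=0.5, n_iter=2000):
    """Gradient descent on the average cross-entropy loss.
    Its gradient at w is X^T (sigmoid(X w) - y) / n."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        w -= alpha * X.T @ (sigmoid(X @ w) - y) / n
    return w

# Tiny 1-d toy dataset with an intercept column.
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = logistic_gd(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(int)
print(preds)  # [0 0 1 1]
```

Because the loss is convex and smooth, the convergence guarantees from the previous section apply directly to this iteration.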