Theory: Convexity
Definition
A set \(D\) is called convex if all its elements can be connected by straight lines that lie entirely within \(D\). Specifically, this means that for all \(x_1, x_2 \in D\) and \(y = \theta x_1 + (1-\theta)x_2\) with \(\theta \in [0,1]\), \(y\) is an element of \(D\).
Here \(x_1, x_2\), and \(y\) are vectors of arbitrary (but equal) dimension. Convex sets are simple objects in the sense that any point can be reached from any other point along a straight path that never leaves the set. A function \(f:D \rightarrow \mathbb{R}\) is called convex if \(D\) is convex and every line segment connecting two points on the graph of \(f\) lies on or above the graph. Formally, this means that for all \(x_1, x_2 \in D\) and all \(\theta \in [0,1]\):
$$ \theta f(x_1) + (1-\theta)f(x_2) \ge f(\theta x_1 + (1-\theta) x_2) $$
The fact that the value of a convex \(f\) on the segment between any two points \(x_1\) and \(x_2\) never exceeds the corresponding weighted average of \(f(x_1)\) and \(f(x_2)\) implies the absence of hill-like shapes in the function landscape of \(f\) and a consistently non-negative (upward) curvature. A comprehensive source on the convexity of sets and functions is [1, pp. 23–90].
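For illustration, the function \(f(x)=x^2\) satisfies the inequality above for all \(x_1, x_2 \in \mathbb{R}\), since
$$ \theta x_1^2 + (1-\theta)x_2^2 - \big(\theta x_1 + (1-\theta) x_2\big)^2 = \theta(1-\theta)(x_1-x_2)^2 \ge 0 ~~~~ \theta \in [0,1]. $$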
Relevance
Convexity of sets and functions makes minimization both well-behaved and practically feasible. This is significant, as general optimization problems are typically computationally intractable [2, p. 4]. As shown in the figure above, the presence of multiple isolated local minima implies that a function is not convex. Conversely, every local minimum of a convex \(f\) is also a global minimum.
Additionally, due to the convexity of \(f\), global properties can be inferred from local characteristics. Assuming \(f\) is differentiable (twice differentiable for the second statement), the following statements hold:
\(f(x_2) \ge f(x_1) + \nabla f(x_1)^T[x_2-x_1]\)
\(\nabla^2 f(x)\succeq 0\)
for all \(x_1\) and \(x_2\), where \(\nabla f\) is the gradient (vector of first derivatives) and \(\nabla^2 f\) is the Hessian matrix (matrix of second derivatives) of \(f\). A convex differentiable function is therefore non-negatively curved and is bounded below by its linear approximation at any point. Consequently, the global minimum is attained at a point \(x^*\) with \(\nabla f(x^*)=0\). This follows from \(f(x_1)\ge f(x^*) + \underbrace{\nabla f(x^*)^T[x_1-x^*]}_{=0} = f(x^*)\) for all \(x_1\).
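Both conditions can be checked numerically. The following is a minimal sketch, assuming NumPy; the quadratic test function, the matrix \(Q\), the vector \(b\), and the sample points are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative convex function: f(x) = 0.5 * x^T Q x + b^T x with Q positive semidefinite
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
Q = A.T @ A                      # Q = A^T A is positive semidefinite by construction
b = rng.normal(size=3)

f = lambda x: 0.5 * x @ Q @ x + b @ x
grad = lambda x: Q @ x + b       # gradient of f
hess = lambda x: Q               # Hessian of f (constant for a quadratic)

x1, x2 = rng.normal(size=3), rng.normal(size=3)

# First-order condition: f(x2) >= f(x1) + grad(x1)^T (x2 - x1)
print(f(x2) >= f(x1) + grad(x1) @ (x2 - x1))           # True

# Second-order condition: all eigenvalues of the Hessian are >= 0
print(np.all(np.linalg.eigvalsh(hess(x1)) >= -1e-12))  # True (up to numerical tolerance)
```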
Solution approach
The search for zeros of \(\nabla f\) leads directly to the method illustrated on the right in the above figure: start at any point \(x_0\) and use the gradient value \(\nabla f(x_0)\) together with its rate of change \(\nabla^2 f(x_0)\) to estimate where the gradient vanishes. With this estimate \(x_1\) of the zero, repeat the procedure until the actual zero \(x^*\) is found.
From \(\nabla f(x_0) + \nabla^2 f(x_0)[x_1 - x_0] \stackrel{!}{=} 0\), the sequence of estimates derived from \(x_0\) is given by $$ x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k) ~~~~ k=0,1,\ldots $$
This method is also known as Newton's method. Due to the positive curvature, the Newton step is a descent direction; combined with a suitable step-size rule, each iterate \(x_{k+1}\) attains a lower function value \(f(x_{k+1})\) than its predecessor \(x_k\), and the method converges to the optimum \(x^*\) [1, p. 484]. It can also be modified for optimization problems with constraints.
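The iteration can be written down in a few lines. Below is a minimal sketch, assuming NumPy; the test function \(f(x) = \sum_k \left(e^{x_k} + \tfrac{1}{2}x_k^2\right)\), the starting point, and the tolerance are illustrative choices rather than part of the original text.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: iterate x_{k+1} = x_k - hess(x_k)^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # solve a linear system instead of inverting
        x = x - step
        if np.linalg.norm(grad(x)) < tol:         # stop when the gradient (nearly) vanishes
            break
    return x

# Example: f(x) = sum_k exp(x_k) + 0.5 * x_k^2, a smooth convex function
grad = lambda x: np.exp(x) + x             # gradient of f
hess = lambda x: np.diag(np.exp(x) + 1.0)  # Hessian of f (diagonal here)

x_star = newton(grad, hess, x0=[2.0, -1.0, 0.5])
print(x_star, np.linalg.norm(grad(x_star)))  # gradient norm close to zero at the minimum
```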
Example
Many important functions are convex and can therefore be optimized efficiently. The examples in the following non-exhaustive list are taken from [1, pp. 71–73]; a short numerical check of the defining inequality for some of them follows after the table.
Name | Function | Domain | Application in modelling |
---|---|---|---|
Exponential function | \(e^{ax}\) | \(x\in \mathbb{R}, a \in \mathbb{R}\) | Growth processes and failure probabilities |
Neg. logarithm | \(-\log x\) | \(x > 0\) | Probability distributions |
Linear functional | \(c^T x\) | \(x\in \mathbb{R}^n, c \in \mathbb{R}^n\) | Combination of effects |
Norm | \(\|x\|_p\) | \(x \in \mathbb{R}^n, p \ge 1\) | Discrepancies and distances |
Neg. entropy | \(\sum_{k=1}^n x_k\log x_k \) | \(x_k > 0\) | Comparison of probability distributions |
Neg. geom. mean | \( -\left(\prod_{k=1}^n x_k\right)^{1/n}\) | \( x\in \mathbb{R}^n, x_k >0\) | Growth processes and indicator functions |
Maximum | \(\max (x_1, \ldots, x_n)\) | \( x\in \mathbb{R}^n\) | Disproportionate impacts |
Log-sum-exp (softmax) | \(\log \left( \sum_{k=1}^n e^{x_k}\right)\) | \( x\in \mathbb{R}^n\) | Machine learning |
Neg. logdet | \(- \log \det X\) | \(X\) positive definite | Covariance matrices and energies |
Sum of eigenvalues | \( \sum_{k=1}^n \lambda_k(X)\) | \(X\) positive semidefinite | Uncertainties and distributional properties |
Matrix fractional | \(x^T P^{-1}x\) | \( x\in \mathbb{R}^n, P\) positive definite | Normal distributions and anisotropic distances |
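As a quick sanity check, the defining inequality can be verified numerically for some of the listed functions. The following is a minimal sketch assuming NumPy; the selected functions, test points, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# A few convex functions from the table (positive entries where required)
functions = {
    "norm (p=2)":   lambda x: np.linalg.norm(x),
    "log-sum-exp":  lambda x: np.log(np.sum(np.exp(x))),
    "neg. entropy": lambda x: np.sum(x * np.log(x)),   # requires x_k > 0
    "maximum":      lambda x: np.max(x),
}

for name, f in functions.items():
    x1, x2 = rng.uniform(0.1, 3.0, size=5), rng.uniform(0.1, 3.0, size=5)  # positive test points
    theta = rng.uniform()
    lhs = theta * f(x1) + (1 - theta) * f(x2)          # chord value
    rhs = f(theta * x1 + (1 - theta) * x2)             # function value on the segment
    print(f"{name}: chord >= function? {lhs >= rhs - 1e-12}")
```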
Construction rules
The above examples can be combined to generate new, more expressive convex functions. Positively weighted sums \(w_1f_1(x)+w_2f_2(x)\) of convex functions \(f_1(x), f_2(x)\) are convex; the same applies to the pointwise maximum \(\max( f_1(x),f_2(x))\) and to the composition with an affine transformation, \(f_1(Ax+b)\). Modern modelling frameworks like CVXPY [3] leverage these relationships: starting from demonstrably convex atomic functions and convexity-preserving composition rules, even seemingly complex functions can often be reduced to combinations of elementary convex functions and then solved [4].
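As a small illustration of these rules, the following sketch (assuming CVXPY and NumPy are installed) builds a convex objective from the atoms and rules above and minimizes it; the data \(A\), \(b\), the weights, and the particular combination of terms are arbitrary illustrative choices.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

x = cp.Variable(5)

# Convex by the construction rules: a norm composed with an affine map,
# plus a positively weighted norm, plus a maximum of affine functions (the entries of x and 0).
objective = cp.norm(A @ x - b, 2) + 0.5 * cp.norm(x, 1) + cp.maximum(cp.max(x), 0)

problem = cp.Problem(cp.Minimize(objective))
problem.solve()

print(problem.status, problem.value)
print(x.value)
```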
Sources
[1] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.
[2] Nesterov, Y. (2013). Introductory Lectures on Convex Optimization: A Basic Course. Berlin Heidelberg: Springer Science & Business Media.
[3] Diamond, S., & Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83), 1–5.