Methods

1 General formulation

\begin{split} & \min_{x \in \mathbb{R}^n} f(x)\\ \text{s.t. } g_i(x) \leq& 0, \; i = 1,\ldots,m\\ h_j(x) =& 0, \; j = 1,\ldots,k\\ \end{split}

Some necessary or/and sufficient conditions are known (See Optimality conditions. KKT and Convex optimization problem.

In fact, there might be very challenging to recognize the convenient form of optimization problem.
Analytical solution of KKT could be inviable.

1.1 Iterative methods

Typically, the methods generate an infinite sequence of approximate solutions

\{x_t\},

which for a finite number of steps (or better - time) converges to an optimal (at least one of the optimal) solution x_*.

Illustration of iterative method approaches to the solution x^*

def GeneralScheme(x, epsilon):
    while not StopCriterion(x, epsilon):
        OracleResponse = RequestOracle(x)
        x = NextPoint(x, OracleResponse)
    return x

1.2 Oracle conception

Depending on the maximum order of derivative available from the oracle we call the oracles as zero order, first order, second order oravle and etc.

2 Unsolvability of numerical optimization problem

In general, optimization problems are unsolvable. ¯\(ツ)/¯

Consider the following simple optimization problem of a function over unit cube:

\begin{split} & \min_{x \in \mathbb{R}^n} f(x)\\ \text{s.t. } & x \in \mathbb{C}^n \end{split}

We assume, that the objective function f (\cdot) : \mathbb{R}^n \to \mathbb{R} is Lipschitz continuous on \mathbb{B}^n:

| f (x) − f (y) | \leq L \| x − y \|_{\infty} \forall x,y \in \mathbb{C}^n,

with some constant L (Lipschitz constant). Here \mathbb{C}^n - the n-dimensional unit cube

\mathbb{C}^n = \{x \in \mathbb{R}^n \mid 0 \leq x_i \leq 1, i = 1, \ldots, n\}

Our goal is to find such \tilde{x}: \vert f(\tilde{x}) - f^*\vert \leq \varepsilon for some positive \varepsilon. Here f^* is the global minima of the problem. Uniform grid with p points on each dimension guarantees at least this quality:

\| \tilde{x} − x_* \|_{\infty} \leq \frac{1}{2p},

which means, that

|f (\tilde{x}) − f (x_*)| \leq \frac{L}{2p}

Our goal is to find the p for some \varepsilon. So, we need to sample \left(\frac{L}{2 \varepsilon}\right)^n points, since we need to measure function in p^n points. Doesn’t look scary, but if we’ll take L = 2, n = 11, \varepsilon = 0.01, computations on the modern personal computers will take 31,250,000 years.

2.1 Stopping rules

Argument closeness:

\| x_k - x_* \|_2 < \varepsilon
Function value closeness:

\| f_k - f^* \|_2 < \varepsilon
Closeness to a critical point

\| f'(x_k) \|_2 < \varepsilon

But x_* and f^* = f(x_*) are unknown!

Sometimes, we can use the trick:

\|x_{k+1} - x_k \| = \|x_{k+1} - x_k + x_* - x_* \| \leq \|x_{k+1} - x_* \| + \| x_k - x_* \| \leq 2\varepsilon

Note: it’s better to use relative changing of these values, i.e. \dfrac{\|x_{k+1} - x_k \|_2}{\| x_k \|_2}.

Example

Suppose, you are trying to estimate the vector x_{true} with some approximation x_{approx}. One can choose between two relative errors:

\dfrac{\|x_{approx} - x_{true}\|}{\|x_{approx}\|} \quad \dfrac{\|x_{approx} - x_{true}\|}{\|x_{true}\|}

If both x_{approx} and x_{true} are close to each other, then the difference between them is small, while if your approximation is far from the truth (say, x_{approx} = 10x_{true} or x_{approx} = 0.01 x_{true} they differ drastically).

2.2 Local nature of the methods

Illustration of the idea of locality in black-box optimization

Methods

1 General formulation

1.1 Iterative methods

1.2 Oracle conception

2 Unsolvability of numerical optimization problem

2.1 Stopping rules

2.2 Local nature of the methods

3 Contents of the chapter

Gradient descent

Newton method

Heavy Ball Method

Quasi Newton methods

Subgradient descent

Projected subgradient descent

Linear Programming and simplex algorithm

Mirror descent

Automatic differentiation

Stochastic gradient descent

Stochastic average gradient

ADAM: A Method for Stochastic Optimization

Lookahead Optimizer: k steps forward, 1 step back

Proximal gradient method

Bee algorithm

Binary search

Conjugate gradients

Golden search

Inexact Line Search

Natural gradient descent

Nelder–Mead

Simulated annealing

Successive parabolic interpolation