Summary

A classical problem of function minimization is considered:

$$\min_{x \in \mathbb{R}^n} f(x).$$

  • The bottleneck (for almost all gradient methods) is the choice of the stepsize, which can lead to dramatic differences in the method's behavior.
  • One of the theoretical suggestions: choose the stepsize inversely proportional to the gradient Lipschitz constant, $\alpha_k \sim \dfrac{1}{L}$.
  • In huge-scale applications the cost of an iteration is usually dominated by the cost of the gradient calculation (at least $\mathcal{O}(n)$ for $x \in \mathbb{R}^n$).
  • If the function has a Lipschitz-continuous gradient, the method can be written as follows (see the sketch after this list):

$$x_{k+1} = x_k - \dfrac{1}{L}\nabla f(x_k).$$
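As a concrete illustration of this update rule, here is a minimal Python sketch of gradient descent with the $\dfrac{1}{L}$ stepsize; the quadratic test function and its data are assumptions made for the example, not part of the original text.

```python
import numpy as np

# Assumed test problem: f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b

# For this quadratic the gradient Lipschitz constant is the largest eigenvalue of A.
L = np.linalg.eigvalsh(A).max()

x = np.zeros(2)
for k in range(100):
    x = x - (1.0 / L) * grad_f(x)        # x_{k+1} = x_k - (1/L) * grad f(x_k)

print("approximate minimizer:", x)
print("exact minimizer:      ", np.linalg.solve(A, b))
```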

Intuition

Direction of local steepest descent

Let's consider a linear approximation of the differentiable function $f$ along some direction $h$, $\|h\|_2 = 1$:

$$f(x + \eta h) = f(x) + \eta \langle \nabla f(x), h \rangle + o(\eta).$$

We want $h$ to be a descent direction:

$$f(x + \eta h) < f(x),$$

$$f(x) + \eta \langle \nabla f(x), h \rangle + o(\eta) < f(x),$$

and, passing to the limit as $\eta \to 0$:

$$\langle \nabla f(x), h \rangle \leq 0.$$

Also, from the Cauchy–Bunyakovsky–Schwarz inequality:

$$|\langle \nabla f(x), h \rangle| \leq \|\nabla f(x)\|_2 \|h\|_2 \quad \Rightarrow \quad \langle \nabla f(x), h \rangle \geq -\|\nabla f(x)\|_2 \|h\|_2 = -\|\nabla f(x)\|_2.$$

Thus, the direction of the antigradient

$$h = -\dfrac{\nabla f(x)}{\|\nabla f(x)\|_2}$$

gives the direction of the steepest local decrease of the function $f$.

The resulting method is

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$
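A small numerical check of this claim (purely illustrative; the test function below is an assumption): sample random unit directions and compare their directional derivatives with the one along the normalized antigradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of the assumed test function f(x) = ||x||^2 + sin(x_1).
    return 2 * x + np.array([np.cos(x[0]), 0.0])

x = np.array([1.0, -2.0])
g = grad_f(x)
h_star = -g / np.linalg.norm(g)                  # normalized antigradient

# Directional derivative <grad f(x), h> for many random unit directions h.
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

print("most negative sampled slope:", (dirs @ g).min())
print("slope along the antigradient:", g @ h_star)   # equals -||grad f(x)||_2
```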

Gradient flow ODE

Let's consider the following ODE, which is referred to as the gradient flow equation:

$$\dfrac{dx}{dt} = -\nabla f(x(t)),$$

and discretize it on a uniform grid with step $\alpha$:

$$\dfrac{x_{k+1} - x_k}{\alpha} = -\nabla f(x_k),$$

where $x_k \equiv x(t_k)$ and $\alpha = t_{k+1} - t_k$ is the grid step.

From here we get the expression for $x_{k+1}$:

$$x_{k+1} = x_k - \alpha \nabla f(x_k),$$

which is exactly gradient descent.
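To make the connection concrete, here is a short sketch (on an assumed quadratic test problem) in which the forward-Euler discretization of the gradient flow is written out explicitly and seen to coincide with the gradient descent recursion.

```python
import numpy as np

A = np.diag([1.0, 10.0])              # assumed quadratic: f(x) = 0.5 * x^T A x

def grad_f(x):
    return A @ x

alpha = 0.05                          # grid step of the discretization
x_euler = np.array([1.0, 1.0])
x_gd = np.array([1.0, 1.0])

for k in range(50):
    x_euler = x_euler + alpha * (-grad_f(x_euler))   # forward Euler for dx/dt = -grad f(x)
    x_gd = x_gd - alpha * grad_f(x_gd)               # gradient descent with stepsize alpha

print(np.allclose(x_euler, x_gd))     # True: the two recursions are identical
```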

Necessary local minimum condition

$$\nabla f(x) = 0 \quad \Leftrightarrow \quad x = x - \eta \nabla f(x) \quad (\eta \neq 0),$$

so a point with zero gradient is exactly a fixed point of the gradient descent map $x \mapsto x - \eta \nabla f(x)$. This is, surely, not a proof at all, but some kind of intuitive explanation.
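A one-line numerical illustration of this fixed-point view (on an assumed quadratic whose minimizer is known in closed form):

```python
import numpy as np

A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)               # minimizer of f(x) = 0.5 x^T A x - b^T x

eta = 0.1
gd_map = x_star - eta * (A @ x_star - b)     # gradient descent map applied at x*
print(np.allclose(gd_map, x_star))           # True: x* is a fixed point
```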

Minimizer of Lipschitz parabola

Some general facts about Lipschitz properties are needed for the explanation. If a function $f$ is continuously differentiable and its gradient satisfies the Lipschitz condition with constant $L$, then $\forall x, y \in \mathbb{R}^n$:

$$|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \leq \dfrac{L}{2} \|y - x\|^2,$$

which geometrically means that if we fix some point $x_0 \in \mathbb{R}^n$ and define two parabolas:

$$\phi_1(x) = f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle - \dfrac{L}{2} \|x - x_0\|^2,$$

$$\phi_2(x) = f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle + \dfrac{L}{2} \|x - x_0\|^2.$$

Then

$$\phi_1(x) \leq f(x) \leq \phi_2(x) \quad \forall x \in \mathbb{R}^n.$$

Now, if we have a global upper bound on the function in the form of a parabola, we can try to go directly to its minimum:

$$\nabla \phi_2(x^*) = 0 \quad \Leftrightarrow \quad \nabla f(x_0) + L(x^* - x_0) = 0 \quad \Leftrightarrow \quad x^* = x_0 - \dfrac{1}{L}\nabla f(x_0).$$

This way leads to the $\dfrac{1}{L}$ stepsize choice. However, the constant $L$ is often not known.

But if the function is twice continuously differentiable and its gradient has Lipschitz constant $L$, we can estimate this constant via the Hessian, $\forall x \in \mathbb{R}^n$:

$$\|\nabla^2 f(x)\| \leq L,$$

or

$$-L I_n \preceq \nabla^2 f(x) \preceq L I_n.$$
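The sketch below (an illustration on an assumed logistic-type test function) obtains $L$ from the Hessian spectral norm and then checks the two-parabola sandwich numerically; all constants are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([1.0, -2.0, 0.5])                 # assumed data vector

def f(x):
    return np.logaddexp(0.0, a @ x)            # f(x) = log(1 + exp(a^T x))

def grad_f(x):
    return a / (1.0 + np.exp(-(a @ x)))        # sigmoid(a^T x) * a

def hess_f(x):
    s = 1.0 / (1.0 + np.exp(-(a @ x)))
    return s * (1 - s) * np.outer(a, a)

# ||hess f(x)|| = s(1-s)||a||^2 is maximized at a^T x = 0, so the Hessian at x = 0
# yields the gradient Lipschitz constant L = ||a||^2 / 4 for this function.
L = np.linalg.eigvalsh(hess_f(np.zeros(3))).max()

# Check phi_1(x) <= f(x) <= phi_2(x) at random points around a fixed x_0.
x0 = rng.normal(size=3)
f0, g0 = f(x0), grad_f(x0)
worst = -np.inf
for _ in range(1000):
    x = x0 + rng.normal(size=3)
    lin = f0 + g0 @ (x - x0)
    quad = 0.5 * L * np.sum((x - x0) ** 2)
    worst = max(worst, abs(f(x) - lin) - quad)

print("estimated L:", L)
print("max violation of the sandwich (should be <= 0):", worst)
```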

Stepsize choosing strategies

The stepsize strategy significantly affects convergence. General line search algorithms can help with choosing this scalar parameter.

Constant stepsize

For $f \in C^{1,1}_L$ (continuously differentiable with $L$-Lipschitz gradient) and a fixed stepsize $\alpha$:

$$x_{k+1} = x_k - \alpha \nabla f(x_k),$$

$$f(x_k) - f(x_{k+1}) \geq \alpha \left(1 - \dfrac{L\alpha}{2}\right) \|\nabla f(x_k)\|^2.$$

Choosing $\alpha = \dfrac{1}{L}$, we have:

$$f(x_k) - f(x_{k+1}) \geq \dfrac{1}{2L} \|\nabla f(x_k)\|^2.$$
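A quick numerical check of this per-iteration decrease guarantee, on an assumed quadratic so that $L$ is known exactly:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 2.0]])                  # assumed quadratic f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()             # gradient Lipschitz constant

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

x = np.array([5.0, -3.0])
for k in range(20):
    g = grad_f(x)
    x_next = x - (1.0 / L) * g
    decrease = f(x) - f(x_next)
    assert decrease >= (1.0 / (2 * L)) * (g @ g) - 1e-12   # guaranteed progress
    x = x_next

print("the 1/(2L) * ||grad f||^2 decrease bound held at every iteration")
```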

Fixed sequence

A typical example of such a pre-defined schedule is the diminishing sequence $\alpha_k = \dfrac{1}{\sqrt{k+1}}$ (a common textbook choice, given here as an example). The latter two strategies are the simplest in terms of implementation and analysis, but they often do not work well in practice, because the geometry of the function is not known in advance.
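For comparison, a short sketch (on an assumed ill-conditioned quadratic) running the constant $1/L$ stepsize next to the pre-defined sequence $\alpha_k = 1/\sqrt{k+1}$; both the matrix and the iteration count are choices made for the illustration.

```python
import numpy as np

A = np.diag([1.3, 7.5])                       # assumed ill-conditioned quadratic, x* = 0
L = np.linalg.eigvalsh(A).max()

def grad_f(x):
    return A @ x

x_const = np.array([1.0, 1.0])
x_sched = np.array([1.0, 1.0])
for k in range(200):
    x_const = x_const - (1.0 / L) * grad_f(x_const)               # constant stepsize
    x_sched = x_sched - (1.0 / np.sqrt(k + 1)) * grad_f(x_sched)  # fixed sequence

print("constant 1/L stepsize, ||x - x*|| :", np.linalg.norm(x_const))
print("sequence 1/sqrt(k+1),  ||x - x*|| :", np.linalg.norm(x_sched))
```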

Exact line search aka steepest descent

$$\alpha_k = \arg\min_{\alpha \geq 0} f(x_k - \alpha \nabla f(x_k))$$

This approach is more theoretical than practical. It allows convergence analysis, but exact line search is often difficult or expensive when each function evaluation takes too long or costs too much.

An interesting theoretical property of this method is that each successive gradient (and hence descent direction) is orthogonal to the previous one:

Optimality conditions:

$$\nabla f(x_{k+1})^\top \nabla f(x_k) = 0$$
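For a quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ the exact line search step has a closed form, $\alpha_k = \dfrac{\|\nabla f(x_k)\|^2}{\nabla f(x_k)^\top A \nabla f(x_k)}$, which makes the orthogonality property easy to verify numerically; the particular matrix and vector below are assumptions for the illustration.

```python
import numpy as np

A = np.array([[3.0, 0.4],
              [0.4, 1.0]])                     # assumed symmetric positive definite matrix
b = np.array([1.0, 2.0])

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
for k in range(10):
    g = grad_f(x)
    alpha = (g @ g) / (g @ A @ g)              # exact minimizer of f(x - alpha * g) over alpha
    x_next = x - alpha * g
    # Optimality of the 1-D search <=> consecutive gradients are orthogonal.
    print(k, " grad f(x_{k+1})^T grad f(x_k) =", grad_f(x_next) @ g)
    x = x_next
```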

Goldstein-Armijo

This strategy of inexact line search works well in practice and has the following geometric interpretation.

Let's consider the following scalar function at a specific point $x_k$:

$$\phi(\alpha) = f(x_k - \alpha \nabla f(x_k)), \quad \alpha \geq 0,$$

and consider the first-order approximation of $\phi(\alpha)$:

$$\phi(\alpha) \approx f(x_k) - \alpha \|\nabla f(x_k)\|^2.$$

Let's also consider two linear scalar functions $\phi_1(\alpha)$ and $\phi_2(\alpha)$:

$$\phi_1(\alpha) = f(x_k) - \beta_1 \alpha \|\nabla f(x_k)\|^2$$

and

$$\phi_2(\alpha) = f(x_k) - \beta_2 \alpha \|\nabla f(x_k)\|^2,$$

Note that the Goldstein-Armijo conditions determine the location of the function $\phi(\alpha)$ between $\phi_1(\alpha)$ and $\phi_2(\alpha)$:

$$\phi_1(\alpha) \leq \phi(\alpha) \leq \phi_2(\alpha).$$

Typically, we choose $\beta_1 = \rho$ and $\beta_2 = 1 - \rho$, with $\rho \in (0.5, 1)$.
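In practice the Armijo (sufficient decrease) part of these conditions is usually enforced with a simple backtracking loop. The sketch below is a generic backtracking implementation; the test function, starting stepsize, and backtracking factor are assumptions for the illustration, not values prescribed by the text.

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, beta=1e-4, shrink=0.5, alpha0=1.0):
    """Return a stepsize satisfying the sufficient decrease condition
    f(x - alpha * g) <= f(x) - beta * alpha * ||g||^2."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - beta * alpha * (g @ g):
        alpha *= shrink                       # step too long: shrink and retry
    return alpha

# Example usage on an assumed test function (the Rosenbrock function).
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                     200 * (x[1] - x[0]**2)])

x = np.array([-1.2, 1.0])
for k in range(2000):
    x = x - backtracking_armijo(f, grad_f, x) * grad_f(x)

print("final point:", x, "  f(x):", f(x))
```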

Convergence analysis

Quadratic case

Bounds

| Conditions | $\Vert f(x_k) - f(x^*)\Vert \leq$ | Type of convergence | $\Vert x_k - x^* \Vert \leq$ |
|---|---|---|---|
| Convex, Lipschitz-continuous function ($G$) | $\mathcal{O}\left(\dfrac{1}{\sqrt{k}} \right) \; \dfrac{GR}{\sqrt{k}}$ | Sublinear | |
| Convex, Lipschitz-continuous gradient ($L$) | $\mathcal{O}\left(\dfrac{1}{k} \right) \; \dfrac{LR^2}{k}$ | Sublinear | |
| $\mu$-Strongly convex, Lipschitz-continuous gradient ($L$) | | Linear | $(1 - \eta \mu)^k R^2$ |
| $\mu$-Strongly convex, Lipschitz-continuous hessian ($M$) | | Locally linear, $R < \overline{R}$ | $\dfrac{\overline{R}R}{\overline{R} - R} \left( 1 - \dfrac{2\mu}{L+3\mu}\right)^k$ |

  • $R = \Vert x_0 - x^* \Vert$ - initial distance to the solution
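As a sanity check of the linear-rate row (strongly convex function with Lipschitz gradient), the sketch below runs gradient descent on an assumed strongly convex quadratic and prints the contraction ratio $\Vert x_{k+1} - x^* \Vert / \Vert x_k - x^* \Vert$, which settles at a constant below one.

```python
import numpy as np

# Assumed strongly convex quadratic f(x) = 0.5 x^T A x with mu = 1, L = 10, x* = 0.
A = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
x_star = np.zeros(2)

x = np.array([3.0, -4.0])
eta = 1.0 / L                                  # constant stepsize
for k in range(30):
    x_next = x - eta * (A @ x)
    ratio = np.linalg.norm(x_next - x_star) / np.linalg.norm(x - x_star)
    if k % 5 == 0:
        print(f"k={k:2d}  contraction ratio = {ratio:.3f}")
    x = x_next
# The ratio approaches 1 - eta * mu = 0.9, i.e. linear (geometric) convergence.
```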
