Intuition

Newton’s method for finding the roots of an equation

Consider the function $\varphi(x): \mathbb{R} \to \mathbb{R}$. Let there be an equation $\varphi(x^*) = 0$. Consider a linear approximation of the function $\varphi(x)$ near the solution ($x^* - x = \Delta x$):

$$ \varphi(x^*) = \varphi(x + \Delta x) \approx \varphi(x) + \varphi'(x)\Delta x. $$

We get an approximate equation:

$$ \varphi(x) + \varphi'(x)\Delta x = 0. $$

We can assume that the solution of this equation, $\Delta x = -\dfrac{\varphi(x)}{\varphi'(x)}$, will be close to the optimal $\Delta x^* = x^* - x$.

We get an iterative scheme:

$$ x_{k+1} = x_k - \left[\varphi'(x_k)\right]^{-1} \varphi(x_k). $$
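
As a tiny illustration, here is this scheme in Python for a made-up example $\varphi(x) = x^2 - 2$ (whose root is $\sqrt{2}$); the function choice is ours, not from the text.

```python
# Newton's root-finding scheme: x_{k+1} = x_k - phi(x_k) / phi'(x_k)
phi  = lambda x: x**2 - 2     # toy example: root at sqrt(2)
dphi = lambda x: 2 * x

x = 1.0                       # starting point
for _ in range(6):
    x = x - phi(x) / dphi(x)
print(x)                      # ~1.4142135623730951
```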

This reasoning can be applied to the unconstrained minimization problem of a function $f(x)$ by writing down the necessary extremum condition:

$$ f'(x^*) = 0. $$

Here $\varphi(x) = f'(x)$. Thus, we get the Newton optimization method in its classic form:

$$ x_{k+1} = x_k - \left[ f''(x_k) \right]^{-1} f'(x_k). $$

With the only clarification that in the multidimensional case $x \in \mathbb{R}^n$ we have $f'(x) = \nabla f(x) \in \mathbb{R}^n$ and $f''(x) = \nabla^2 f(x) \in \mathbb{R}^{n \times n}$.
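
Here is a minimal NumPy sketch of this iteration; the toy function (its gradient and Hessian) is made up for illustration and is not taken from the text.

```python
import numpy as np

def newton(grad, hess, x0, n_iter=20, tol=1e-10):
    """Classic Newton iteration: x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve f''(x_k) d = f'(x_k) instead of forming the inverse explicitly
        d = np.linalg.solve(hess(x), g)
        x = x - d
    return x

# Toy problem (made up): f(x) = x1^4 + x2^2 - x1*x2
grad = lambda x: np.array([4 * x[0]**3 - x[1], 2 * x[1] - x[0]])
hess = lambda x: np.array([[12 * x[0]**2, -1.0], [-1.0, 2.0]])
print(newton(grad, hess, x0=[1.0, 1.0]))
```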

Second order Taylor approximation of the function

Now suppose we are given a function $f(x)$ and a certain point $x_k$. Consider the quadratic approximation of this function near $x_k$:

$$ f^{II}_{x_k}(x) = f(x_k) + \langle f'(x_k), x - x_k \rangle + \frac{1}{2} \langle f''(x_k)(x - x_k), x - x_k \rangle. $$

The idea of the method is to find the point $x_{k+1}$ that minimizes the function $f^{II}_{x_k}(x)$, i.e. $\nabla f^{II}_{x_k}(x_{k+1}) = 0$:

$$ \nabla f^{II}_{x_k}(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1} - x_k) = 0, $$

$$ x_{k+1} = x_k - \left[ f''(x_k) \right]^{-1} f'(x_k). $$

Let us immediately note the limitations related to the necessity of the Hessian being non-singular (for the method to exist), as well as positive definite (for the convergence guarantee).
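
To make the connection explicit, here is a small check that the minimizer of the quadratic model is exactly the Newton step; the numbers standing in for $f'(x_k)$ and $f''(x_k)$ are made up for this sketch.

```python
import numpy as np

xk  = np.array([1.0, 1.0])
g_k = np.array([3.0, 1.0])                   # stands in for f'(x_k)
H_k = np.array([[12.0, -1.0], [-1.0, 2.0]])  # stands in for f''(x_k), positive definite

def quad_model(x):
    """Quadratic model <f'(x_k), x - x_k> + 0.5 <f''(x_k)(x - x_k), x - x_k>
    (the constant f(x_k) is omitted, it does not affect the minimizer)."""
    d = x - xk
    return g_k @ d + 0.5 * d @ H_k @ d

# Newton step = minimizer of the model: solve f''(x_k)(x - x_k) = -f'(x_k)
x_next = xk - np.linalg.solve(H_k, g_k)

print(np.allclose(g_k + H_k @ (x_next - xk), 0.0))       # model gradient vanishes at x_next
print(quad_model(x_next) <= quad_model(x_next + 0.1))    # x_next beats a perturbed point
```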

Convergence

Let’s try to get an estimate of how quickly the classical Newton method converges. We will introduce the necessary assumptions and constants as they are needed in the derivation (to illustrate the methodology of obtaining such estimates).

$$
\begin{aligned}
x_{k+1} - x^* &= x_k - \left[ f''(x_k) \right]^{-1} f'(x_k) - x^* = x_k - x^* - \left[ f''(x_k) \right]^{-1} f'(x_k) \\
&= x_k - x^* - \left[ f''(x_k) \right]^{-1} \int_0^1 f''(x^* + \tau(x_k - x^*)) (x_k - x^*) \, d\tau \\
&= \left( I - \left[ f''(x_k) \right]^{-1} \int_0^1 f''(x^* + \tau(x_k - x^*)) \, d\tau \right) (x_k - x^*) \\
&= \left[ f''(x_k) \right]^{-1} \left( f''(x_k) - \int_0^1 f''(x^* + \tau(x_k - x^*)) \, d\tau \right) (x_k - x^*) \\
&= \left[ f''(x_k) \right]^{-1} G_k (x_k - x^*)
\end{aligned}
$$

Used here is the Newton–Leibniz formula $f'(x_k) - f'(x^*) = \int_0^1 f''(x^* + \tau(x_k - x^*)) (x_k - x^*) \, d\tau$ together with $f'(x^*) = 0$. Let’s try to estimate the size of $G_k$:

$$
\begin{aligned}
\Vert G_k \Vert &= \left\Vert f''(x_k) - \int_0^1 f''(x^* + \tau(x_k - x^*)) \, d\tau \right\Vert = \left\Vert \int_0^1 \left( f''(x_k) - f''(x^* + \tau(x_k - x^*)) \right) d\tau \right\Vert \\
&\leq \int_0^1 \left\Vert f''(x_k) - f''(x^* + \tau(x_k - x^*)) \right\Vert d\tau \leq \int_0^1 M (1 - \tau) \Vert x_k - x^* \Vert \, d\tau = \frac{r_k}{2} M,
\end{aligned}
$$

where $r_k = \Vert x_k - x^* \Vert$ and $M$ is the Lipschitz constant of the Hessian, $\Vert f''(x) - f''(y) \Vert \leq M \Vert x - y \Vert$.

So, we have:

$$ r_{k+1} \leq \left\Vert \left[ f''(x_k) \right]^{-1} \right\Vert \cdot \frac{M r_k}{2} \cdot r_k. $$

Quadratic convergence is already in sight; all that remains is to bound the norm of the inverse Hessian.

Because of the Hessian’s Lipschitz continuity and symmetry:

$$ f''(x_k) - f''(x^*) \succeq -M r_k I_n, $$

$$ f''(x_k) \succeq f''(x^*) - M r_k I_n \succeq \ell I_n - M r_k I_n = (\ell - M r_k) I_n. $$

So, $\left\Vert \left[ f''(x_k) \right]^{-1} \right\Vert \leq (\ell - M r_k)^{-1}$ (here we already have to require $f''(x^*) \succeq \ell I_n$ with $\ell > 0$, and to restrict $r_k < \frac{\ell}{M}$ for such estimates to hold), which gives

$$ r_{k+1} \leq \frac{M r_k^2}{2 (\ell - M r_k)}. $$

The convergence condition $r_{k+1} < r_k$ imposes an additional condition on $r_k$: $r_k < \frac{2 \ell}{3 M}$.

Thus, we have an important result: for a function with a Lipschitz continuous, positive definite Hessian, Newton’s method converges quadratically to the solution in a neighbourhood of it $\left( \Vert x_0 - x^* \Vert < \frac{2 \ell}{3 M} \right)$.
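
A quick numerical sanity check of this rate on a toy problem chosen for this sketch, $f(x) = e^x - 2x$ with minimizer $x^* = \ln 2$: the printed error is roughly squared at every iteration.

```python
import numpy as np

# f(x) = exp(x) - 2x, f'(x) = exp(x) - 2, f''(x) = exp(x), minimizer x* = ln 2
x_star = np.log(2.0)
x = 1.0                                     # starting point inside the convergence region
for k in range(6):
    print(f"k={k}  error={abs(x - x_star):.3e}")
    x = x - (np.exp(x) - 2.0) / np.exp(x)   # Newton step
```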

Theorem

Let $f(x)$ be a strongly convex, twice continuously differentiable function on $\mathbb{R}^n$, whose second derivative satisfies $\ell I_n \preceq f''(x) \preceq L I_n$. Then Newton’s method with a constant step locally converges to the solution of the problem with superlinear speed. If, in addition, the Hessian is Lipschitz continuous, then this method converges locally to $x^*$ with quadratic speed.

Examples

Let’s look at some interesting features of Newton’s method. Let’s first apply it to the function

Summary

It’s nice:

  • quadratic convergence near the solution
  • affine invariance
  • the parameters have little effect on the convergence rate

It’s not nice:

  • it is necessary to store the Hessian at each iteration: $\mathcal{O}(n^2)$ memory
  • it is necessary to solve linear systems: $\mathcal{O}(n^3)$ operations
  • the Hessian can be degenerate at $x^*$
  • the Hessian may not be positive definite, so the direction $-\left[ f''(x) \right]^{-1} f'(x)$ may not be a descent direction (see the sketch after this list)
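
One common workaround for the last point (a choice of this sketch, not something prescribed by the text) is to add a multiple of the identity to the Hessian until a Cholesky factorization succeeds, which guarantees a descent direction.

```python
import numpy as np

def newton_direction_pd(H, g, tau0=1e-3, max_tries=30):
    """Return a descent direction by adding tau*I to H until it is positive
    definite (checked via Cholesky). A common safeguard, not part of the
    classic method above."""
    tau = 0.0
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            # Solve (H + tau*I) d = -g using the Cholesky factor
            y = np.linalg.solve(L, -g)
            return np.linalg.solve(L.T, y)
        except np.linalg.LinAlgError:
            tau = max(2 * tau, tau0)
    return -g  # fall back to the plain (anti-)gradient direction

# Indefinite Hessian example (made up): eigenvalues 3 and -1
H = np.array([[1.0, 2.0], [2.0, 1.0]])
g = np.array([1.0, 1.0])
d = newton_direction_pd(H, g)
print(g @ d < 0)   # True: d is a descent direction
```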

Possible directions

  • Newton’s damped method (adaptive stepsize); a sketch follows after this list
  • Quasi-Newton methods (we don’t calculate the Hessian, we build its estimate - BFGS)
  • Quadratic approximation of the function from a first-order oracle (superlinear convergence)
  • The combination of the Newton method and the gradient descent (interesting direction)
  • Higher order methods (most likely useless)
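
A minimal sketch of the damped Newton idea from the first item above: take the Newton direction, but choose the step size by backtracking (Armijo) line search. The constants used are common defaults, not values prescribed by the text.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, t0=1.0, beta=0.5, c=1e-4, n_iter=50):
    """Newton direction combined with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < 1e-10:
            break
        d = -np.linalg.solve(hess(x), g)          # Newton direction
        t = t0
        # Shrink the step until the sufficient-decrease condition holds
        while f(x + t * d) > f(x) + c * t * g @ d and t > 1e-12:
            t *= beta
        x = x + t * d
    return x

# Toy problem (made up): f(x) = x1^4 + x2^2 - x1*x2, started far from the minimum
f    = lambda x: x[0]**4 + x[1]**2 - x[0] * x[1]
grad = lambda x: np.array([4 * x[0]**3 - x[1], 2 * x[1] - x[0]])
hess = lambda x: np.array([[12 * x[0]**2, -1.0], [-1.0, 2.0]])
print(damped_newton(f, grad, hess, x0=[3.0, -2.0]))
```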

Code

Open In Colab