Quasi-Newton methods

1 Intuition

For the classic problem of unconstrained optimization f(x) \to \min\limits_{x \in \mathbb{R}^n}, the general scheme of an iterative method is written as:

x_{k+1} = x_k + \alpha_k s_k

In Newton's method, the direction s_k (the Newton direction) is determined by solving a linear system at each step:

s_k = - B_k\nabla f(x_k), \;\;\; B_k = f_{xx}^{-1}(x_k)

i.e., at each iteration it is necessary to compute the Hessian and the gradient and to solve a linear system.

Note that if we take the identity matrix B_k = I_n as B_k at each step, we recover exactly the gradient descent method.
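To make this concrete, here is a minimal sketch (on a toy quadratic of my own choosing) comparing the two choices of B_k: the identity, which gives the gradient descent direction, and the inverse Hessian, which gives the Newton direction. For a quadratic, one Newton step with \alpha_k = 1 lands exactly at the minimizer.

```python
import numpy as np

# Toy quadratic f(x) = 1/2 x^T A x - b^T x, so that
# grad f(x) = A x - b and the Hessian f_xx = A (an assumption for this demo).
A = np.array([[10.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

x = np.array([5.0, 5.0])

# Gradient descent direction: B_k = I_n.
s_gd = -np.eye(2) @ grad(x)

# Newton direction: B_k = f_xx^{-1}(x_k); computed by solving a linear
# system rather than inverting the Hessian explicitly.
s_newton = -np.linalg.solve(A, grad(x))

# For a quadratic, x_k + s_newton equals the minimizer x_* = A^{-1} b.
x_star = np.linalg.solve(A, b)
```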

The general scheme of quasi-Newton methods is based on choosing the matrix B_k so that it tends, in some sense, as k \to \infty to the true value of the inverse Hessian at the local optimum, f_{xx}^{-1}(x_*). Let us consider several schemes that update the matrix B_k iteratively:

B_{k+1} = B_k + \Delta B_k

Then, using a first-order Taylor approximation of the gradient, we get:

\nabla f(x_k) - \nabla f(x_{k+1}) \approx f_{xx}(x_{k+1}) (x_k - x_{k+1}).

Now let us require the updated matrix B_{k+1} to satisfy the secant equation:

\Delta x_k = B_{k+1} \Delta y_k, \text{ where } \;\; \Delta y_k = \nabla f(x_{k+1}) - \nabla f(x_k)

which leads to the following equation for the update \Delta B_k:

\Delta B_k \Delta y_k = \Delta x_k - B_k \Delta y_k
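As a sanity check on the secant condition, for a quadratic the first-order Taylor relation is exact, so \Delta y_k = f_{xx}\,\Delta x_k and the true inverse Hessian satisfies B \Delta y_k = \Delta x_k for any pair of points. A quick numeric sketch (with an arbitrarily chosen quadratic):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # Hessian of f(x) = 1/2 x^T A x
grad = lambda x: A @ x

x_k = rng.standard_normal(2)
x_k1 = rng.standard_normal(2)
dx = x_k1 - x_k                          # Delta x_k
dy = grad(x_k1) - grad(x_k)              # Delta y_k

# For a quadratic, dy = A dx holds exactly, so the exact inverse
# Hessian satisfies the secant equation B dy = dx.
B_exact = np.linalg.inv(A)
```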

2 Broyden method

The simplest option is to take a correction \Delta B_k of rank one. Then the correction can be sought in the form

\Delta B_k = \mu_k q_k q_k^\top.

where \mu_k is a scalar and q_k is a non-zero vector. Denote the right-hand side of the equation for \Delta B_k by \Delta z_k:

\Delta z_k = \Delta x_k - B_k \Delta y_k

We get:

\mu_k q_k q_k^\top \Delta y_k = \Delta z_k

\left(\mu_k \cdot q_k^\top \Delta y_k\right) q_k = \Delta z_k

A possible solution is q_k = \Delta z_k, \mu_k = \left(q_k^\top \Delta y_k\right)^{-1}, provided that q_k^\top \Delta y_k \neq 0.

This yields the iterative rank-one correction to the inverse Hessian approximation at each iteration:

\Delta B_k = \dfrac{(\Delta x_k - B_k \Delta y_k)(\Delta x_k - B_k \Delta y_k)^\top}{\langle \Delta x_k - B_k \Delta y_k , \Delta y_k\rangle}.
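A minimal sketch of this rank-one update as a function (the tolerance-based safeguard for a near-zero denominator is my own addition, a common practical precaution):

```python
import numpy as np

def broyden_update(B, dx, dy):
    """Rank-one correction: B_{k+1} = B_k + z z^T / <z, dy>,
    where z = dx - B dy."""
    z = dx - B @ dy
    denom = z @ dy
    # The formula is undefined when <z, dy> is (near) zero; a common
    # safeguard is to simply skip the update in that case.
    if abs(denom) < 1e-12 * np.linalg.norm(z) * np.linalg.norm(dy):
        return B
    return B + np.outer(z, z) / denom

# After the update, the secant equation B_{k+1} dy = dx holds by construction.
dx, dy = np.array([1.0, 0.0]), np.array([2.0, 1.0])
B1 = broyden_update(np.eye(2), dx, dy)
```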

3 Davidon–Fletcher–Powell method

Here the correction is sought as a symmetric rank-two update:

\Delta B_k = \mu_1 \Delta x_k (\Delta x_k)^\top + \mu_2 B_k \Delta y_k (B_k \Delta y_k)^\top.

Substituting this form into the equation for \Delta B_k and solving for \mu_1, \mu_2 gives:

\Delta B_k = \dfrac{(\Delta x_k)(\Delta x_k )^\top}{\langle \Delta x_k , \Delta y_k\rangle} - \dfrac{(B_k \Delta y_k)( B_k \Delta y_k)^\top}{\langle B_k \Delta y_k , \Delta y_k\rangle}.
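A sketch of the DFP correction as a function; as with the rank-one update, the secant equation holds after the update, and the result stays symmetric:

```python
import numpy as np

def dfp_update(B, dx, dy):
    """Rank-two DFP correction to the inverse Hessian approximation."""
    By = B @ dy
    return (B
            + np.outer(dx, dx) / (dx @ dy)
            - np.outer(By, By) / (By @ dy))

dx, dy = np.array([1.0, 0.0]), np.array([2.0, 1.0])
B1 = dfp_update(np.eye(2), dx, dy)
```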

4 Broyden–Fletcher–Goldfarb–Shanno method

Here the correction is sought in the more general rank-two form:

\Delta B_k = Q U Q^\top, \quad Q = [q_1, q_2], \quad q_1, q_2 \in \mathbb{R}^n, \quad U = \begin{pmatrix} a & c\\ c & b \end{pmatrix}.

Solving the secant equation in this class of corrections leads to the DFP formula plus one extra rank-one term:

\Delta B_k = \dfrac{(\Delta x_k)(\Delta x_k )^\top}{\langle \Delta x_k , \Delta y_k\rangle} - \dfrac{(B_k \Delta y_k)( B_k \Delta y_k)^\top}{\langle B_k \Delta y_k , \Delta y_k\rangle} + p_k p_k^\top,

where the standard choice is p_k = \sqrt{\langle B_k \Delta y_k, \Delta y_k\rangle}\left(\dfrac{\Delta x_k}{\langle \Delta x_k, \Delta y_k\rangle} - \dfrac{B_k \Delta y_k}{\langle B_k \Delta y_k, \Delta y_k\rangle}\right). Since p_k^\top \Delta y_k = 0, the extra term does not affect the secant equation.
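The BFGS update is often implemented in the equivalent product form B_{k+1} = V_k B_k V_k^\top + \rho_k \Delta x_k \Delta x_k^\top with V_k = I - \rho_k \Delta x_k \Delta y_k^\top and \rho_k = 1/\langle \Delta y_k, \Delta x_k \rangle; a sketch:

```python
import numpy as np

def bfgs_update(B, dx, dy):
    """BFGS correction to the inverse Hessian approximation, written in
    the standard product form V B V^T + rho dx dx^T."""
    rho = 1.0 / (dy @ dx)
    I = np.eye(len(dx))
    V = I - rho * np.outer(dx, dy)
    return V @ B @ V.T + rho * np.outer(dx, dx)

# The secant equation holds: V^T dy = dy - dy = 0, so B1 dy = dx.
dx, dy = np.array([1.0, 0.0]), np.array([2.0, 1.0])
B1 = bfgs_update(np.eye(2), dx, dy)
```

A useful property of this form: when \langle \Delta y_k, \Delta x_k \rangle > 0, the update preserves positive definiteness of B_k.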

5 Code
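A minimal quasi-Newton sketch putting the pieces together: the general iteration x_{k+1} = x_k + \alpha_k s_k with s_k = -B_k \nabla f(x_k), the BFGS update for B_k, and a backtracking (Armijo) line search for \alpha_k. The tolerances, the curvature-condition safeguard, and the line-search constants are illustrative choices, not part of the derivation above.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=100):
    """Minimize f via BFGS with backtracking line search.
    B approximates the inverse Hessian; the initial guess is I_n."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    B = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        s = -B @ g                          # quasi-Newton direction
        # Backtracking line search: shrink alpha until Armijo holds.
        alpha, c = 1.0, 1e-4
        while f(x + alpha * s) > f(x) + c * alpha * (g @ s):
            alpha *= 0.5
        x_new = x + alpha * s
        g_new = grad(x_new)
        dx, dy = x_new - x, g_new - g
        if dy @ dx > 1e-12:                 # curvature condition; else skip update
            rho = 1.0 / (dy @ dx)
            V = np.eye(n) - rho * np.outer(dx, dy)
            B = V @ B @ V.T + rho * np.outer(dx, dx)
        x, g = x_new, g_new
    return x

# Usage on a toy quadratic f(x) = 1/2 x^T A x - b^T x.
A = np.array([[10.0, 2.0], [2.0, 3.0]])
b = np.array([1.0, -1.0])
x_min = bfgs(lambda x: 0.5 * x @ A @ x - b @ x,
             lambda x: A @ x - b, x0=[5.0, 5.0])
```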