Natural gradient descent

1 Intuition

Let’s consider an illustrative example: a simple function of two variables.

$$
f(x_1, x_2) = 2x_1 + \frac{1}{3}x_2, \quad \nabla_x f = \begin{pmatrix} 2\\ \frac{1}{3} \end{pmatrix}
$$

Now, let’s introduce new variables $(y_1, y_2) = (2x_1, \frac{1}{3}x_2)$, or $y = Bx$, where $B = \begin{pmatrix} 2 & 0\\ 0 & \frac{1}{3} \end{pmatrix}$. The same function, written in the new coordinates, is

$$
f(y_1, y_2) = y_1 + y_2, \quad \nabla_y f = \begin{pmatrix} 1\\ 1 \end{pmatrix}
$$

Let’s summarize what happened:

  • We have a transformation of a vector space described by a coordinate transformation matrix $B$.
  • Coordinate vectors transform as $y = Bx$.
  • However, the gradient of a function w.r.t. the coordinates transforms as $\frac{\partial f}{\partial y} = B^{-\top} \frac{\partial f}{\partial x}$.
  • Therefore, there seem to be two types of mathematical objects: one (e.g. coordinate vectors) transforms with $B$, and the other (e.g. the gradient of a function w.r.t. the coordinates) transforms with $B^{-\top}$.

These two types are called contra-variant and co-variant, respectively. This should at least tell us that the so-called “gradient vector” is somewhat different from a “normal vector”: it behaves inversely under coordinate transformations.
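As a quick sanity check, the two transformation rules can be verified on the example function with exact rational arithmetic. This is only a minimal sketch: the helper functions (`inverse_2x2`, `transpose`, `matvec`) and the test point `x` are illustrative choices, not part of the original text.

```python
from fractions import Fraction as F

# Gradient of f(x1, x2) = 2*x1 + x2/3 in the original coordinates.
grad_x = [F(2), F(1, 3)]

# Coordinate transformation y = B x from the example, B = diag(2, 1/3).
B = [[F(2), F(0)],
     [F(0), F(1, 3)]]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


# Co-variant rule: the gradient transforms with B^{-T}.
grad_y = matvec(transpose(inverse_2x2(B)), grad_x)

# Contra-variant rule: coordinates transform with B.
x = [F(5), F(-3)]  # arbitrary test point (an assumption for illustration)
y = matvec(B, x)

# The function value must be the same in both coordinate systems.
f_x = 2 * x[0] + x[1] / 3
f_y = y[0] + y[1]

print(grad_y)    # gradient in the y coordinates
print(f_x, f_y)  # same value computed in both coordinate systems
```

The gradient obtained from the $B^{-\top}$ rule matches $\nabla_y f$ computed directly from $f(y_1, y_2) = y_1 + y_2$.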

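The nice thing is that the steepest descent direction $A_x^{-1}\nabla_x f$ on a sphere defined by a metric $A_x$ (so that the squared length of a step $\delta$ is $\delta^\top A_x \delta$) transforms as a contra-variant vector, just like the coordinates themselves, since the metric transforms as $A_y = B^{-\top} A_x B^{-1}$: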

$$
\begin{split}
A_y^{-1}\nabla_y f &= (B^{-\top} A_x B^{-1})^{-1} B^{-\top} \nabla_x f \\
&= B A_x^{-1} B^\top B^{-\top} \nabla_x f \\
&= B (A_x^{-1} \nabla_x f)
\end{split}
$$
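The identity above can also be checked numerically. The sketch below uses exact rational arithmetic and assumes an arbitrary invertible matrix `B` and an arbitrary symmetric positive definite metric `A_x`. Both are made-up values chosen for illustration, and the helper functions are hypothetical.

```python
from fractions import Fraction as F


def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(p, q):
    q_cols = list(zip(*q))
    return [[sum(a * b for a, b in zip(row, col)) for col in q_cols] for row in p]


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


# Illustrative inputs (assumptions, not taken from the text).
B = [[F(2), F(1)], [F(1), F(1)]]      # invertible, non-diagonal
A_x = [[F(2), F(1)], [F(1), F(3)]]    # symmetric positive definite metric
grad_x = [F(2), F(1, 3)]              # gradient of the example function

B_inv = inverse_2x2(B)
B_inv_T = transpose(B_inv)

# Transformation rules: A_y = B^{-T} A_x B^{-1}, grad_y = B^{-T} grad_x.
A_y = matmul(matmul(B_inv_T, A_x), B_inv)
grad_y = matvec(B_inv_T, grad_x)

# Steepest descent directions computed independently in each coordinate system.
d_x = matvec(inverse_2x2(A_x), grad_x)
d_y = matvec(inverse_2x2(A_y), grad_y)

# d_y equals B d_x, so the direction transforms like a coordinate vector.
print(d_y == matvec(B, d_x))
# The plain gradient does not transform with B.
print(grad_y == matvec(B, grad_x))
```

With these inputs the first comparison holds exactly and the second does not, which is the difference between the metric-corrected direction and the raw gradient.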

2 Steepest descent in distribution space

Suppose we have a probabilistic model represented by its likelihood $p(x \mid \theta)$. We want to maximize this likelihood to find the most likely parameter $\theta$ given the observations. An equivalent formulation is to minimize the loss function $\mathcal{L}(\theta) = -\log p(x \mid \theta)$, the negative logarithm of the likelihood.
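To make the equivalence concrete, here is a minimal sketch that minimizes the negative log-likelihood with plain gradient descent. It assumes a Gaussian model with unknown mean $\theta$ and unit variance. The model, the data, the step size, and the iteration count are all illustrative assumptions, not something specified in the text.

```python
import math

# Made-up observations, used only for illustration.
observations = [1.2, 0.7, 2.1, 1.6, 0.9]


def log_likelihood(theta, xs):
    """Log-likelihood of i.i.d. samples under N(theta, 1) (assumed model)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in xs)


def loss(theta, xs):
    """Negative log-likelihood: minimizing it maximizes the likelihood."""
    return -log_likelihood(theta, xs)


def loss_grad(theta, xs):
    """Derivative of the loss with respect to theta."""
    return sum(theta - x for x in xs)


theta = 0.0   # initial guess
step = 0.1    # step size, chosen small enough for this data set
for _ in range(200):
    theta -= step * loss_grad(theta, observations)

# For this model the maximum-likelihood estimate is the sample mean.
sample_mean = sum(observations) / len(observations)
print(theta, sample_mean)
```

For this particular model the minimizer of the loss coincides with the sample mean, so the two printed values should agree up to floating-point error.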

3 Example

4 References

5 Code
