Natural gradient descent
1 Intuition
Let’s consider an illustrative example of a simple function of two variables:
f(x_1, x_2) = 2x_1 + \frac{1}{3}x_2, \quad \nabla_x f = \begin{pmatrix} 2\\ \frac{1}{3} \end{pmatrix}
Now, let’s introduce new variables $(y_1, y_2) = (2x_1, \frac{1}{3}x_2)$, or y = Bx, where B = \begin{pmatrix} 2 & 0\\ 0 & \frac{1}{3} \end{pmatrix}. The same function, written in the new coordinates, is
f(y_1, y_2) = y_1 + y_2, \quad \nabla_y f = \begin{pmatrix} 1\\ 1 \end{pmatrix}
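As a quick numerical check (a minimal NumPy sketch; B and the two gradients are exactly those from the example above), we can verify that the gradient transforms with $B^{-\top}$:

```python
import numpy as np

# Coordinate transformation y = B x from the example above
B = np.array([[2.0, 0.0],
              [0.0, 1.0 / 3.0]])

# Gradient of f(x1, x2) = 2*x1 + (1/3)*x2 with respect to x
grad_x = np.array([2.0, 1.0 / 3.0])

# Gradient of the same function written in y-coordinates, f(y1, y2) = y1 + y2
grad_y = np.array([1.0, 1.0])

# The gradient transforms with B^{-T} (covariantly), not with B
assert np.allclose(grad_y, np.linalg.inv(B).T @ grad_x)
```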
Let’s summarize what happened:
- We have a transformation of a vector space described by a coordinate transformation matrix B.
- Coordinate vectors transform as y = Bx.
- However, the partial gradient of a function w.r.t. the coordinates transforms as \frac{\partial f}{\partial y} = B^{-\top} \frac{\partial f}{\partial x}.
- Therefore, there seem to exist two types of mathematical objects: one type (e.g. coordinate vectors) that transforms with B, and a second type (e.g. the partial gradient of a function w.r.t. the coordinates) that transforms with B^{-\top}.
These two types are called contravariant and covariant, respectively. This should at least tell us that the so-called “gradient vector” is indeed somewhat different from a “normal vector”: it behaves inversely under coordinate transformations.
The nice thing here is that the steepest descent direction A_x^{-1}\nabla_x f (the direction of steepest descent when step lengths are measured on a sphere in the metric defined by a symmetric positive-definite matrix A_x, i.e. \|\delta\|_{A_x}^2 = \delta^\top A_x \delta) transforms like a coordinate vector, i.e. contravariantly, since A_y = B^{-\top} A_x B^{-1}:
\begin{split} A_y^{-1}\nabla_y f &= (B^{-\top} A_x B^{-1})^{-1} B^{-\top} \nabla_x f \\ &= B A_x^{-1} B^\top B^{-\top} \nabla_x f \\ &= B \left(A_x^{-1} \nabla_x f\right) \end{split}
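Continuing the NumPy sketch, we can check this identity numerically; the metric $A_x$ below is an arbitrary symmetric positive-definite matrix chosen purely for illustration:

```python
import numpy as np

B = np.array([[2.0, 0.0],
              [0.0, 1.0 / 3.0]])
grad_x = np.array([2.0, 1.0 / 3.0])
grad_y = np.linalg.inv(B).T @ grad_x          # covariant transformation of the gradient

# Arbitrary symmetric positive-definite metric in x-coordinates (illustrative choice)
A_x = np.array([[3.0, 1.0],
                [1.0, 2.0]])
A_y = np.linalg.inv(B).T @ A_x @ np.linalg.inv(B)   # A_y = B^{-T} A_x B^{-1}

lhs = np.linalg.solve(A_y, grad_y)            # A_y^{-1} grad_y
rhs = B @ np.linalg.solve(A_x, grad_x)        # B (A_x^{-1} grad_x)
assert np.allclose(lhs, rhs)                  # transforms like a coordinate vector
```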
2 Steepest descent in distribution space
Suppose we have a probabilistic model represented by its likelihood $p(x \mid \theta)$. We want to maximize this likelihood function to find the most likely parameters \theta given the observations x. An equivalent formulation is to minimize the loss function \mathcal{L}(\theta) = -\log p(x \mid \theta), i.e. the negative log-likelihood.
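As a small illustration (a hedged sketch, not taken from the original text), here is the negative log-likelihood loss for a univariate Gaussian model with parameters $\theta = (\mu, \log\sigma)$; both the parametrization and the data are made up for the example:

```python
import numpy as np

def neg_log_likelihood(theta, x):
    """L(theta) = -sum_i log p(x_i | theta) for a univariate Gaussian model.

    theta = (mu, log_sigma); the log-sigma parametrization keeps sigma > 0.
    """
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2))

# Illustrative observations; the maximum-likelihood theta minimizes this loss
x = np.array([0.9, 1.1, 1.3, 0.7])
print(neg_log_likelihood(np.array([1.0, 0.0]), x))
```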
3 Example
4 References
5 Code
Open In Colab{: .btn }