Natural gradient descent
1 Intuition
Let's consider an illustrative example of a simple function of two variables:

$$f(x_1, x_2) = x_1^2 + x_2^2, \qquad \nabla_x f = (2x_1,\, 2x_2)^\top.$$

Now, let's introduce new variables $(y_1, y_2) = (2x_1, x_2)$, or $y = Bx$, where $B = \operatorname{diag}(2, 1)$. The same function, written in the new coordinates, is

$$f(y_1, y_2) = \frac{y_1^2}{4} + y_2^2, \qquad \nabla_y f = \left(\frac{y_1}{2},\, 2y_2\right)^\top = B^{-\top} \nabla_x f.$$
Let's summarize what happened:
- We have a transformation of a vector space described by a coordinate transformation matrix $B$.
- Coordinate vectors transform as $y = Bx$.
- However, the gradient of a function w.r.t. the coordinates transforms as $\nabla_y f = B^{-\top} \nabla_x f$.
- Therefore, there seem to exist two types of mathematical objects: one (e.g. coordinate vectors) that transforms with $B$, and another (e.g. the gradient of a function w.r.t. the coordinates) that transforms with $B^{-\top}$.
These two types are called contravariant and covariant. This should at least tell us that the so-called "gradient vector" is indeed somewhat different from a "normal vector": it behaves inversely under coordinate transformations.
A nice observation is that the steepest descent direction within a small sphere $\|d\| \leq \epsilon$ transforms as a covariant vector, since to first order it is proportional to the negative gradient:

$$d^\ast = \arg\min_{\|d\| \leq \epsilon} f(x + d) \approx -\epsilon\, \frac{\nabla_x f}{\|\nabla_x f\|} \propto -\nabla_x f.$$
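As a quick sanity check, here is a minimal numerical sketch of the two transformation rules, assuming the quadratic $f$ and $B = \operatorname{diag}(2, 1)$ from the example above:

```python
import numpy as np

# f(x1, x2) = x1^2 + x2^2 in the original coordinates
grad_x = lambda x: np.array([2 * x[0], 2 * x[1]])
# The same f written in y-coordinates is y1^2 / 4 + y2^2
grad_y = lambda y: np.array([y[0] / 2, 2 * y[1]])

B = np.diag([2.0, 1.0])   # coordinate transformation y = B x
x = np.array([1.0, 3.0])
y = B @ x                 # contravariant: coordinates transform with B

# Covariant: the gradient transforms with B^{-T}
print(grad_y(y))                       # [1. 6.]
print(np.linalg.inv(B).T @ grad_x(x))  # [1. 6.]
```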
2 Steepest descent in distribution space
Suppose we have a probabilistic model represented by its likelihood $p(x \mid \theta)$. We want to maximize this likelihood function to find the parameters $\theta$ that are most likely given the observations. An equivalent formulation is to minimize the loss function $\mathcal{L}(\theta) = -\log p(x \mid \theta)$, the negative log-likelihood.
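To make this concrete, here is a minimal sketch of natural gradient descent on the negative log-likelihood: steepness is measured in distribution space via the KL divergence, whose local quadratic approximation is the Fisher information matrix $F(\theta)$, giving the update $\theta \leftarrow \theta - \eta\, F(\theta)^{-1} \nabla_\theta \mathcal{L}$. The univariate Gaussian model $p(x \mid \theta) = \mathcal{N}(x \mid \mu, \sigma^2)$ with $\theta = (\mu, \log \sigma)$, as well as the learning rate, are illustrative assumptions here, not choices fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=1000)  # observations

def nll_grad(theta, x):
    """Gradient of the average negative log-likelihood of N(mu, sigma^2),
    parameterized as theta = (mu, log_sigma)."""
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    d_mu = -np.mean(x - mu) / sigma2
    d_log_sigma = 1.0 - np.mean((x - mu) ** 2) / sigma2
    return np.array([d_mu, d_log_sigma])

def fisher(theta):
    """Closed-form Fisher information of N(mu, sigma^2) in (mu, log_sigma)
    coordinates: diag(1 / sigma^2, 2)."""
    _, log_sigma = theta
    return np.diag([np.exp(-2 * log_sigma), 2.0])

theta = np.array([0.0, 0.0])  # initial guess: mu = 0, sigma = 1
for _ in range(100):
    g = nll_grad(theta, data)
    theta = theta - 0.5 * np.linalg.solve(fisher(theta), g)  # natural gradient step

print(theta[0], np.exp(theta[1]))  # approaches the sample mean and std
```

For this model the Fisher matrix is available in closed form; in general it has to be estimated, e.g. from the outer products of per-sample score vectors.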
3 Example
4 References
5 Code
Open In Colab{: .btn }