Let’s consider an illustrative example of a simple function of two variables:

$$f(x_1, x_2) = x_1^2 + x_2^2$$

Now, let’s introduce new variables $y_1, y_2$, where $y = Bx$ with $B = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$, i.e. $y_1 = 2x_1$ and $y_2 = x_2$. The same function, written in the new coordinates, is

$$\hat{f}(y_1, y_2) = \frac{y_1^2}{4} + y_2^2, \qquad \nabla_y \hat{f} = \begin{pmatrix} y_1/2 \\ 2y_2 \end{pmatrix} = B^{-\top} \nabla_x f.$$

Let’s summarize what happened:

  • We have a transformation of a vector space described by a coordinate transformation matrix B.
  • Coordinate vectors transform as $y = Bx$.
  • However, the partial gradient of a function w.r.t. the coordinates transforms as $\nabla_y \hat{f} = B^{-\top} \nabla_x f$.
  • Therefore, there seems to exist one type of mathematical objects (e.g. coordinate vectors) which transforms with $B$, and a second type of mathematical objects (e.g. the partial gradient of a function w.r.t. coordinates) which transforms with $B^{-\top}$.

These two types are called contra-variant and co-variant. This should at least tell us that the so-called “gradient vector” is indeed somewhat different from a “normal vector”: it behaves inversely under coordinate transformations.
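The two transformation rules can be checked numerically. The following is a small sketch under assumptions: an illustrative function $f(x) = x_1^2 + x_2^2$ and an arbitrary invertible matrix `B` (both chosen here for demonstration, not taken from a specific model).

```python
import numpy as np

# Assumed illustrative function f(x) = x1^2 + x2^2 and its analytic gradient.
def f(x):
    return x[0]**2 + x[1]**2

def grad_f(x):
    return np.array([2*x[0], 2*x[1]])

# An assumed invertible coordinate transformation y = B x.
B = np.array([[2.0, 0.0],
              [0.0, 1.0]])
B_inv = np.linalg.inv(B)

x = np.array([0.7, -1.3])
y = B @ x  # coordinate vectors transform with B (contra-variant)

# Gradient of f_hat(y) = f(B^{-1} y) at y, by central finite differences.
eps = 1e-6
g_y = np.array([
    (f(B_inv @ (y + eps*np.eye(2)[i])) - f(B_inv @ (y - eps*np.eye(2)[i]))) / (2*eps)
    for i in range(2)
])

# The gradient transforms with B^{-T} (co-variant), not with B.
print(np.allclose(g_y, B_inv.T @ grad_f(x), atol=1e-5))  # True
```

Computing the transformed gradient by finite differences (rather than reusing the analytic formula) makes the check non-circular: the agreement with $B^{-\top} \nabla_x f$ is measured, not assumed.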

A nice thing here is that the steepest descent direction on a sphere transforms as a covariant vector, since it is proportional to the gradient itself:

$$-\nabla_y \hat{f} = -B^{-\top} \nabla_x f.$$
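One consequence worth seeing numerically: because a plain gradient-descent step subtracts a covariant object (the gradient) from a contra-variant one (the coordinates), the update is not invariant under a change of coordinates. The sketch below uses the same assumed function and matrix as before (illustrative choices, not from the original text).

```python
import numpy as np

def grad_f(x):
    return np.array([2*x[0], 2*x[1]])  # gradient of f(x) = x1^2 + x2^2 (assumed example)

B = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # assumed coordinate change y = B x
B_inv = np.linalg.inv(B)

x = np.array([0.7, -1.3])
lr = 0.1

# Gradient step taken in the original coordinates:
x_new = x - lr * grad_f(x)

# Gradient step taken in the new coordinates, then mapped back:
y = B @ x
grad_y = B_inv.T @ grad_f(x)  # the gradient transforms with B^{-T}
y_new = y - lr * grad_y
x_from_y = B_inv @ y_new      # = x - lr * B^{-1} B^{-T} grad_f(x)

# The two updates disagree unless B is orthogonal:
print(np.allclose(x_new, x_from_y))  # False
```

This mismatch is exactly what a metric fixes: multiplying the gradient by the inverse metric turns it into a contra-variant object that can be consistently added to coordinates.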

Steepest descent in distribution space

Suppose we have a probabilistic model represented by its likelihood $p(x \mid \theta)$. We want to maximize this likelihood function, i.e. find the most likely parameter $\theta$ given the observations. An equivalent formulation is to minimize the loss function $L(\theta) = -\log p(x \mid \theta)$, the negative logarithm of the likelihood function.
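As a concrete instance (the model and all values here are assumptions for the sketch): fitting the mean of a unit-variance Gaussian by minimizing the negative log-likelihood with plain gradient descent recovers the maximum-likelihood estimate, which for this model is the sample mean.

```python
import numpy as np

# Assumed toy data: 1000 samples from N(3, 1).
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

def nll(theta):
    # L(theta) = -sum_i log N(x_i | theta, 1), up to an additive constant.
    return 0.5 * np.sum((data - theta)**2)

def nll_grad(theta):
    # dL/dtheta
    return -np.sum(data - theta)

theta = 0.0
lr = 1e-4
for _ in range(200):
    theta -= lr * nll_grad(theta)

# Gradient descent converges to the sample mean, the MLE for this model.
print(np.isclose(theta, data.mean(), atol=1e-3))  # True
```

Note that the learning rate is scaled to the dataset size (`lr * n = 0.1` here), since the summed gradient grows with the number of observations.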



