Useful definitions and notations

We will treat all vectors as column vectors by default.

Matrix and vector multiplication

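Let $A$ be $m \times n$, and $B$ be $n \times p$, and let the product $AB$ be

$$
C = AB
$$

then $C$ is an $m \times p$ matrix, with element $(i, j)$ given by

$$
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
$$

Let $A$ be $m \times n$, and $x$ be $n \times 1$, then the typical element of the product

$$
z = Ax
$$

is given by

$$
z_i = \sum_{k=1}^{n} a_{ik} x_k
$$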
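The elementwise definitions $c_{ij} = \sum_k a_{ik} b_{kj}$ and $z_i = \sum_k a_{ik} x_k$ translate directly into code. The following is a minimal sketch in plain Python; the helper names `matmul` and `matvec` and the sample matrices are illustrative assumptions, not part of the original text.

```python
def matmul(A, B):
    """C = AB via c_ij = sum_k a_ik * b_kj, with A of size m x n and B of size n x p."""
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]


def matvec(A, x):
    """z = Ax via z_i = sum_k a_ik * x_k, with the column vector x stored as a flat list."""
    return [sum(a_ik * x_k for a_ik, x_k in zip(row, x)) for row in A]


A = [[1, 2, 3],
     [4, 5, 6]]   # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]      # 3 x 2
x = [1, 0, -1]    # treated as a 3 x 1 column vector

C = matmul(A, B)  # 2 x 2 matrix
z = matvec(A, x)  # vector with 2 entries
```

Note how the shapes follow the rule $(m \times n)(n \times p) \to (m \times p)$: the inner dimension $n$ is summed out.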

Finally, just to remind:

Gradient

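Let $f(x): \mathbb{R}^n \to \mathbb{R}$. The gradient is the vector that contains all first-order partial derivatives:

$$
\nabla f(x) = \dfrac{df}{dx} = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{pmatrix}
$$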
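A gradient computed by hand can be sanity-checked numerically. The sketch below approximates each partial derivative with a central difference; the helper `numerical_gradient`, the test function, and the step size `h` are assumptions chosen for illustration.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f: R^n -> R at x by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # Example scalar function of two variables: f(x) = x_1^2 + 3 x_1 x_2
    return x[0] ** 2 + 3 * x[0] * x[1]


def grad_f(x):
    # Gradient derived by hand: (2 x_1 + 3 x_2, 3 x_1)
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


point = [1.0, 2.0]
approx = numerical_gradient(f, point)
exact = grad_f(point)
```

The two results should agree up to rounding error; a mismatch usually points to a mistake in the hand derivation.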

Hessian

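Let $f(x): \mathbb{R}^n \to \mathbb{R}$. The Hessian is the $n \times n$ matrix containing all second-order partial derivatives:

$$
f''(x) = \nabla^2 f(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1 \partial x_1} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2 \partial x_2} & \dots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \dfrac{\partial^2 f}{\partial x_n \partial x_n} \end{pmatrix}
$$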

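More generally, the Hessian can be a tensor: for a vector-valued function $f(x): \mathbb{R}^n \to \mathbb{R}^m$ it is a 3D tensor in which every slice is the Hessian of the corresponding scalar component, $\left(H(f_1(x)), H(f_2(x)), \ldots, H(f_m(x))\right)$.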
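As an illustration of the scalar case, the sketch below approximates the Hessian with second-order central differences and compares it with a hand-derived one. The helper `numerical_hessian`, the test function, and the step size are assumptions made for this example.

```python
def numerical_hessian(f, x, h=1e-4):
    """Approximate the n x n Hessian of f: R^n -> R at x by central differences."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                # Evaluate f at x + si*h*e_i + sj*h*e_j
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H


def f(x):
    # Example: f(x) = x_1^2 x_2 + x_2^3
    return x[0] ** 2 * x[1] + x[1] ** 3


def hessian_f(x):
    # Hessian derived by hand
    return [[2 * x[1], 2 * x[0]],
            [2 * x[0], 6 * x[1]]]


point = [1.0, 2.0]
approx = numerical_hessian(f, point)
exact = hessian_f(point)
```

For a twice continuously differentiable function the Hessian is symmetric, which the numerical result should reflect as well.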

Jacobian

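The extension of the gradient to a multidimensional function $f(x): \mathbb{R}^n \to \mathbb{R}^m$:

$$
f'(x) = \dfrac{df}{dx^\top} = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dots & \dfrac{\partial f_1}{\partial x_n} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dots & \dfrac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \dots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}
$$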
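The following sketch builds a Jacobian numerically, one column per input coordinate, to make the layout explicit: row $i$ corresponds to the output $f_i$ and column $j$ to the input $x_j$. The helper `numerical_jacobian` and the example function are illustrative assumptions.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at x by central differences.

    Entry [i][j] approximates d f_i / d x_j.
    """
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


def f(x):
    # Example map from R^2 to R^3
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]


def jacobian_f(x):
    # Jacobian derived by hand: 3 rows (outputs), 2 columns (inputs)
    return [[x[1], x[0]],
            [1.0, 1.0],
            [0.0, 2 * x[1]]]


point = [1.0, 2.0]
approx = numerical_jacobian(f, point)
exact = jacobian_f(point)
```

With $m = 1$ the Jacobian has a single row, which is the transposed gradient.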

Summary

For $f(x): X \to Y$, the derivative $\dfrac{\partial f(x)}{\partial x}$ belongs to the space $G$:

| $X$ | $Y$ | $G$ | Name |
|---|---|---|---|
| $\mathbb{R}$ | $\mathbb{R}$ | $\mathbb{R}$ | $f'(x)$ (derivative) |
| $\mathbb{R}^n$ | $\mathbb{R}$ | $\mathbb{R}^n$ | $\dfrac{\partial f}{\partial x_i}$ (gradient) |
| $\mathbb{R}^n$ | $\mathbb{R}^m$ | $\mathbb{R}^{m \times n}$ | $\dfrac{\partial f_i}{\partial x_j}$ (Jacobian) |
| $\mathbb{R}^{m \times n}$ | $\mathbb{R}$ | $\mathbb{R}^{m \times n}$ | $\dfrac{\partial f}{\partial x_{ij}}$ |

The vector $\nabla f(x)$ is called the gradient of $f(x)$. It indicates the direction of steepest ascent. Thus, the vector $-\nabla f(x)$ gives the direction of steepest descent of the function at the point $x$. Moreover, the gradient vector is always orthogonal to the contour line passing through that point.
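These geometric claims are easy to check on a small example. The sketch below uses an assumed quadratic function and an assumed step size `t`; it takes a short step along the gradient and against it, and checks that the gradient is orthogonal to a tangent direction of the contour line.

```python
def f(x):
    # Example function with elliptical contour lines
    return x[0] ** 2 + 2 * x[1] ** 2


def grad_f(x):
    # Gradient derived by hand
    return [2 * x[0], 4 * x[1]]


point = [1.0, 1.0]
g = grad_f(point)   # gradient at the point
t = 0.1             # small step size (assumption for this example)

uphill = [p + t * gi for p, gi in zip(point, g)]     # step along the gradient
downhill = [p - t * gi for p, gi in zip(point, g)]   # step against the gradient

# A tangent to the contour line is the gradient rotated by 90 degrees.
tangent = [-g[1], g[0]]
dot = sum(gi * ti for gi, ti in zip(g, tangent))
```

Here `f(downhill) < f(point) < f(uphill)`, and `dot` is zero, in line with the statements above.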

General concept

Naive approach

The basic idea of the naive approach is to reduce matrix/vector derivatives to the well-known scalar derivatives. One of the most important practical tricks here is to use separate indices for the summation (say, $i$) and for the partial derivative (say, $k$). Ignoring this simple rule tends to produce mistakes.
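As a worked illustration (the quadratic form below is an example chosen here, not taken from the original text), consider $f(x) = x^\top A x$ with a constant matrix $A \in \mathbb{R}^{n \times n}$. The summation runs over $i$ and $j$, while the derivative is taken with respect to a separate index $k$:

$$
\begin{aligned}
f(x) &= \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j, \\
\dfrac{\partial f}{\partial x_k} &= \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \left( \dfrac{\partial x_i}{\partial x_k} x_j + x_i \dfrac{\partial x_j}{\partial x_k} \right) = \sum_{j=1}^{n} a_{kj} x_j + \sum_{i=1}^{n} a_{ik} x_i = \left( (A + A^\top) x \right)_k .
\end{aligned}
$$

Collecting the components gives $\nabla f(x) = (A + A^\top) x$. Reusing $i$ for the derivative index would make the terms $\dfrac{\partial x_i}{\partial x_k}$ impossible to tell apart from the summation terms.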

Guru approach

The guru approach means formulating a set of simple rules that let you calculate derivatives just as in the scalar case. It may be convenient to use differential notation here.

Differentials

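After obtaining the differential $df$, we can retrieve the gradient using the following formula:

$$
df(x) = \langle \nabla f(x), dx \rangle
$$

Then, if we have a differential of the above form and need to calculate the second derivative of a matrix/vector function, we treat the "old" $dx$ as a constant $dx_1$, and then calculate $d(df)$ with a new increment $dx_2$:

$$
d(df) = d^2 f(x) = \langle \nabla^2 f(x)\, dx_1, dx_2 \rangle
$$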
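A short worked example may help (the function is chosen here for illustration). Take $f(x) = x^\top A x$ with a constant matrix $A$. Using the product rule for differentials and $\langle u, v \rangle = u^\top v$:

$$
\begin{aligned}
df &= (dx)^\top A x + x^\top A \, dx = \langle A x, dx \rangle + \langle A^\top x, dx \rangle = \langle (A + A^\top) x, dx \rangle, \\
\nabla f(x) &= (A + A^\top) x .
\end{aligned}
$$

For the second derivative, fix the old increment as $dx_1$ and differentiate again with a new increment $dx_2$:

$$
d(df) = d \langle (A + A^\top) x, dx_1 \rangle = \langle (A + A^\top) dx_2, dx_1 \rangle = \langle (A + A^\top) dx_1, dx_2 \rangle,
$$

where the last step uses the symmetry of $A + A^\top$. Hence $\nabla^2 f(x) = A + A^\top$.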

Properties

Let $A$ and $B$ be constant matrices, while $X$ and $Y$ are the variables (or matrix functions).
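Some standard rules of this kind are collected below as a sketch; this is a list of well-known identities and not necessarily the exact set intended by the original text. Here $A$ and $B$ are constant matrices, $X$ and $Y$ are variables, $\alpha$ is a constant scalar, $\langle X, Y \rangle = \mathrm{tr}(X^\top Y)$, and the rules involving $X^{-1}$ assume that $X$ is invertible:

$$
\begin{aligned}
dA &= 0, \\
d(\alpha X) &= \alpha \, dX, \\
d(AXB) &= A (dX) B, \\
d(X + Y) &= dX + dY, \\
d(X^\top) &= (dX)^\top, \\
d(XY) &= (dX) Y + X (dY), \\
d\langle X, Y \rangle &= \langle dX, Y \rangle + \langle X, dY \rangle, \\
d(\mathrm{tr}\, X) &= \mathrm{tr}(dX) = \langle I, dX \rangle, \\
d(X^{-1}) &= -X^{-1} (dX) X^{-1}, \\
d(\det X) &= \det X \, \langle X^{-\top}, dX \rangle .
\end{aligned}
$$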
