# Useful definitions and notations

We will treat all vectors as column vectors by default.

## Matrix and vector multiplication

Let $A$ be $m \times n$, and $B$ be $n \times p$, and let the product $AB$ be

$$C = AB,$$

then $C$ is an $m \times p$ matrix, with element $(i, j)$ given by

$$c_{ij} = \sum\limits_{k=1}^{n} a_{ik} b_{kj}$$

Let $A$ be $m \times n$, and $x$ be $n \times 1$, then the typical element of the product

$$z = Ax$$

is given by

$$z_i = \sum\limits_{k=1}^{n} a_{ik} x_k$$

Finally, a few reminders:

• $C = AB \implies C^\top = B^\top A^\top$
• $AB \neq BA$ in general
• $e^{A} =\sum\limits_{k=0}^{\infty }{1 \over k!}A^{k}$
• $e^{A+B} \neq e^{A} e^{B}$ in general (equality holds when $A$ and $B$ commute)
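
As a quick sanity check, the product formula and the transpose identity can be verified numerically with NumPy (a sketch; the matrix sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # m x n
B = rng.standard_normal((4, 2))   # n x p

C = A @ B                         # C is m x p = 3 x 2
assert C.shape == (3, 2)

# element (i, j) is the sum over k of a_ik * b_kj
i, j = 1, 0
assert np.isclose(C[i, j], sum(A[i, k] * B[k, j] for k in range(4)))

# (AB)^T = B^T A^T
assert np.allclose(C.T, B.T @ A.T)
```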

## Gradient

Let $f(x):\mathbb{R}^n \to \mathbb{R}$. The gradient is the vector that contains all first-order partial derivatives:

$$\nabla f(x) = \dfrac{df}{dx} = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{pmatrix}$$

## Hessian

Let $f(x):\mathbb{R}^n \to \mathbb{R}$. The Hessian is the matrix containing all second-order partial derivatives:

$$f''(x) = H_f(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$

More generally, the Hessian of a vector-valued function $f(x): \mathbb{R}^n \to \mathbb{R}^m$ is a 3D tensor: each slice is the Hessian of the corresponding scalar component, $\left( H\left(f_1(x)\right), H\left(f_2(x)\right), \ldots, H\left(f_m(x)\right)\right)$.

## Jacobian

The Jacobian extends the gradient to a multidimensional function $f(x):\mathbb{R}^n \to \mathbb{R}^m$:

$$J_f(x) = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}$$
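
A hand-computed Jacobian can be checked against central finite differences. The sketch below uses a hypothetical map $f:\mathbb{R}^2 \to \mathbb{R}^3$; the function and the step size $h$ are illustrative choices:

```python
import numpy as np

# Hypothetical test map f: R^2 -> R^3, f(x) = (x1*x2, sin x1, x2^2)
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

# Analytic Jacobian (m x n = 3 x 2): row i is the gradient of f_i
def jacobian(x):
    return np.array([
        [x[1],         x[0]],
        [np.cos(x[0]), 0.0],
        [0.0,          2 * x[1]],
    ])

x = np.array([0.5, -1.5])
h = 1e-6
# column j of the Jacobian approximated by a central difference in x_j
J_numeric = np.column_stack([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(2)
])
assert np.allclose(J_numeric, jacobian(x), atol=1e-6)
```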

## Summary

Here $f: X \to Y$, and $G$ is the space the derivative belongs to.

| X | Y | G | Name |
|:---:|:---:|:---:|:---:|
| $\mathbb{R}$ | $\mathbb{R}$ | $\mathbb{R}$ | $f'(x)$ (derivative) |
| $\mathbb{R}^n$ | $\mathbb{R}$ | $\mathbb{R}^n$ | $\dfrac{\partial f}{\partial x_i}$ (gradient) |
| $\mathbb{R}^n$ | $\mathbb{R}^m$ | $\mathbb{R}^{m \times n}$ | $\dfrac{\partial f_i}{\partial x_j}$ (Jacobian) |
| $\mathbb{R}^{m \times n}$ | $\mathbb{R}$ | $\mathbb{R}^{m \times n}$ | $\dfrac{\partial f}{\partial x_{ij}}$ |

The gradient of $f(x)$ indicates the direction of steepest ascent, so the vector $-\nabla f(x)$ gives the direction of steepest descent of the function at the point. Moreover, the gradient vector is always orthogonal to the contour line through the point.
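
A minimal numerical illustration of both the definition and the descent property, assuming the simple test function $f(x) = \|x\|^2$ with gradient $2x$:

```python
import numpy as np

def f(x):
    # f(x) = ||x||^2, whose gradient is 2x
    return float(x @ x)

def numeric_grad(f, x, h=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 3.0])
assert np.allclose(numeric_grad(f, x), 2 * x, atol=1e-6)

# a small step along -grad f decreases the function value
step = x - 0.1 * (2 * x)
assert f(step) < f(x)
```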

# General concept

## Naive approach

The basic idea of the naive approach is to reduce matrix/vector derivatives to well-known scalar derivatives. One of the most important practical tricks here is to keep the index of summation ($i$) separate from the index of the partial derivative ($k$). Ignoring this simple rule tends to produce mistakes.
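
As a sketch of the naive approach for $f(x) = x^\top A x = \sum_i \sum_j a_{ij} x_i x_j$: differentiating with respect to $x_k$ (note that $k$ is kept distinct from the summation indices $i, j$) gives $\partial f / \partial x_k = \sum_j a_{kj} x_j + \sum_i a_{ik} x_i$, i.e. $\nabla f = (A + A^\top)x$. This can be checked element by element:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Naive, index-by-index differentiation of f(x) = sum_i sum_j a_ij x_i x_j:
#   df/dx_k = sum_j a_kj x_j + sum_i a_ik x_i
grad_naive = np.array([
    sum(A[k, j] * x[j] for j in range(n)) + sum(A[i, k] * x[i] for i in range(n))
    for k in range(n)
])

# The same result in matrix form: grad f = (A + A^T) x
grad_matrix = (A + A.T) @ x
assert np.allclose(grad_naive, grad_matrix)
```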

## Guru approach

The guru approach is to formulate a set of simple rules that let you calculate derivatives just as in the scalar case. It is convenient to use differential notation here.

### Differentials

After obtaining the differential $df$, we can retrieve the gradient using the following formula:

$$df(x) = \langle \nabla f(x), dx \rangle$$

Then, if we have a differential of the above form and need to calculate the second derivative of the matrix/vector function, we treat the "old" $dx$ as a constant $dx_1$, then calculate $d(df)$:

$$d^2f(x) = \langle \nabla^2 f(x)\, dx_1, dx \rangle = \langle H_f(x)\, dx_1, dx \rangle$$
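
As a sketch, for the quadratic $f(x) = x^\top A x$ these rules give $df = \langle (A + A^\top)x, dx\rangle$ and, differentiating again with $dx$ held fixed as $dx_1$, $d^2f = \langle (A + A^\top)\,dx_1, dx\rangle$, so $H_f = A + A^\top$. This can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# From the differential rules: grad f = (A + A^T) x, H_f = A + A^T
H = A + A.T

def f(v):
    return v @ A @ v

dx = 1e-4 * rng.standard_normal(n)
dx1 = 1e-4 * rng.standard_normal(n)

# For a quadratic, the mixed second difference equals <H dx1, dx> exactly
d2f_numeric = f(x + dx + dx1) - f(x + dx) - f(x + dx1) + f(x)
assert np.isclose(d2f_numeric, H @ dx1 @ dx, atol=1e-12)
```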

### Properties

Let $A$ and $B$ be constant matrices, and let $X$ and $Y$ be variables (or matrix functions).

• $dA = 0$
• $d(\alpha X) = \alpha (dX)$
• $d(AXB) = A(dX )B$
• $d(X+Y) = dX + dY$
• $d(X^\top) = (dX)^\top$
• $d(XY) = (dX)Y + X(dY)$
• $d\langle X, Y\rangle = \langle dX, Y\rangle+ \langle X, dY\rangle$
• $d\left( \dfrac{X}{\phi}\right) = \dfrac{\phi dX - (d\phi) X}{\phi^2}$
• $d\left( \det X \right) = \det X \langle X^{-\top}, dX \rangle$
• $d \text{tr } X = \langle I, dX\rangle$
• $df(g(x)) = \dfrac{df}{dg} \cdot dg(x)$
• $H = (J(\nabla f))^T$
• $d(X^{-1})=-X^{-1}(dX)X^{-1}$
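
Two of the less obvious rules, $d(\det X)$ and $d(X^{-1})$, can be sanity-checked numerically to first order (a sketch with an arbitrary well-conditioned $X$ and a small perturbation $dX$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
X = rng.standard_normal((n, n)) + 5 * np.eye(n)  # shift keeps X well-conditioned
dX = 1e-6 * rng.standard_normal((n, n))          # small perturbation

X_inv = np.linalg.inv(X)

# d(X^{-1}) = -X^{-1} (dX) X^{-1}
lhs = np.linalg.inv(X + dX) - X_inv
rhs = -X_inv @ dX @ X_inv
assert np.allclose(lhs, rhs, atol=1e-9)

# d(det X) = det(X) <X^{-T}, dX>, with <A, B> = tr(A^T B),
# so <X^{-T}, dX> = tr(X^{-1} dX)
d_det = np.linalg.det(X + dX) - np.linalg.det(X)
predicted = np.linalg.det(X) * np.trace(X_inv @ dX)
assert np.isclose(d_det, predicted, atol=1e-8)
```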