Matrix calculus
1 Basic linear algebra background
1.1 Vectors and matrices
We will treat all vectors as column vectors by default. The space of real vectors of length n is denoted by \mathbb{R}^n, while the space of real-valued m \times n matrices is denoted by \mathbb{R}^{m \times n}.¹ That is:
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad x^T = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}, \quad x \in \mathbb{R}^n, \; x_i \in \mathbb{R} \tag{1}

Similarly, if A \in \mathbb{R}^{m \times n}, we denote its transpose by A^T \in \mathbb{R}^{n \times m}:

A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix}, \quad A^T = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & \dots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \end{bmatrix}, \quad A \in \mathbb{R}^{m \times n}, \; a_{ij} \in \mathbb{R}

We will write x \geq 0 and x \neq 0 to indicate componentwise relationships.
A matrix is symmetric if A = A^T. We write A \in \mathbb{S}^n (the set of square symmetric matrices of dimension n). Note that, by definition, only a square matrix can be symmetric.
A matrix A \in \mathbb{S}^n is called positive (negative) definite if for all x \neq 0: x^T Ax > (<)\, 0. We denote this as A \succ (\prec) 0. The set of such matrices is denoted by \mathbb{S}^n_{++} (\mathbb{S}^n_{--})
A matrix A \in \mathbb{S}^n is called positive (negative) semidefinite if for all x : x^T Ax \geq (\leq) 0. We denote this as A \succeq (\preceq) 0. The set of such matrices is denoted as \mathbb{S}^n_{+} (\mathbb{S}^n_{-})
1.2 Matrix and vector product
Let A be a matrix of size m \times n and B a matrix of size n \times p, and consider the product C = AB. Then C is an m \times p matrix, with element (i, j) given by: c_{ij} = \sum_{k=1}^n a_{ik}b_{kj}.
In naive form this operation requires \mathcal{O}(n^3) arithmetic operations, where n is taken to be the largest dimension of the matrices.
Let A be a matrix of shape m \times n and let x be an n \times 1 vector. Then the i-th component of the product z = Ax is given by: z_i = \sum_{k=1}^n a_{ik}x_k
In naive form this operation requires \mathcal{O}(n^2) arithmetic operations, where n is taken to be the largest dimension of the matrices.
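As a quick sanity check, the entrywise formulas above can be implemented directly and compared with NumPy's built-in product; this is only a minimal sketch, and the helper names `naive_matmul` and `naive_matvec` are ours, chosen for illustration.

```python
import numpy as np

def naive_matmul(A, B):
    """C = AB via the entrywise formula c_ij = sum_k a_ik * b_kj."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def naive_matvec(A, x):
    """z = Ax via the entrywise formula z_i = sum_k a_ik * x_k."""
    m, n = A.shape
    z = np.zeros(m)
    for i in range(m):
        for k in range(n):
            z[i] += A[i, k] * x[k]
    return z

rng = np.random.default_rng(0)
A, B, x = rng.standard_normal((3, 4)), rng.standard_normal((4, 2)), rng.standard_normal(4)
assert np.allclose(naive_matmul(A, B), A @ B)
assert np.allclose(naive_matvec(A, x), A @ x)
```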
Remember that:
- C = AB \quad C^T = B^T A^T
- AB \neq BA in general
- e^{A} =\sum\limits_{k=0}^{\infty }{1 \over k!}A^{k}
- e^{A+B} \neq e^{A} e^{B} (but if A and B are commuting matrices, which means that AB = BA, e^{A+B} = e^{A} e^{B})
- \langle x, Ay\rangle = \langle A^T x, y\rangle
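A small numerical illustration of these identities, assuming SciPy is available for the matrix exponential (`scipy.linalg.expm`); random Gaussian matrices generically do not commute, while a matrix always commutes with itself.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
x, y = rng.standard_normal(3), rng.standard_normal(3)

# (AB)^T = B^T A^T, but AB != BA in general
assert np.allclose((A @ B).T, B.T @ A.T)
print(np.allclose(A @ B, B @ A))                    # False for generic A, B

# e^{A+B} != e^A e^B for non-commuting A, B ...
print(np.allclose(expm(A + B), expm(A) @ expm(B)))  # False in general
# ... but equality holds when the matrices commute, e.g. B = A
assert np.allclose(expm(A + A), expm(A) @ expm(A))

# <x, Ay> = <A^T x, y>
assert np.allclose(x @ (A @ y), (A.T @ x) @ y)
```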
1.3 Norms and scalar products
A norm is a quantitative measure of the size of a vector and is typically denoted by \Vert x \Vert.
A norm must satisfy certain properties:
- \Vert \alpha x \Vert = \vert \alpha\vert \Vert x \Vert, \alpha \in \mathbb{R}
- \Vert x + y \Vert \leq \Vert x \Vert + \Vert y \Vert (triangle inequality)
- If \Vert x \Vert = 0 then x = 0
The distance between two vectors is then defined as d(x, y) = \Vert x - y \Vert. The most well-known and widely used norm is the Euclidean norm: \Vert x \Vert_2 = \sqrt{\sum_{i=1}^n |x_i|^2}, which corresponds to the ordinary notion of distance in physical space. If the vectors have complex elements, we use their modulus.
The Euclidean norm, or 2-norm, is a special case of an important class of norms, the p-norms:
\Vert x \Vert_p = \Big(\sum_{i=1}^n |x_i|^p\Big)^{1/p}

There are two very important special cases. The infinity norm, or Chebyshev norm, is the maximal absolute value of the elements:

\Vert x \Vert_{\infty} = \max_i | x_i|

The L_1 norm (or Manhattan distance) is the sum of the absolute values of the elements of x:

\Vert x \Vert_1 = \sum_i |x_i|
The L_1 norm plays a very important role in compressed sensing, an area that emerged in the mid-2000s as one of the most popular research topics.
In some sense there is no big difference between matrices and vectors (one can always vectorize a matrix), which brings us to the simplest matrix norm, the Frobenius norm: \Vert A \Vert_F = \left(\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2\right)^{1/2}

The spectral norm \Vert A \Vert_2 is one of the most widely used matrix norms (along with the Frobenius norm):

\Vert A \Vert_2 = \sup_{x \ne 0} \frac{\Vert A x \Vert_2}{\Vert x \Vert_{2}}

Unlike the Frobenius norm, it cannot be computed directly from the entries by a simple formula; however, there are efficient algorithms to compute it. It is directly related to the singular value decomposition (SVD) of the matrix. It holds that
\Vert A \Vert_2 = \sigma_1(A) = \sqrt{\lambda_{\max}(A^TA)}
where \sigma_1(A) is the largest singular value of the matrix A.
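These norms are easy to check with `numpy.linalg.norm`; the snippet below is a small sketch on random data verifying, in particular, that the spectral norm equals the largest singular value \sqrt{\lambda_{\max}(A^TA)}.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(5)
A = rng.standard_normal((4, 5))

# p-norms of a vector
assert np.isclose(np.linalg.norm(x, 2), np.sqrt(np.sum(np.abs(x) ** 2)))
assert np.isclose(np.linalg.norm(x, 1), np.sum(np.abs(x)))
assert np.isclose(np.linalg.norm(x, np.inf), np.max(np.abs(x)))

# Frobenius and spectral norms of a matrix
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.sum(A ** 2)))
sigma = np.linalg.svd(A, compute_uv=False)        # singular values, descending
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))     # largest eigenvalue of A^T A
assert np.isclose(np.linalg.norm(A, 2), sigma[0])
assert np.isclose(np.linalg.norm(A, 2), np.sqrt(lam_max))
```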
The standard scalar (inner) product between vectors x and y from \mathbb{R}^n is given by \langle x, y \rangle = x^T y = \sum\limits_{i=1}^n x_i y_i = y^T x = \langle y, x \rangle
Here x_i and y_i are the scalar i-th components of corresponding vectors.
The standard scalar (inner) product between matrices X and Y from \mathbb{R}^{m \times n} is given by
\langle X, Y \rangle = \text{tr}(X^T Y) = \sum\limits_{i=1}^m\sum\limits_{j=1}^n X_{ij} Y_{ij} = \text{tr}(Y^T X) = \langle Y, X \rangle
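A short numerical check of the matrix inner product identity, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
X, Y = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))

# tr(X^T Y) = sum_ij X_ij Y_ij = tr(Y^T X)
assert np.isclose(np.trace(X.T @ Y), np.sum(X * Y))
assert np.isclose(np.trace(X.T @ Y), np.trace(Y.T @ X))
```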
1.4 Eigenvalues, eigenvectors, and the singular-value decomposition
1.4.1 Eigenvalues
A scalar value \lambda is an eigenvalue of the n \times n matrix A if there is a nonzero vector q such that Aq = \lambda q.
The vector q is called an eigenvector of A. The matrix A is nonsingular if none of its eigenvalues are zero. The eigenvalues of a symmetric matrix are all real, while a nonsymmetric matrix may have complex eigenvalues. If the matrix is symmetric and positive definite, its eigenvalues are all positive real numbers.
Suppose A \in \mathbb{S}^n, i.e., A is a real symmetric n \times n matrix. Then A can be factorized as
A = Q\Lambda Q^T,
where Q \in \mathbb{R}^{n \times n} is orthogonal, i.e., satisfies Q^T Q = I, and \Lambda = \text{diag}(\lambda_1, \ldots , \lambda_n). The (real) numbers \lambda_i are the eigenvalues of A and are the roots of the characteristic polynomial \text{det}(A - \lambda I). The columns of Q form an orthonormal set of eigenvectors of A. The factorization is called the spectral decomposition or (symmetric) eigenvalue decomposition of A.²
We usually order the eigenvalues as \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n. We use the notation \lambda_i(A) to refer to the i-th largest eigenvalue of A \in \mathbb{S}^n. We usually write the largest or maximum eigenvalue as \lambda_1(A) = \lambda_{\text{max}}(A), and the smallest or minimum eigenvalue as \lambda_n(A) = \lambda_{\text{min}}(A).
The largest and smallest eigenvalues satisfy
\lambda_{\text{min}} (A) = \inf_{x \neq 0} \dfrac{x^T Ax}{x^T x}, \qquad \lambda_{\text{max}} (A) = \sup_{x \neq 0} \dfrac{x^T Ax}{x^T x}
and consequently \forall x \in \mathbb{R}^n (Rayleigh quotient):
\lambda_{\text{min}} (A) x^T x \leq x^T Ax \leq \lambda_{\text{max}} (A) x^T x
The condition number of a nonsingular matrix is defined as
\kappa(A) = \|A\|\|A^{-1}\|
If we use the spectral matrix norm, we get:
\kappa(A) = \dfrac{\sigma_{\text{max}}(A)}{\sigma _{\text{min}}(A)}
If, moreover, A \in \mathbb{S}^n_{++}: \kappa(A) = \dfrac{\lambda_{\text{max}}(A)}{\lambda_{\text{min}}(A)}
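A minimal numerical sketch of the spectral decomposition, the Rayleigh-quotient bounds, and the condition number, using a symmetric positive definite matrix constructed here as A = B^T B + I purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)                      # symmetric positive definite

lam, Q = np.linalg.eigh(A)                   # eigenvalues in ascending order, Q orthogonal
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)
assert np.allclose(Q.T @ Q, np.eye(5))

# Rayleigh quotient bounds: lam_min <= x^T A x / x^T x <= lam_max
x = rng.standard_normal(5)
rq = x @ A @ x / (x @ x)
assert lam[0] - 1e-12 <= rq <= lam[-1] + 1e-12

# Condition number: for A in S^n_++, kappa(A) = lam_max / lam_min
assert np.isclose(np.linalg.cond(A, 2), lam[-1] / lam[0])
```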
1.4.2 Singular value decomposition
Suppose A \in \mathbb{R}^{m \times n} with rank A = r. Then A can be factored as
A = U \Sigma V^T , \quad
where U \in \mathbb{R}^{m \times r} satisfies U^T U = I, V \in \mathbb{R}^{n \times r} satisfies V^T V = I, and \Sigma is a diagonal matrix with \Sigma = \text{diag}(\sigma_1, ..., \sigma_r), such that
\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0.
This factorization is called the singular value decomposition (SVD) of A. The columns of U are called left singular vectors of A, the columns of V are right singular vectors, and the numbers \sigma_i are the singular values. The singular value decomposition can be written as
A = \sum_{i=1}^{r} \sigma_i u_i v_i^T,
where u_i \in \mathbb{R}^m are the left singular vectors, and v_i \in \mathbb{R}^n are the right singular vectors.
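The SVD and its rank-1 expansion can be reproduced with `numpy.linalg.svd`; a small sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
r = np.linalg.matrix_rank(A)

# A = sum_i sigma_i u_i v_i^T
A_rebuilt = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
assert np.allclose(A_rebuilt, A)
assert np.all(sigma[:-1] >= sigma[1:])                  # sigma_1 >= ... >= sigma_r
```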
1.4.3 Skeleton decomposition
A simple yet very interesting decomposition is the skeleton decomposition, which can be written in two forms:

A = U V^T \qquad A = \hat{C}\hat{A}^{-1}\hat{R}

The latter expression refers to a fun fact: for a rank-r matrix you can choose any r linearly independent columns \hat{C} and any r linearly independent rows \hat{R}, store only them together with the r \times r submatrix \hat{A} at their intersection, and reconstruct the whole matrix exactly.
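A sketch of the second form, assuming for simplicity that we take the first r columns and rows of a rank-r matrix (for a generic low-rank matrix built from Gaussian factors, the corresponding r \times r submatrix \hat{A} is nonsingular):

```python
import numpy as np

rng = np.random.default_rng(6)
r, m, n = 3, 8, 10
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix

I, J = np.arange(r), np.arange(r)          # indices of the chosen rows / columns
C_hat = A[:, J]                            # m x r: chosen columns
R_hat = A[I, :]                            # r x n: chosen rows
A_hat = A[np.ix_(I, J)]                    # r x r: their intersection

# Skeleton reconstruction: A = C_hat A_hat^{-1} R_hat
assert np.allclose(C_hat @ np.linalg.inv(A_hat) @ R_hat, A)
```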
Use cases for Skeleton decomposition are:
- Model reduction, data compression, and speedup of computations in numerical analysis: given a rank-r matrix with r \ll n, m, one needs to store only \mathcal{O}((n + m)r) \ll nm elements.
- Feature extraction in machine learning, where it is also known as matrix factorization
- All applications where SVD applies, since Skeleton decomposition can be transformed into truncated SVD form.
1.5 Canonical tensor decomposition
One can consider the generalization of the skeleton decomposition to higher-order data structures, such as tensors, which means representing a tensor as a sum of r primitive (rank-1) tensors.
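For illustration only, the sketch below builds a 3-way tensor as a sum of r rank-1 (primitive) tensors a_i \circ b_i \circ c_i, which is exactly the form a canonical (CP) decomposition recovers; computing such a decomposition for a given tensor requires specialized algorithms not shown here.

```python
import numpy as np

rng = np.random.default_rng(7)
r, shape = 3, (4, 5, 6)
a, b, c = (rng.standard_normal((r, s)) for s in shape)

# T = sum_{i=1}^r  a_i (outer) b_i (outer) c_i
T = np.einsum("ip,iq,is->pqs", a, b, c)
assert T.shape == shape
```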
1.6 Determinant and trace
The determinant and trace can be expressed in terms of the eigenvalues
\text{det} A = \prod\limits_{i=1}^n \lambda_i, \qquad \text{tr} A = \sum\limits_{i=1}^n \lambda_i
The determinant has several appealing (and revealing) properties. For instance,
- \text{det} A = 0 if and only if A is singular;
- \text{det} AB = (\text{det} A)(\text{det} B);
- \text{det} A^{-1} = \frac{1}{\text{det} \ A}.
Don’t forget about the cyclic property of the trace for arbitrary matrices A, B, C, D (assuming all dimensions are consistent):
\text{tr} (ABCD) = \text{tr} (DABC) = \text{tr} (CDAB) = \text{tr} (BCDA)
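A small numerical check of these identities on random matrices (the eigenvalues of a real matrix may be complex, so we compare after discarding the negligible imaginary parts):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 4))
lam = np.linalg.eigvals(A)

# det A = product of eigenvalues, tr A = sum of eigenvalues
assert np.isclose(np.linalg.det(A), np.prod(lam).real)
assert np.isclose(np.trace(A), np.sum(lam).real)

# Cyclic property of the trace
B, C, D = (rng.standard_normal((4, 4)) for _ in range(3))
t = np.trace(A @ B @ C @ D)
assert np.isclose(t, np.trace(D @ A @ B @ C))
assert np.isclose(t, np.trace(C @ D @ A @ B))
assert np.isclose(t, np.trace(B @ C @ D @ A))
```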
2 Optimization bingo
2.1 Gradient
Let f(x): \mathbb{R}^n \to \mathbb{R}. Then the vector containing all first-order partial derivatives:
\nabla f(x) = \dfrac{df}{dx} = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}
is called the gradient of f(x). This vector indicates the direction of the steepest ascent, so the vector -\nabla f(x) gives the direction of the steepest descent of the function at the point. Moreover, the gradient vector is always orthogonal to the contour line at the point.
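As an illustration, for the quadratic f(x) = \tfrac{1}{2}x^T A x - b^T x with symmetric A, the gradient is \nabla f(x) = Ax - b. The sketch below checks this against a central finite-difference approximation; the helper `numerical_grad` is ours, not part of any library.

```python
import numpy as np

def numerical_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(9)
B = rng.standard_normal((4, 4))
A = B + B.T                                   # symmetric
b, x = rng.standard_normal(4), rng.standard_normal(4)

f = lambda v: 0.5 * v @ A @ v - b @ v
assert np.allclose(numerical_grad(f, x), A @ x - b, atol=1e-4)
```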
2.2 Hessian
Let f(x): \mathbb{R}^n \to \mathbb{R}. Then the matrix containing all second-order partial derivatives:
f''(x) = \nabla^2 f(x) = \dfrac{\partial^2 f}{\partial x_i \partial x_j} = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{pmatrix}
is called the Hessian of f(x). In fact, the Hessian generalizes to a tensor: for f(x): \mathbb{R}^n \to \mathbb{R}^m it is a 3d tensor whose slices are the Hessians of the corresponding scalar components \left( \nabla^2f_1(x), \ldots, \nabla^2f_m(x)\right).
This matrix provides information about the curvature of the function in different directions.
2.3 Jacobian
The extension of the gradient to a multidimensional function f(x): \mathbb{R}^n \to \mathbb{R}^m is the following matrix:
J_f = f'(x) = \dfrac{df}{dx^T} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_1} \\ \frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \dots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}
This matrix provides information about the rate of change of the function with respect to its inputs.
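A sketch that assembles this matrix from central finite differences for f(x) = Ax, stored in the n \times m layout used above (rows indexed by x, columns by the components of f); the helper name `numerical_jacobian` is ours, for illustration.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """n x m matrix with entry (i, j) = d f_j / d x_i, matching the layout above."""
    m = f(x).size
    J = np.zeros((x.size, m))
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        J[i, :] = (f(x + e) - f(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(10)
A = rng.standard_normal((3, 5))              # f: R^5 -> R^3, f(x) = Ax
x = rng.standard_normal(5)

J = numerical_jacobian(lambda v: A @ v, x)   # shape (5, 3)
assert np.allclose(J, A.T, atol=1e-4)        # the Jacobian of Ax is A; stored transposed here
```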
2.4 Summary
f(x) : X \to Y; \qquad \frac{\partial f(x)}{\partial x} \in G
| X | Y | G | Name |
|---|---|---|------|
| \mathbb{R} | \mathbb{R} | \mathbb{R} | f'(x) (derivative) |
| \mathbb{R}^n | \mathbb{R} | \mathbb{R}^n | \dfrac{\partial f}{\partial x_i} (gradient) |
| \mathbb{R}^n | \mathbb{R}^m | \mathbb{R}^{n \times m} | \dfrac{\partial f_i}{\partial x_j} (Jacobian) |
| \mathbb{R}^{m \times n} | \mathbb{R} | \mathbb{R}^{m \times n} | \dfrac{\partial f}{\partial x_{ij}} |
2.5 Taylor approximations
Taylor approximations provide a way to approximate functions locally by polynomials. The idea is that for a smooth function, we can approximate it by its tangent (for the first order) or by its parabola (for the second order) at a point.
2.5.1 First-order Taylor approximation
The first-order Taylor approximation, also known as the linear approximation, is centered around some point x_0. If f: \mathbb{R}^n \rightarrow \mathbb{R} is a differentiable function, then its first-order Taylor approximation is given by:
f_{x_0}^I(x) = f(x_0) + \nabla f(x_0)^T (x - x_0)
Where:
- f(x_0) is the value of the function at the point x_0.
- \nabla f(x_0) is the gradient of the function at the point x_0.
It is very common to replace f(x) with f_{x_0}^I(x) near the point x_0 for the simple analysis of some methods.
2.5.2 Second-order Taylor approximation
The second-order Taylor approximation, also known as the quadratic approximation, includes the curvature of the function. For a twice-differentiable function f: \mathbb{R}^n \rightarrow \mathbb{R}, its second-order Taylor approximation centered at some point x_0 is:
f_{x_0}^{II}(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0)
Where:
- \nabla^2 f(x_0) is the Hessian matrix of f at the point x_0.
When the linear approximation of the function is not sufficient, one can replace f(x) with f_{x_0}^{II}(x) near the point x_0. In general, Taylor approximations give us a way to locally approximate functions. The first-order approximation is a plane tangent to the function at the point x_0, while the second-order approximation also includes the curvature and is represented by a quadratic (parabolic) model. These approximations are especially useful in optimization and numerical methods because they provide a tractable way to work with complex functions.
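For illustration, take f(x) = e^{x_1} + x_1 x_2^2, whose gradient and Hessian are \nabla f(x) = \begin{pmatrix} e^{x_1} + x_2^2 \\ 2x_1 x_2 \end{pmatrix} and \nabla^2 f(x) = \begin{pmatrix} e^{x_1} & 2x_2 \\ 2x_2 & 2x_1 \end{pmatrix}. The sketch below evaluates both Taylor models near a point x_0 chosen by us and shows the quadratic one tracking the function more closely.

```python
import numpy as np

f    = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2 * x[1]], [2 * x[1], 2 * x[0]]])

x0 = np.array([0.5, -1.0])

def taylor1(x):  # first-order (linear) model around x0
    return f(x0) + grad(x0) @ (x - x0)

def taylor2(x):  # second-order (quadratic) model around x0
    d = x - x0
    return taylor1(x) + 0.5 * d @ hess(x0) @ d

x = x0 + np.array([0.1, -0.1])
err1, err2 = abs(f(x) - taylor1(x)), abs(f(x) - taylor2(x))
print(err1, err2)            # the quadratic model is noticeably more accurate
assert err2 < err1
```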
Note that even the second-order approximation can become inaccurate very quickly as we move away from the point x_0.
3 Derivatives
3.1 Naive approach
The basic idea of the naive approach is to reduce matrix/vector derivatives to the well-known scalar derivatives. One of the most important practical tricks here is to separate the indices of the sum (i) from the index of the partial derivative (k). Ignoring this simple rule tends to produce mistakes.
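As a short worked example of this index discipline, consider f(x) = x^T A x = \sum\limits_{i=1}^n\sum\limits_{j=1}^n x_i A_{ij} x_j. Differentiating with respect to x_k, where k is distinct from the summation indices i and j, gives

\dfrac{\partial f}{\partial x_k} = \sum\limits_{j=1}^n A_{kj} x_j + \sum\limits_{i=1}^n x_i A_{ik} = \left(Ax + A^Tx\right)_k,

so \nabla f(x) = (A + A^T)x, which reduces to 2Ax for symmetric A. Reusing i or j as the differentiation index is exactly the kind of slip this rule prevents.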
3.2 Differential approach
The guru approach consists of formulating a set of simple rules that allow you to calculate derivatives just as in the scalar case. It is convenient to use differential notation here.³
3.2.1 Differentials
After expressing the differential df in the canonical form below, we can read off the gradient from it:
df(x) = \langle \nabla f(x), dx\rangle
Then, if we have a differential of the above form and need to calculate the second derivative of the matrix/vector function, we treat the “old” dx as a constant dx_1 and calculate d(df) = d^2f(x):
d^2f(x) = \langle \nabla^2 f(x) dx_1, dx\rangle = \langle H_f(x) dx_1, dx\rangle
3.2.2 Properties
Let A and B be constant matrices, while X and Y are variables (or matrix functions). A few of these identities are checked numerically in the sketch after the list.
- dA = 0
- d(\alpha X) = \alpha (dX)
- d(AXB) = A(dX )B
- d(X+Y) = dX + dY
- d(X^T) = (dX)^T
- d(XY) = (dX)Y + X(dY)
- d\langle X, Y\rangle = \langle dX, Y\rangle+ \langle X, dY\rangle
- d\left( \dfrac{X}{\phi}\right) = \dfrac{\phi dX - (d\phi) X}{\phi^2}
- d\left( \det X \right) = \det X \langle X^{-T}, dX \rangle
- d\left(\text{tr } X \right) = \langle I, dX\rangle
- df(g(x)) = \dfrac{df}{dg} \cdot dg(x)
- H = (J(\nabla f))^T
- d(X^{-1})=-X^{-1}(dX)X^{-1}
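A numerical spot-check of three of the identities above, approximating the differential by a finite perturbation dX of small norm (so the equalities hold only up to higher-order terms); the matrix X is shifted by a multiple of the identity purely to keep it well conditioned for this illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, eps = 4, 1e-6
X = rng.standard_normal((n, n)) + n * np.eye(n)      # well-conditioned, invertible
dX = eps * rng.standard_normal((n, n))               # "infinitesimal" perturbation

inner = lambda A, B: np.trace(A.T @ B)               # <A, B> = tr(A^T B)

# d(det X) = det X * <X^{-T}, dX>
lhs = np.linalg.det(X + dX) - np.linalg.det(X)
rhs = np.linalg.det(X) * inner(np.linalg.inv(X).T, dX)
assert np.isclose(lhs, rhs, rtol=1e-3)

# d(tr X) = <I, dX>
assert np.isclose(np.trace(X + dX) - np.trace(X), inner(np.eye(n), dX))

# d(X^{-1}) = -X^{-1} dX X^{-1}
lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)
assert np.allclose(lhs, rhs, atol=1e-9)
```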
4 References
- Convex Optimization book by S. Boyd and L. Vandenberghe - Appendix A. Mathematical background.
- Numerical Optimization by J. Nocedal and S. J. Wright - Background Material.
- Matrix decompositions Cheat Sheet.
- Good introduction
- The Matrix Cookbook
- MSU seminars (Rus.)
- Online tool for analytic expression of a derivative.
- Determinant derivative
- Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares - book by Stephen Boyd & Lieven Vandenberghe.
- Numerical Linear Algebra course at Skoltech
Footnotes
1. A full introduction to applied linear algebra can be found in Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares, a book by Stephen Boyd and Lieven Vandenberghe, which is listed in the references. A useful refresher on linear algebra is also given in Appendix A of the book Numerical Optimization by Jorge Nocedal and Stephen J. Wright.↩︎
2. A good cheat sheet with matrix decompositions is available at the NLA course website.↩︎
3. The most comprehensive and intuitive guide to the theory of taking matrix derivatives is presented in these notes by Dmitry Kropotov's team.↩︎