In a least-squares, or linear regression, problem, we have measurements $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^{m}$ and seek a vector $\theta \in \mathbb{R}^{n}$ such that $X\theta$ is close to $y$. Closeness is defined as the sum of the squared differences:

$$
\sum_{i=1}^{m} (x_i^\top \theta - y_i)^2,
$$

also known as the $\ell_2$-norm squared, $\|X\theta - y\|_2^2$.

For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $x_i^\top$ of $X$ contains the features for user $i$, while the corresponding entry $y_i$ of $y$ is the measurement we want to predict from $x_i^\top$, such as ad spending. The prediction is given by $x_i^\top \theta$.

We find the optimal $\theta$ by solving the optimization problem

$$
\min_{\theta} \|X\theta - y\|_2^2.
$$

Let $\theta^*$ denote the optimal $\theta$. The quantity $r = X\theta^* - y$ is known as the residual. If $\|r\|_2 = 0$, we have a perfect fit.

Note that the fitted function needn't be linear in its argument $x$, but only in the parameters $\theta$ that are to be determined in the best fit.
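As a quick sanity check, the problem can be set up and solved numerically. The sketch below uses NumPy's built-in least-squares routine on synthetic data (the matrix sizes, seed, and true parameters are arbitrary assumptions for illustration):

```python
import numpy as np

# Hypothetical small dataset: m = 5 users, n = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
theta_true = np.array([1.0, -2.0])
y = X @ theta_true  # measurements generated exactly, so a perfect fit exists

# Solve min_theta ||X theta - y||_2^2.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual r = X theta* - y vanishes here, since y lies in the range of X.
residual = X @ theta_star - y
```

Because $y$ was generated from $X$ without noise, the recovered `theta_star` matches `theta_true` and the residual is zero up to machine precision.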


Moore–Penrose inverse

If the matrix $X$ is relatively small, we can write down and calculate the exact solution:

$$
\theta^* = (X^\top X)^{-1} X^\top y = X^\dagger y,
$$

where $X^\dagger = (X^\top X)^{-1} X^\top$ is called the pseudo-inverse of $X$. However, this approach squares the condition number of the problem, which could be an obstacle for ill-conditioned, huge-scale problems.
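A minimal sketch of the direct solution, assuming a small, full-column-rank $X$ (the data here is synthetic). It also checks the condition-number squaring mentioned above, since $\mathrm{cond}(X^\top X) = \mathrm{cond}(X)^2$ in the 2-norm:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))  # tall matrix, full column rank
y = rng.normal(size=10)

# Exact solution via the normal equations: theta* = (X^T X)^{-1} X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same solution via the Moore-Penrose pseudo-inverse.
theta_pinv = np.linalg.pinv(X) @ y

# Forming X^T X squares the condition number of the problem.
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
```

For an ill-conditioned $X$ this squaring is exactly what makes the normal-equations route numerically risky.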

QR decomposition

For any matrix $X \in \mathbb{R}^{m \times n}$ there exists a QR decomposition:

$$
X = QR,
$$

where $Q \in \mathbb{R}^{m \times n}$ is an orthogonal matrix (its columns are orthogonal unit vectors, meaning $Q^\top Q = I$) and $R \in \mathbb{R}^{n \times n}$ is an upper triangular matrix. It is important to notice that, substituting $X = QR$ into the normal equations $X^\top X\theta = X^\top y$ and using $Q^\top Q = I$, we have:

$$
R^\top Q^\top Q R\,\theta = R^\top Q^\top y \quad \Longrightarrow \quad R\theta^* = Q^\top y.
$$

Now, the process of finding $\theta^*$ consists of two steps:

  1. Find the QR decomposition of $X$.
  2. Solve the triangular system $R\theta^* = Q^\top y$, which is easy because $R$ is upper triangular.
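The two steps above can be sketched as follows (synthetic data; SciPy's triangular solver is assumed to be available for the back substitution):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)

# Step 1: thin QR decomposition X = QR
# (Q is 8x3 with orthonormal columns, R is 3x3 upper triangular).
Q, R = np.linalg.qr(X)

# Step 2: solve R theta = Q^T y by back substitution.
theta_qr = solve_triangular(R, Q.T @ y, lower=False)

# Agrees with the reference least-squares solution.
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Unlike the normal-equations route, this never forms $X^\top X$, so the conditioning of the triangular solve is governed by $\mathrm{cond}(X)$ rather than its square.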

Cholesky decomposition

For any positive definite matrix $X^\top X$ (which holds when $X$ has full column rank) there exists a Cholesky decomposition:

$$
X^\top X = LL^\top,
$$

where $L$ is a lower triangular matrix. The normal equations $X^\top X\theta = X^\top y$ then become:

$$
LL^\top \theta^* = X^\top y.
$$

Now, the process of finding $\theta^*$ consists of three steps:

  1. Find the Cholesky decomposition of $X^\top X$.
  2. Find $z = L^\top \theta^*$ by solving the triangular system $Lz = X^\top y$.
  3. Find $\theta^*$ by solving the triangular system $L^\top \theta^* = z$.
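The three steps above can be sketched as follows (synthetic data; SciPy's triangular solver is again an assumption about available tooling):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)

# Step 1: Cholesky decomposition X^T X = L L^T (L lower triangular).
L = np.linalg.cholesky(X.T @ X)

# Step 2: forward substitution for z in L z = X^T y.
z = solve_triangular(L, X.T @ y, lower=True)

# Step 3: back substitution for theta in L^T theta = z.
theta_chol = solve_triangular(L.T, z, lower=False)

# Agrees with the reference least-squares solution.
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that this route still forms $X^\top X$ explicitly, which is why the error bound below involves the squared condition number.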

Note that in this case the error is still proportional to the squared condition number of $X$, since the method works with $X^\top X$.

