# Problem

In a least-squares, or linear regression, problem, we have measurements $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^{m}$ and seek a vector $\theta \in \mathbb{R}^{n}$ such that $X \theta$ is close to $y$. Closeness is defined as the sum of the squared differences:

$$\sum_{i=1}^{m} (x_i^\top \theta - y_i)^2,$$

also known as the squared $\ell_2$-norm, $\|X \theta - y\|^2_2$.

For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $x_i^\top$ of $X$ is the features for user $i$, while the corresponding entry $y_i$ of $y$ is the measurement we want to predict from $x_i^\top$, such as ad spending. The prediction is given by $x_i^\top \theta$.

We find the optimal $\theta$ by solving the optimization problem

$$\min_{\theta \in \mathbb{R}^{n}} \|X \theta - y\|^2_2$$

Let $\theta^*$ denote the optimal $\theta$. The quantity $r = X \theta^* - y$ is known as the residual. If $\|r\|_2 = 0$, we have a perfect fit.

Note that the model needn't be linear in the argument $x$, only in the parameters $\theta$ that are to be determined in the best fit.
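As a quick illustration of this point (a minimal sketch assuming NumPy; the data and noise level are made up), a quadratic model is nonlinear in $x$ but linear in $\theta$, so fitting it is still an ordinary least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model y ≈ θ0 + θ1·x + θ2·x² — nonlinear in x, but linear in θ.
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.01 * rng.standard_normal(50)

# Design matrix with columns [1, x, x²]; each row is x_iᵀ.
X = np.column_stack([np.ones_like(x), x, x**2])

# np.linalg.lstsq minimizes ||Xθ - y||₂².
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # close to [1, 2, -3]
```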

# Approaches

## Moore–Penrose inverse

If the matrix $X$ is relatively small, we can write down and compute the exact solution:

$$\theta^* = (X^\top X)^{-1} X^\top y = X^\dagger y,$$

where $X^\dagger$ is the Moore–Penrose pseudo-inverse of $X$. However, this approach squares the condition number of the problem, which can be an obstacle for ill-conditioned, large-scale problems.
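A minimal NumPy sketch of this closed-form solution (the random data is purely illustrative). Note that `np.linalg.pinv` computes $X^\dagger$ via the SVD, while the normal-equations form on the right makes the $X^\top X$ product, and hence the squared conditioning, explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# θ* = X† y, using the Moore–Penrose pseudo-inverse (SVD-based).
theta_pinv = np.linalg.pinv(X) @ y

# Equivalent solution via the normal equations (XᵀX)θ = Xᵀy;
# forming XᵀX squares the condition number of X.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(theta_pinv, theta_normal)
```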

## QR decomposition

For any matrix $X \in \mathbb{R}^{m \times n}$ there exists a QR decomposition:

$$X = QR,$$

where $Q$ is an orthogonal matrix (its columns are orthonormal unit vectors, meaning $Q^\top Q = QQ^\top = I$) and $R$ is an upper triangular matrix. It is important to notice that, since $Q^{-1} = Q^\top$, we have:

$$R \theta = Q^\top y$$
Now, the process of finding $\theta$ consists of two steps:

1. Find the QR decomposition of $X$.
2. Solve $R \theta = Q^\top y$; the system is triangular and, therefore, easy to solve.
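The two steps above can be sketched in NumPy (random data for illustration); `np.linalg.qr` returns the thin factorization by default, and the back substitution is written out explicitly to show that solving the triangular system is cheap:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Step 1: thin QR decomposition — Q is m×n with orthonormal columns,
# R is n×n upper triangular.
Q, R = np.linalg.qr(X)

# Step 2: solve R θ = Qᵀ y by back substitution.
rhs = Q.T @ y
theta = np.zeros(n)
for i in range(n - 1, -1, -1):
    theta[i] = (rhs[i] - R[i, i + 1:] @ theta[i + 1:]) / R[i, i]

# The residual X θ - y is orthogonal to the columns of X.
assert np.allclose(X.T @ (X @ theta - y), 0.0)
```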

## Cholesky decomposition

For any positive definite matrix $A \in \mathbb{R}^{n \times n}$ there exists a Cholesky decomposition:

$$A = L L^\top,$$

where $L$ is a lower triangular matrix. Applying it to the normal equations $X^\top X \theta = X^\top y$, we have:

$$L L^\top \theta = X^\top y$$
Now, the process of finding $\theta$ consists of three steps:

1. Find the Cholesky decomposition $L L^\top$ of $X^\top X$.
2. Find $z_\theta = L^\top \theta$ by solving the triangular system $L z_\theta = X^\top y$.
3. Find $\theta$ by solving the triangular system $L^\top \theta = z_\theta$.
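A NumPy sketch of the three steps (random data for illustration); `np.linalg.cholesky` returns the lower triangular factor, and the forward and back substitutions are written out explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Step 1: Cholesky factor of the Gram matrix, XᵀX = L Lᵀ (L lower triangular).
L = np.linalg.cholesky(X.T @ X)
b = X.T @ y

# Step 2: forward substitution for z in L z = Xᵀ y  (here z = Lᵀ θ).
z = np.zeros(n)
for i in range(n):
    z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]

# Step 3: back substitution for θ in Lᵀ θ = z.
theta = np.zeros(n)
for i in range(n - 1, -1, -1):
    theta[i] = (z[i] - L[i + 1:, i] @ theta[i + 1:]) / L[i, i]

assert np.allclose(theta, np.linalg.lstsq(X, y, rcond=None)[0])
```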

Note that in this case the error is still proportional to the squared condition number.