# Problem

In a least-squares, or linear regression, problem, we have measurements $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^{m}$ and seek a vector $\theta \in \mathbb{R}^{n}$ such that $X\theta$ is close to $y$. Closeness is defined as the sum of the squared differences:

$$
\sum_{i=1}^m (x_i^\top \theta - y_i)^2,
$$

also known as the $\ell_2$-norm squared, $\|X\theta - y\|_2^2$.

For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $x_i^\top$ of $X$ contains the features for user $i$, while the corresponding entry $y_i$ of $y$ is the measurement we want to predict from $x_i^\top$, such as ad spending. The prediction is given by $x_i^\top \theta$.

We find the optimal $\theta$ by solving the optimization problem

$$
\min_{\theta} \|X\theta - y\|_2^2.
$$

Let $\theta^*$ denote the optimal $\theta$. The quantity $\|X\theta^* - y\|_2$ is known as the residual. If $\|X\theta^* - y\|_2 = 0$, we have a perfect fit.

Note that the model need not be linear in its argument $x$, only in the parameters $\theta$ that are to be determined in the best fit. For instance, $\theta_1 \sin(x) + \theta_2 \log(x)$ is a nonlinear function of $x$ but still a linear least-squares model in $\theta$.
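As a concrete sketch, the problem above can be solved directly with NumPy's `np.linalg.lstsq`, which minimizes $\|X\theta - y\|_2^2$. The data here is synthetic; the sizes, the true parameter vector, and the noise level are illustrative choices, not from the text.

```python
import numpy as np

# A small synthetic problem: m = 100 users, n = 3 features.
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.standard_normal((m, n))
theta_true = np.array([1.0, -2.0, 0.5])        # ground truth (illustrative)
y = X @ theta_true + 0.01 * rng.standard_normal(m)  # measurements with noise

# np.linalg.lstsq minimizes ||X @ theta - y||_2^2 directly.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual ||X @ theta - y||_2 measures the quality of the fit.
residual = np.linalg.norm(X @ theta - y)
```

With low noise, the recovered `theta` is close to `theta_true` and the residual is small.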

# Approaches

## Moore–Penrose inverse

If the matrix $X$ is relatively small, we can write down and calculate the exact solution:

$$
\theta^* = (X^\top X)^{-1} X^\top y = X^\dagger y,
$$

where $X^\dagger = (X^\top X)^{-1} X^\top$ is called the pseudo-inverse (Moore–Penrose inverse) of $X$. However, forming $X^\top X$ squares the condition number of the problem, which can be an obstacle for ill-conditioned, huge-scale problems.
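A minimal NumPy sketch of this approach on synthetic data. Solving the normal equations with `np.linalg.solve` corresponds to the formula above; `np.linalg.pinv` computes the Moore–Penrose inverse via the SVD, which is more robust when $X$ is ill-conditioned.

```python
import numpy as np

# Synthetic, well-conditioned problem (illustrative data).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

# Normal equations: theta = (X^T X)^{-1} X^T y, without forming the inverse.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Moore–Penrose pseudo-inverse (SVD-based): theta = X^+ y.
theta_pinv = np.linalg.pinv(X) @ y
```

For a well-conditioned $X$ the two results agree; for ill-conditioned problems the SVD route is preferred.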

## QR decomposition

For any matrix $X \in \mathbb{R}^{m \times n}$ there exists a QR decomposition:

$$
X = QR,
$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns (meaning $Q^\top Q = I$) and $R \in \mathbb{R}^{n \times n}$ is an upper triangular matrix. It is important to notice that since $Q^\top Q = I$, substituting $X = QR$ into the normal equations gives:

$$
X^\top X \theta^* = X^\top y \quad \Leftrightarrow \quad R^\top Q^\top Q R \theta^* = R^\top Q^\top y \quad \Leftrightarrow \quad R \theta^* = Q^\top y.
$$

Now the process of finding $\theta^*$ consists of two steps:

- Find the QR decomposition of $X$.
- Solve $R \theta^* = Q^\top y$, which is triangular and, therefore, easy to solve by back substitution.
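The two steps above can be sketched in NumPy as follows (synthetic data; `np.linalg.solve` stands in for a dedicated triangular solver such as SciPy's `solve_triangular`, which would exploit the structure of $R$):

```python
import numpy as np

# Synthetic problem (illustrative data).
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

# Step 1: thin QR decomposition. Q is 50x3 with orthonormal columns,
# R is 3x3 upper triangular.
Q, R = np.linalg.qr(X)

# Step 2: solve the triangular system R theta = Q^T y.
theta_qr = np.linalg.solve(R, Q.T @ y)
```

The result matches the direct least-squares solution, without ever forming $X^\top X$.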

## Cholesky decomposition

For any positive definite matrix $A \in \mathbb{R}^{n \times n}$ there exists a Cholesky decomposition:

$$
A = LL^\top,
$$

where $L$ is a lower triangular matrix. Applying this to the normal equations with $A = X^\top X$, we have:

$$
X^\top X \theta^* = X^\top y \quad \Leftrightarrow \quad L L^\top \theta^* = X^\top y.
$$

Now the process of finding $\theta^*$ consists of three steps:

- Find the Cholesky decomposition of $X^\top X$.
- Find $z = L^\top \theta^*$ by solving the triangular system $Lz = X^\top y$.
- Find $\theta^*$ by solving the triangular system $L^\top \theta^* = z$.

Note that in this case the error is still proportional to the squared condition number, since we form $X^\top X$ explicitly.
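A NumPy sketch of the three steps above (synthetic data, assuming $X$ has full column rank so that $X^\top X$ is positive definite; as before, `np.linalg.solve` stands in for a dedicated triangular solver):

```python
import numpy as np

# Synthetic full-column-rank problem (illustrative data).
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)

# Step 1: Cholesky factor of the Gram matrix: X^T X = L L^T, L lower triangular.
L = np.linalg.cholesky(X.T @ X)

# Step 2: forward substitution, solve L z = X^T y for z = L^T theta.
z = np.linalg.solve(L, X.T @ y)

# Step 3: back substitution, solve L^T theta = z.
theta_chol = np.linalg.solve(L.T, z)
```

This is cheap for tall, skinny $X$, but inherits the squared condition number of $X^\top X$ noted above.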