Imagine that you have a dataset of points. Your goal is to choose orthogonal axes that describe your data in the most informative way. To be precise, we choose the first axis so that it maximizes the variance (expressiveness) of the projected data. Each subsequent axis has to be orthogonal to the previously chosen ones while achieving the largest possible variance of the projections.
Let’s take a look at simple 2D data. We have a set of blue points on the plane. We can easily see that the projections onto the first axis (red dots) have maximum variance at the final position of the animation. The second (and last) axis should be orthogonal to the first one.
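The rotating-axis picture can be mimicked numerically. The sketch below assumes NumPy and a made-up correlated point cloud (the data and all names are illustrative, not taken from the original figure). It scans candidate directions, measures the variance of the projections onto each, and keeps the best one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data: a correlated Gaussian cloud (illustrative only).
cov = np.array([[3.0, 1.5],
                [1.5, 1.0]])
points = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
points = points - points.mean(axis=0)  # center the data

def projected_variance(points, angle):
    """Variance of the projections onto the unit vector at the given angle."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    return np.var(points @ direction)

# Scan a half-circle: a direction and its opposite give the same variance.
angles = np.linspace(0.0, np.pi, 1801)
variances = np.array([projected_variance(points, a) for a in angles])

best_angle = angles[np.argmax(variances)]
first_axis = np.array([np.cos(best_angle), np.sin(best_angle)])
# The second axis is the first one rotated by 90 degrees, hence orthogonal.
second_axis = np.array([-np.sin(best_angle), np.cos(best_angle)])

print("first axis:", first_axis)
print("variance along first axis:", variances.max())
print("variance along second axis:", np.var(points @ second_axis))
```

A brute-force scan only works in 2D; it is here purely to make the "maximize the projected variance" criterion tangible.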
This idea can be used in a variety of ways. For example, projecting complex data onto the principal plane (only 2 components) may give you enough intuition for clustering. The picture below plots the projection of a labeled dataset onto the first two principal components (PCs); we can clearly see that just two vectors (these PCs) are enough to distinguish Finnish people from Italian people in this particular dataset (celiac disease (Dubois et al. 2010)).
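The first component $w_{(1)}$ should be defined so as to maximize the variance. Suppose we have already centered (normalized) the data, i.e. $\sum_{i=1}^{n} a_i = 0$, where $a_i^\top$ are the rows of the data matrix $A \in \mathbb{R}^{n \times d}$. Then the sample variance becomes, up to a constant factor, the sum of all squared projections of the data points onto our vector $w_{(1)}$, which implies the following optimization problem:

$$
w_{(1)} = \arg\max_{\|w\| = 1} \sum_{i=1}^{n} \left(a_i^\top w\right)^2 = \arg\max_{\|w\| = 1} \|Aw\|^2 = \arg\max_{\|w\| = 1} w^\top A^\top A w
$$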
Since we are looking for a unit vector, we can reformulate the problem:

$$
w_{(1)} = \arg\max_{w \neq 0} \frac{w^\top A^\top A w}{w^\top w}
$$
It is known that for the positive semidefinite matrix $A^\top A$ this vector is precisely the eigenvector of $A^\top A$ that corresponds to the largest eigenvalue. The subsequent components are obtained in the same way: they are the eigenvectors corresponding to the next largest eigenvalues.
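The eigenvector claim is easy to check numerically. The following is a minimal sketch, assuming NumPy and a randomly generated, centered data matrix (the data and the helper name `rayleigh` are illustrative). It compares the quotient at the top eigenvector with its value at many random directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data matrix: 200 samples, 5 features, centered column-wise.
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
A = A - A.mean(axis=0)

gram = A.T @ A  # symmetric positive semidefinite

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(gram)
w1 = eigenvectors[:, -1]      # eigenvector of the largest eigenvalue
lambda_max = eigenvalues[-1]

def rayleigh(w):
    """Quotient w^T A^T A w / w^T w for a nonzero vector w."""
    return (w @ gram @ w) / (w @ w)

# Compare against many random directions: none should beat w1.
random_directions = rng.normal(size=(10_000, 5))
random_values = np.array([rayleigh(w) for w in random_directions])

print("quotient at w1:        ", rayleigh(w1))
print("largest eigenvalue:    ", lambda_max)
print("best random direction: ", random_values.max())
```

The quotient at `w1` equals the largest eigenvalue, and random directions can only approach it from below.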
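So we can conclude that the following mapping:

$$
\Pi = A W, \qquad \Pi \in \mathbb{R}^{n \times k}, \quad A \in \mathbb{R}^{n \times d}, \quad W \in \mathbb{R}^{d \times k}
$$

describes the projection of the data onto the principal components, where the columns of $W$ are the first $k$ eigenvectors of $A^\top A$, ordered by decreasing eigenvalue.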
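As an illustration of this mapping, here is a short sketch assuming NumPy. The function name `pca_project` and the synthetic data are hypothetical, and the function centers the data itself so the zero-mean assumption holds.

```python
import numpy as np

def pca_project(A, k):
    """Project the rows of A onto the top-k eigenvectors of A^T A.

    A is centered inside the function, so the caller may pass raw data.
    Returns (Pi, W) with Pi = A_centered @ W.
    """
    A_centered = A - A.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(A_centered.T @ A_centered)
    order = np.argsort(eigenvalues)[::-1]   # largest eigenvalue first
    W = eigenvectors[:, order[:k]]          # shape (d, k)
    return A_centered @ W, W

rng = np.random.default_rng(2)
# Hypothetical data: 100 samples, 4 features with very different scales.
A = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.1])

Pi, W = pca_project(A, k=2)
print(Pi.shape)            # (100, 2)
print(np.var(Pi, axis=0))  # variance captured by each principal component
```

The columns of `W` are orthonormal, and the variance of the first column of `Pi` is at least that of the second.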
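Now we’ll briefly derive how the SVD leads us to PCA.
Firstly, we write down the SVD of our matrix:

$$
A = U \Sigma W^\top
$$

and of its transpose:

$$
A^\top = W \Sigma U^\top
$$

Then, consider the matrix $A^\top A$:

$$
A^\top A = W \Sigma U^\top U \Sigma W^\top = W \Sigma^2 W^\top
$$

which corresponds to the eigendecomposition of the matrix $A^\top A$, where $W$ stands for the matrix of eigenvectors of $A^\top A$, while $\Sigma^2$ contains the eigenvalues of $A^\top A$.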
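This link between the SVD of $A$ and the eigendecomposition of $A^\top A$ can be verified on a small example. The sketch below assumes NumPy and random data. Note that `np.linalg.svd` returns the transposed right factor, called `Wt` here.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))   # hypothetical data matrix
A = A - A.mean(axis=0)

# Thin SVD: A = U @ diag(sigma) @ Wt, with Wt = W^T.
U, sigma, Wt = np.linalg.svd(A, full_matrices=False)
W = Wt.T

# A^T A rebuilt from the right singular vectors and squared singular values.
reconstructed = W @ np.diag(sigma**2) @ W.T
print(np.allclose(A.T @ A, reconstructed))

# Squared singular values match the eigenvalues of A^T A (sorted descending).
eigenvalues = np.linalg.eigvalsh(A.T @ A)[::-1]
print(np.allclose(sigma**2, eigenvalues))
```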
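At the end:

$$
\Pi = A W = U \Sigma W^\top W = U \Sigma
$$

The latter formula provides us with an easy way to compute PCA via SVD with any number of principal components:

$$
\Pi_r = U_r \Sigma_r
$$

where $U_r$ consists of the first $r$ columns of $U$ and $\Sigma_r$ is the leading $r \times r$ block of $\Sigma$.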
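A minimal sketch of this recipe, assuming NumPy, is shown below. The function name `pca_svd` and the synthetic data are illustrative. The check at the end confirms that the truncated product of left singular vectors and singular values agrees with projecting the centered data onto the first `r` right singular vectors.

```python
import numpy as np

def pca_svd(A, r):
    """Return the first r principal-component scores and directions.

    Scores are computed as U_r * Sigma_r from the thin SVD of the centered data.
    """
    A_centered = A - A.mean(axis=0)
    U, sigma, Wt = np.linalg.svd(A_centered, full_matrices=False)
    scores = U[:, :r] * sigma[:r]   # scales column j of U_r by sigma_j
    directions = Wt[:r].T           # shape (d, r)
    return scores, directions

rng = np.random.default_rng(4)
# Hypothetical data: 150 samples, 4 features with different scales.
A = rng.normal(size=(150, 4)) * np.array([4.0, 2.0, 1.0, 0.5])

scores, W_r = pca_svd(A, r=2)

# Same result as projecting the centered data onto the directions.
A_centered = A - A.mean(axis=0)
print(np.allclose(scores, A_centered @ W_r))
```

Working from the SVD avoids forming $A^\top A$ explicitly, which is usually the numerically safer route. The signs of individual components are arbitrary and may differ from those of an eigenvector-based computation.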
Consider the classical Iris dataset. We have the dataset matrix $A \in \mathbb{R}^{150 \times 4}$