By the end of this lesson, you should be able to:
<aside> ⚠️ Math-heavy lesson: this lesson contains many derivations. You don’t need to memorize them all; instead, focus on understanding each step so you see why these methods work.
</aside>
Working with high-dimensional data can be difficult in many ways. First, there are computational considerations when applying machine learning to high-dimensional data. For example, imagine that you are a data scientist at a software firm and you are tracking 300 different metrics about how users use the firm’s product. Your task is to predict the effect of making changes to the software. Are all of these metrics useful? What if, in reality, just 20 of those metrics were enough to retain 99.9% of the performance of your model? Clearly this would be a significant simplification. In particular, if we could reduce the dimensionality of the inputs to just a few abstract features, we could speed up model training and drastically reduce model sizes. Dimensionality reduction attempts to achieve exactly this: to map the data into a low-dimensional space where we can analyze it more easily and, when reducing to 2 or 3 dimensions, visualize it directly.
A classical method for dimensionality reduction is Principal Component Analysis (PCA), whose motivation is to reduce the dimensionality of the data by retaining only the directions of maximum total variance. Let us first remind ourselves of some basic terms from statistics.
<aside> 💡 A (biased) empirical estimate of the covariance matrix for an i.i.d. dataset $\mathcal{X}=\{x_1, \ldots, x_n\}\subset \mathbb{R}^d$ is given by:
$$ S = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top, $$
where $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$ is the empirical mean of $\mathcal{X}$.
As $n$ grows, the matrix $S$ converges to the underlying covariance matrix that you are familiar with,
$$ K_{X, X} = \mathbb{E}\!\left[(X-\mathbb{E}[X])(X-\mathbb{E}[X])^\top\right], $$
for a random variable $X\in \mathbb{R}^d$ that gives rise to the dataset $\mathcal{X}$. The $(i, j)$ entry of $K_{X, X}$ measures the covariance between the $i$th and $j$th coordinates of $X$.
Note in particular that the $i$th diagonal entry of $S$ is the (biased) sample variance of the $i$th coordinate of the dataset $\mathcal{X}$.
</aside>
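To make the definition concrete, here is a minimal sketch, assuming Python with NumPy and a small synthetic dataset (neither is prescribed by this lesson), that computes the biased estimate $S$ directly from the formula above and checks it against `numpy.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3                                            # n samples in d dimensions
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))    # assumed correlated toy data

# Biased empirical covariance: S = (1/n) * sum_i (x_i - x_bar)(x_i - x_bar)^T
x_bar = X.mean(axis=0)
centered = X - x_bar
S = centered.T @ centered / n

# numpy.cov uses the unbiased 1/(n-1) normalization by default;
# bias=True switches it to the 1/n estimator used above.
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))

# The i-th diagonal entry of S is the (biased) sample variance of coordinate i.
assert np.allclose(np.diag(S), centered.var(axis=0))
```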
Remember that $S$ has a very intuitive interpretation: the eigenvectors of $S$ span orthogonal directions along which the data is uncorrelated, and the eigenvectors with the largest eigenvalues correspond to the directions along which the data has the most variance, as illustrated in the figure below:
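Complementing the figure, the following sketch (again NumPy, with an assumed 2D correlated Gaussian as toy data) eigendecomposes $S$ and numerically verifies both claims: projecting onto the eigenvectors decorrelates the coordinates, and the eigenvector with the largest eigenvalue is the direction of greatest variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2D data with strongly correlated coordinates (assumed for illustration).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]],
                            size=2000)

# Biased empirical covariance matrix S.
centered = X - X.mean(axis=0)
S = centered.T @ centered / len(X)

# S is symmetric, so eigh returns real eigenvalues (in ascending order)
# and an orthonormal set of eigenvectors (one per column).
eigvals, eigvecs = np.linalg.eigh(S)

# Projecting onto the eigenvectors decorrelates the data:
# the covariance of the projected data is (approximately) diagonal.
projected = centered @ eigvecs
S_proj = projected.T @ projected / len(X)
assert np.allclose(S_proj, np.diag(eigvals))

# The eigenvector with the largest eigenvalue is the direction of maximum
# variance: no unit vector gives a larger projected variance.
top = eigvecs[:, -1]
random_dirs = rng.normal(size=(100, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
assert all((centered @ u).var() <= (centered @ top).var() + 1e-9
           for u in random_dirs)
```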