By the end of this lesson, you should be able to:
<aside> ⚠️ Math-heavy lesson: this lesson contains many derivations. You don’t need to memorize them all; instead, focus on understanding each step so you see why these methods work.
</aside>
Working with high-dimensional data can be difficult in many ways. First, there are computational considerations when applying machine learning to high-dimensional data. For example, imagine that you are a data scientist at a software firm and you are tracking 300 different metrics about how users use the firm’s product. Your task is to predict the effect of making changes to the software. Are all of these metrics useful? What if, in reality, just 20 of those metrics were enough to retain 99.9% of the performance of your model? Clearly this would be a significant simplification. In particular, if we could reduce the dimensionality of the inputs to just a few abstract features, we could speed up model training and drastically reduce model sizes. Dimensionality reduction attempts to achieve exactly this: to map the data into a low-dimensional space where we can analyze it more easily and, when reducing to 2 or 3 dimensions, visualize it directly.
A classical method for dimensionality reduction is Principal Component Analysis (PCA), whose motivation is to reduce the dimensionality of the data by retaining only the directions of maximum total variance. Let us first remind ourselves of some basic terms from statistics.
<aside> 💡 A (biased) empirical estimate of the covariance matrix for an i.i.d. dataset $\mathcal{X}=\{x_1, \ldots, x_n\}\subset \mathbb{R}^d$ is given by:
$$ S = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top, $$
where $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$ is the empirical mean of $\mathcal{X}$.
As $n$ grows, the matrix $S$ converges to the underlying covariance matrix that you are familiar with,
$$ K_{X, X} = \mathbb{E}\!\left[(X-\mathbb{E}[X])(X-\mathbb{E}[X])^\top\right], $$
for a random variable $X\in \mathbb{R}^d$ that gives rise to the dataset $\mathcal{X}$. The $(i, j)$ entry of $K_{X, X}$ measures the covariance between the $i$th and $j$th coordinates of $X$.
Note in particular that the $i$th diagonal entry of $S$ is the (biased) sample variance of the $i$th coordinate of the dataset $\mathcal{X}$.
</aside>
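To make the definition concrete, here is a minimal sketch, assuming Python with NumPy and a small synthetic dataset (neither is prescribed by this lesson), that computes the biased estimate $S$ directly from the formula above and checks it against `numpy.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3                                            # n samples in d dimensions
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))    # assumed correlated toy data

# Biased empirical covariance: S = (1/n) * sum_i (x_i - x_bar)(x_i - x_bar)^T
x_bar = X.mean(axis=0)
centered = X - x_bar
S = centered.T @ centered / n

# numpy.cov uses the unbiased 1/(n-1) normalization by default;
# bias=True switches it to the 1/n estimator used above.
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))

# The i-th diagonal entry of S is the (biased) sample variance of coordinate i.
assert np.allclose(np.diag(S), centered.var(axis=0))
```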
Remember that $S$ has a very intuitive interpretation: the eigenvectors of $S$ span orthogonal directions along which the data is uncorrelated, and the eigenvectors with the largest eigenvalues correspond to the directions along which the data has the most variance, as illustrated in the figure below:
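Complementing the figure, the following sketch (again NumPy, with an assumed 2D correlated Gaussian as toy data) eigendecomposes $S$ and numerically verifies both claims: projecting onto the eigenvectors decorrelates the coordinates, and the eigenvector with the largest eigenvalue is the direction of greatest variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2D data with strongly correlated coordinates (assumed for illustration).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]],
                            size=2000)

# Biased empirical covariance matrix S.
centered = X - X.mean(axis=0)
S = centered.T @ centered / len(X)

# S is symmetric, so eigh returns real eigenvalues (in ascending order)
# and an orthonormal set of eigenvectors (one per column).
eigvals, eigvecs = np.linalg.eigh(S)

# Projecting onto the eigenvectors decorrelates the data:
# the covariance of the projected data is (approximately) diagonal.
projected = centered @ eigvecs
S_proj = projected.T @ projected / len(X)
assert np.allclose(S_proj, np.diag(eigvals))

# The eigenvector with the largest eigenvalue is the direction of maximum
# variance: no unit vector gives a larger projected variance.
top = eigvecs[:, -1]
random_dirs = rng.normal(size=(100, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
assert all((centered @ u).var() <= (centered @ top).var() + 1e-9
           for u in random_dirs)
```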