At the end of this lesson, you should be able to:
In the previous lesson, we covered some of the fundamental approaches to measuring distances by defining them through metrics. One common approach are examples where the metric was defined on a vector space. However, there are cases where the data really lives on a specific lower-dimensional space within such a vector space. For example, the following plot shows the so-called “Swiss roll” dataset, and an “S” surface of 3-dimensional points.
Stop reading for a moment, and think about the following question*:***
*The idea of data living on a lower dimensional space can be made more rigorous in general, by requiring that the data lives on a manifold of lower dimension, possibly up to some noise. While we won’t make this more precise here, you can think of a smooth manifold of dimension $k$ as a space that is locally parameterizable by $k$ independent parameters - i.e. around each point of the manifold we can find a local coordinate system that parameterizes the points nearby. For example, a sphere $\{x\in \mathbb{R}^n: ||x||=1\}$ is an $n-1$ dimensional smooth manifold and the two surfaces above can be thought of as 2 dimensional manifolds as well.
In the figures above, our lower dimensional manifold on which the data “lives” is a 2-dimensional surface. On the left, it looks like a rolled up piece of paper “a swiss roll”, while on the right like a piece of paper that is bent to the shape of a letter “S”.*
<aside> đź’ˇ The manifold hypothesis in machine learning postulates that many naturally occurring high-dimensional datasets actually lie along a low dimensional manifold. If this is true for a particular dataset in high-dimensions, then it is possible to find lower-dimensional representations of the high-dimensional datapoints, without losing information about them. In the next section, this will be a key starting point for studying dimensionality reduction techniques.
</aside>
In the case of data distributed on a lower dimensional manifold in a metric space, we consider a natural distance measure between two points to be the length of the shortest path on that surface. Such a shortest path is called a geodesic. The sketch below illustrates the difference between taking the ambient Euclidean distance between two points and the length of the shortest path on the surface instead: While P and Q are relatively close in Euclidean space, they are considered far apart when geodesic distance is considered:
The concept of geodesic distance between points is in fact a very rich and natural in a variety of settings illustrated in the figure below. We have already observed that, since the shortest path between two points in Euclidean space is in fact a straight line segment whose length is equal to the Euclidean distance, both geodesic distance and Euclidean distance in $\mathbb{R}^n$ itself coincide. On a space like a sphere $\mathbb{S}^2$ centered at the origin, the geodesic distance corresponds to the length of the shortest circular arc between the points which up a scale factor is equal to the angle between the points (note how this relates to the cosine similarity!). In some problems, such as a robotics setting, we may be given position data about a robot in the plane with information about an obstacle region $\Omega$ that we know the robot cannot enter (rightmost part of figure below). In that case, geodesic distance between any two positions of the robot is also very natural and corresponds to the shortest path avoiding such an obstacle.
Unfortunately, in real datasets, we are rarely given an analytic description of what the underlying manifold is on which the data was sampled, but some notable exceptions exists such as data arising in robotic configuration spaces or phenomena where we collect data with known physical constraints. Our goal is hence to try to approximate this underlying geodesic distance as well as we can based just on the available data sample.