At the end of this lesson, you should be able to:
<aside> ⚠️ Math-heavy disclaimer — this module contains a lot of math and theory. What we want you to take away from this module are the main concepts. If you can follow along as you read, know the main principles and understand the terms in the numbered equations, that is what we are expecting.
</aside>
Let us focus for a moment on the question of "how data may arise" for typical machine learning problems. One of the most common types of data is associated with some sort of measurements from the physical world that arise by means of some measurement process $f$ that maps a collection of $n$ real world objects or phenomena $\mathcal{R}$ to a collection of measurements $\mathcal{X}$ about them. Consider for example the figure below and assume we would like to collect data about apartments in Gamla Stan in Stockholm. For each apartment, we may consider its square meter size, current apartment value, number of inhabitants, date of last renovation, average heating costs, etc. You may think of such a dataset of measurements $\mathcal{X}$ as a table with one row for each apartment and multiple columns, one per measurement type. Very commonly, you may in fact think of $f$ as mapping to a vector space of some dimension $n$ (but for some types of data such as text, this mapping to a vector space may not be immediately obvious and require additional parts to a pipeline such as word embeddings into a vector space). Clearly the data $\mathcal{X}$ may form interesting patterns which are important to understand the structure of real world phenomena that we would like to model and uncover with machine learning methods.
Illustration how a dataset $\mathcal{X}$ of measurements may arise from a set of real world objects $\mathcal{R}$: A set of apartments $\mathcal{R}$ in the real world in the image to the left (marked by small squares) are mapped to measurements about them $\mathcal{X}$ in terms of attributes of interest by a measurement process $f$.
<aside> đź’ˇ Exercise Consider a dataset of color photos of fixed pixel resolution, one per apartment in the example above. Can it be thought of in terms of a measurement function that maps real world apartments to a vector space? Describe the dimensionality $n$ of the vector space used to describe images.
</aside>
<aside> đź’ˇ Exercise Can you think of a measurement function that does not map to a vector space? Do you think it could be approximated using a function mapping to a vector space?
</aside>
To describe the structure of our real world objects $\mathcal{R}$ we would like to make quantitative statements in terms of our measurement dataset $\mathcal{X}$, and in particular to measure how far away different parts of the data are from each other.
Specifying or, alternatively, automatically uncovering a notion of how to compare any two data points in our observed dataset in terms of a numeric measure of “distance” or “similarity” is a key step in reasoning about and understanding the structure of data.
If we think for a moment about how to measure differences between the samples of our dataset, we observe that it is often natural to determine distances for individual parts of our measurements. For example, differences in apartment prices may be naturally measured in a fixed currency using the absolute value difference $d_1(price_1, price_2) = |price_1-price_2|$. Similarly, differences in apartment size can be measured in square meters $d_2(size_1, size_2) = |size_1-size_2|$.