In the previous modules, we solved problems where the ultimate goal was to correctly assign a value to any single input sample $x$.
$$ x \to f(x, \theta) $$
This includes regression problems, where $f$ is continuous-valued, and classification problems, where $f$ is discrete-valued. Let's focus on classification for a bit.
Some of the classifiers we have learned about so far associate a probability with each possible label, so that the output is simply the most probable one.
$$ x \to \argmax_y p(y|x;\theta) $$
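To make this prediction rule concrete, here is a minimal sketch using scikit-learn's `LogisticRegression` and a toy dataset (both are our own choices for illustration; any classifier that exposes class probabilities works the same way): the model returns $p(y|x;\theta)$ for every class, and the prediction is simply the argmax over those probabilities.

```python
# Minimal sketch: a probabilistic classifier predicts by taking the argmax
# of p(y | x; theta). LogisticRegression is used purely as an illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy 2D dataset with three classes (simulated, just for the demo).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

clf = LogisticRegression().fit(X, y)

x_new = X[:1]                         # a single input sample x
probs = clf.predict_proba(x_new)      # p(y | x; theta) for every class y
y_hat = np.argmax(probs, axis=1)      # the most probable label

print(probs, y_hat, clf.predict(x_new))  # predict() performs the same argmax
```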
One could say that these classifiers already have probabilities built into them. Why is this module important, then?
It turns out there is a third type of machine-learning problem called density estimation.
Density estimation is the fundamental problem of approximating an unknown generative distribution $p_\text{data}$, which, importantly, is known only through the examples in a dataset, with a probabilistic model $p$: another distribution that we construct from the data.
If the dataset contains inputs and target values, i.e., $\mathcal{D} = (\mathcal{X}, \mathcal{Y})$, then the probabilistic model is the joint density $p_{\pmb{\mathcal{X}}, \pmb{\mathcal{Y}}}$, while in an unsupervised setting, i.e., when we have no target values, the probabilistic model is just the density $p_{\pmb{\mathcal{X}}}$.
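As a concrete illustration, the sketch below fits a simple density model to samples. We assume a Gaussian mixture from scikit-learn purely for convenience (any density model with a fit/evaluate interface would show the same idea), and the data are simulated, since in practice $p_\text{data}$ is never available directly.

```python
# Minimal sketch of density estimation: approximate the unknown p_data with a
# tractable model p, here a Gaussian mixture fitted to samples (an assumption;
# this is not the only possible choice of density model).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Samples from an "unknown" generative distribution p_data (here simulated).
X = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=(200, 1)),
    rng.normal(loc=+1.0, scale=1.0, size=(200, 1)),
])

# Unsupervised setting: fit the model density p_X to the inputs alone.
p_x = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns log p(x); exponentiate to get the estimated density.
x_query = np.array([[0.0]])
print(np.exp(p_x.score_samples(x_query)))

# Supervised setting (joint density p_{X,Y}): one simple option is to fit the
# same kind of model to the concatenated pairs [x, y].
```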
<aside> 💡 In this lesson, we mostly talk about supervised problems, but the content also applies to unsupervised problems.
</aside>
<aside> 🗨️ Nomenclature
Take any of our previous classifiers (e.g., a logistic regression, an SVM, or a neural network) and train it on a 2D dataset with three classes A, B and C. The trained classifier can then be represented by its decision boundary drawn in the data space: the prediction for any (potentially new) datapoint $\color{purple}{x_\text{in}}$ is determined by which coloured region the datapoint falls into.