In the previous modules, we solved problems where the ultimate goal was to correctly assign a value to any single input sample $x$.
$$ x \to f(x, \theta) $$
This includes regression problems, where $f$ is continuous-valued, and classification problems, where $f$ is discrete-valued. Let's focus on classification for a bit.
Some of the classifiers we have learned about so far associate a probability with each possible label, so that the output is simply the most probable one.
$$ x \to \argmax_y p(y|x;\theta) $$
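To make this prediction rule concrete, here is a minimal sketch using scikit-learn's `LogisticRegression` and a toy dataset (both are our own choices for illustration; any classifier that exposes class probabilities works the same way): the model returns $p(y|x;\theta)$ for every class, and the prediction is simply the argmax over those probabilities.

```python
# Minimal sketch: a probabilistic classifier predicts by taking the argmax
# of p(y | x; theta). LogisticRegression is used purely as an illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy 2D dataset with three classes (simulated, just for the demo).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

clf = LogisticRegression().fit(X, y)

x_new = X[:1]                         # a single input sample x
probs = clf.predict_proba(x_new)      # p(y | x; theta) for every class y
y_hat = np.argmax(probs, axis=1)      # the most probable label

print(probs, y_hat, clf.predict(x_new))  # predict() performs the same argmax
```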
One could say that these classifiers already have probabilities built into them. Why is this module important, then?
It turns out there is a third type of machine-learning problem called density estimation.
Density estimation is the fundamental problem of approximating an unknown generative distribution $p_\text{data}$, which, importantly, is known only through the examples in a dataset, with a probabilistic model $p$: another distribution that we construct from the data.
If the dataset contains inputs and target values, i.e., $\mathcal{D} = (\mathcal{X}, \mathcal{Y})$, then the probabilistic model is the joint density $p_{\pmb{\mathcal{X}}, \pmb{\mathcal{Y}}}$, while in an unsupervised setting, i.e., when we have no target values, the probabilistic model is just the density $p_{\pmb{\mathcal{X}}}$.
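As a concrete illustration, the sketch below fits a simple density model to samples. We assume a Gaussian mixture from scikit-learn purely for convenience (any density model with a fit/evaluate interface would show the same idea), and the data are simulated, since in practice $p_\text{data}$ is never available directly.

```python
# Minimal sketch of density estimation: approximate the unknown p_data with a
# tractable model p, here a Gaussian mixture fitted to samples (an assumption;
# this is not the only possible choice of density model).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Samples from an "unknown" generative distribution p_data (here simulated).
X = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=(200, 1)),
    rng.normal(loc=+1.0, scale=1.0, size=(200, 1)),
])

# Unsupervised setting: fit the model density p_X to the inputs alone.
p_x = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns log p(x); exponentiate to get the estimated density.
x_query = np.array([[0.0]])
print(np.exp(p_x.score_samples(x_query)))

# Supervised setting (joint density p_{X,Y}): one simple option is to fit the
# same kind of model to the concatenated pairs [x, y].
```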
<aside> 💡 In this lesson, we mostly talk about supervised problems, but the content also applies to unsupervised problems.
</aside>
<aside> 🗨️ Nomenclature
Take any of our previous classifiers (e.g., a logistic regression, an SVM, or a neural network) and train it on a 2D dataset with three classes A, B and C. The trained classifier can then be represented by its decision boundary drawn in the data space: the prediction for any (potentially new) datapoint $\color{purple}{x_\text{in}}$ is determined by which coloured region the datapoint falls into.