Three limiting factors in density estimation

We started the module with parametric density estimation, which proceeds in three steps (a minimal code sketch follows the aside below):

<aside> 📌 Parametric density estimation in 3 steps

  1. Select a parametric model family (e.g., a multivariate Gaussian);
  2. Select a fitting principle (e.g., maximum likelihood);
  3. Search for the optimal parameters, guided by the fitting principle. </aside>
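
To make the recipe concrete, here is a minimal code sketch of the three steps, assuming NumPy and SciPy are available and with a random placeholder array standing in for a real dataset. For a Gaussian family, step 3 happens to have a closed-form solution:

```python
# Minimal sketch of the three-step recipe (placeholder data, not the module's own code).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # placeholder: any (n, d) array of datapoints

# Step 1: choose a parametric model family: a 2D multivariate Gaussian
#         with parameters theta = (mu, Sigma).
# Step 2: choose a fitting principle: maximum likelihood.
# Step 3: search for the optimal parameters. For a Gaussian, the maximum-likelihood
#         solution is available in closed form: the sample mean and the (biased)
#         sample covariance.
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)

# The fitted density can now be evaluated, e.g. as an average log-likelihood.
avg_loglik = multivariate_normal(mu_hat, Sigma_hat).logpdf(X).mean()
print(mu_hat, Sigma_hat, avg_loglik)
```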

We learnt that maximum-likelihood estimation is a dominant paradigm, partly because it has a number of appealing theoretical properties, including consistency and asymptotic efficiency. However, this does not mean that the estimate will converge to the true distribution when we fit simple models in practice. For instance, we motivated mixture models with the example of trying to fit a 2D Gaussian to the Old Faithful dataset, which necessarily under-delivers:

[Figure: a single 2D Gaussian fitted to the Old Faithful dataset]

Misspecification

The bad situation in the above figure is a consequence of model misspecification, which happens when the data doesn’t come from any of the distributions in our parametric family. In this case, our parametric family assumes that the data is drawn from a 2D Gaussian, but it is not. Therefore, none of the models we are considering are able to fit the data perfectly (or even well).

The mathematical requirements of the nice theory behind MLE and MAP are even stricter. It is assumed that the datapoints $x_i$ are i.i.d. (independent and identically distributed) samples drawn from the distribution corresponding to $p_{\pmb x}(x;\theta^\star)$ for some true but unknown parameter value $\theta^\star\in\mathit\Theta$. If this is not true, in the sense that the data comes from a distribution that is not included in our parametric family, we say that the model (or, more accurately, the model family) is misspecified. In that case, many optimality theorems and theoretical guarantees in probability and machine learning no longer apply.
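
To spell this out in the same notation: under the i.i.d. assumption, the maximum-likelihood estimate is

$$
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta\in\mathit\Theta} \sum_{i=1}^{n} \log p_{\pmb x}(x_i;\theta),
$$

and the classical guarantees (consistency, asymptotic efficiency) describe how $\hat{\theta}_{\mathrm{ML}}$ approaches $\theta^\star$ as $n$ grows, which only makes sense if such a $\theta^\star\in\mathit\Theta$ exists in the first place.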

The misspecification gap

In practice, our machine-learning models can never perfectly match the behaviour of the data. There will always be misspecification in the real world. Consequently, it is likely that our model will give less than perfect performance on whatever task we use it for.

If our modelling assumptions are wrong, then no matter how carefully we select the model parameters, the best possible fit we can achieve is fundamentally limited, and so is the best possible performance.
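
One way to make this concrete is to measure performance by the expected log-likelihood under the true data distribution (a natural choice for density estimation; the density $p^\star$ below is our own shorthand for that true distribution). Then

$$
\max_{\theta\in\mathit\Theta} \mathbb{E}_{x\sim p^\star}\big[\log p_{\pmb x}(x;\theta)\big]
= \mathbb{E}_{x\sim p^\star}\big[\log p^\star(x)\big]
- \min_{\theta\in\mathit\Theta} \mathrm{KL}\big(p^\star \,\big\|\, p_{\pmb x}(\,\cdot\,;\theta)\big),
$$

so even the best model in the family falls short of the true density by the smallest KL divergence achievable within the family, a shortfall that is zero only when the family is well specified.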

We call the gap between these two upper performance bounds the misspecification gap. This term is our own invention, so you won’t run into it in other courses, but we think it is really instructive for thinking about different machine-learning models and their performance. Let’s put this on a scale by visualising a “performance thermometer” (also our own invention):

[Figure: the performance thermometer, with the misspecification gap between the theoretically optimal performance and the performance of the best model in the model family]

The key observation is that the misspecification gap only depends on the problem we are trying to solve and on the model family we are using to try to solve that problem.

Specifically, the gap is determined by the performance of the best model in our model family. While the problem setup (and thus the optimal, theoretical performance) is frequently outside our control, exactly which model family we choose to use is completely our own decision, and something that we can indeed control!
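
To make the gap tangible, here is a small illustrative sketch (our own toy setup, assuming NumPy and SciPy, not part of the module's code): data is drawn from a two-component Gaussian mixture, and we compare the average log-likelihood of the true model with that of the best single Gaussian the misspecified family can offer. The difference estimates the misspecification gap, and no amount of parameter tuning within the single-Gaussian family can close it:

```python
# Illustrative estimate of a misspecification gap in average log-likelihood (toy setup).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 20_000

# "True" data-generating process: a bimodal two-component 2D Gaussian mixture,
# loosely in the spirit of the Old Faithful example.
means = np.array([[-2.0, -2.0], [2.0, 2.0]])
comp = (rng.random(n) < 0.5).astype(int)
X = means[comp] + rng.normal(scale=0.7, size=(n, 2))

# Average log-likelihood of the true model (the optimal achievable performance here).
cov_true = 0.49 * np.eye(2)  # 0.7**2 on the diagonal
dens0 = multivariate_normal(means[0], cov_true).pdf(X)
dens1 = multivariate_normal(means[1], cov_true).pdf(X)
loglik_true = np.log(0.5 * dens0 + 0.5 * dens1).mean()

# Best model in the misspecified family (a single Gaussian): closed-form MLE.
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)
loglik_best_in_family = multivariate_normal(mu_hat, Sigma_hat).logpdf(X).mean()

# How far even the best single Gaussian falls short of the true model.
print("estimated misspecification gap:", loglik_true - loglik_best_in_family)
```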

The optimisation gap