At the end of this lesson, you should be able to:
One of the simplest models for regression that you have encountered in this course so far is the linear model, where the goal is to find the linear function that best fits a dataset $\mathcal{X}=\{(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)\}$ of input–output pairs with inputs $\tilde{x}_i\in \mathbb{R}^d$ and outputs $\tilde{y}_i\in\mathbb{R}$.
<aside> 💡 Notation: In this section we have slightly shifted notation and write training data pairs as $(\tilde{x}_i, \tilde{y}_i)$ instead of $(x_i, y_i)$, because we will additionally use $x=(x_1, \ldots, x_d)\in \mathbb{R}^d$ to denote a free variable that is distinct from the training data.
</aside>
Let us start with the case $d=1$, where we can visualize the problem as finding the best line $f(x)=\theta_1 + \theta_2 x$ through our data.
What do we mean by “best fit”? Well, one reasonable first choice is to consider the mean squared error between our estimated function values $f(\tilde{x}_i)$ and the values we have observed, $\tilde{y}_i$:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(f(\tilde{x}_i) - \tilde{y}_i\right)^2.$$
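To make this concrete, here is a minimal sketch of evaluating this mean squared error for a candidate line. It assumes NumPy; the toy data and the particular choice of $(\theta_1, \theta_2)$ are illustrative assumptions, not part of the lesson.

```python
import numpy as np

# Illustrative 1D training data (assumed, not from the lesson).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.5, 1.9, 4.1, 5.8])

def mse(theta1, theta2, x, y):
    """Mean squared error of the line f(x) = theta1 + theta2 * x on data (x, y)."""
    predictions = theta1 + theta2 * x
    return np.mean((predictions - y) ** 2)

# Loss of one arbitrary candidate line; different (theta1, theta2) give different losses.
print(mse(0.5, 1.8, x_train, y_train))
```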
Let us rewrite this loss in a slightly fancier way: define $\theta = (\theta_1, \theta_2)^t$ and a feature map $\phi(x) = (1, x)^t$, so that $f(x) = \theta^t \phi(x)$ and the loss can be rewritten as:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^t \phi(\tilde{x}_i) - \tilde{y}_i\right)^2.$$
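As a sketch of how the feature-map form is used in practice, we can stack the feature vectors $\phi(\tilde{x}_i)$ as rows of a matrix and minimize the loss by ordinary least squares. The solver call and the toy data (same as above) are assumptions for illustration; the lesson itself has not yet discussed how to minimize the loss.

```python
import numpy as np

# Same illustrative toy data as above.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.5, 1.9, 4.1, 5.8])

# Rows of Phi are the feature vectors phi(x_i) = (1, x_i)^t.
Phi = np.stack([np.ones_like(x_train), x_train], axis=1)

# theta minimizing (1/n) * sum_i (theta^t phi(x_i) - y_i)^2, via least squares.
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

loss = np.mean((Phi @ theta - y_train) ** 2)
print(theta, loss)
```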
In the more general case, where $\tilde{x}_i\in \mathbb{R}^d$, we instead fit a linear function $f:\mathbb{R}^d\to \mathbb{R}$ defined by a parameter vector $\theta=(\theta_1, \ldots, \theta_{k})^t\in\mathbb{R}^k$ with $k=d+1$, which takes the form $f(x) = \theta_1 + \sum_{i=2}^{k} \theta_i x_{i-1}$, where $x=(x_1, \ldots, x_d)^t\in \mathbb{R}^d$. In general, we are thus fitting a $d$-dimensional plane to the data.
In general we can define $\phi(x) = \phi((x_1, \ldots, x_d)) = (1, x_1, \ldots, x_d)^t$, so that $f(x) = \theta^t \phi(x)$, and we can rewrite the loss in the same form as before in any dimension:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^t \phi(\tilde{x}_i) - \tilde{y}_i\right)^2.$$
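The same recipe works in any dimension: build the feature matrix with a leading column of ones, corresponding to $\phi(x) = (1, x_1, \ldots, x_d)^t$, and fit $\theta\in\mathbb{R}^{d+1}$ by least squares. The following sketch again assumes NumPy; the randomly generated data and the "true" parameter vector are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n points in d dimensions (assumed, not from the lesson).
n, d = 50, 3
X_train = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])  # (theta_1, ..., theta_{d+1}), made up
y_train = np.column_stack([np.ones(n), X_train]) @ true_theta + 0.1 * rng.normal(size=n)

# Rows of Phi are phi(x_i) = (1, x_{i,1}, ..., x_{i,d})^t, so Phi has shape (n, d + 1).
Phi = np.column_stack([np.ones(n), X_train])

# Fit theta by minimizing the mean squared error (ordinary least squares).
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

loss = np.mean((Phi @ theta - y_train) ** 2)
print(theta, loss)
```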