🎓 Intended Learning Outcomes

At the end of this lesson, you should be able to:

The surprising power of linear models

One of the simplest regression models you have encountered in this course so far is the linear model, where the goal is to find the best fit of a linear function to a dataset $\mathcal{X}=\{(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)\}$ of input $\tilde{x}_i\in \mathbb{R}^d$ and output $\tilde{y}_i\in\mathbb{R}$ pairs.

<aside> 💡 Notation: In this section we have slightly shifted notation and write training data pairs as $(\tilde{x}_i, \tilde{y}_i)$ instead of $(x_i, y_i)$ as before, because we will additionally use $x=(x_1, \ldots, x_d)\in \mathbb{R}^d$ to denote a free variable that is distinct from the training data.

</aside>

Let us think about the case of $d=1$, where we can visualize the problem as trying to find the best line $f(x)=\theta_1 + \theta_2 x$ through our data:

![Fitting a line through a one-dimensional dataset](linearregression.png)

What do we mean by “best fit”? Well, one reasonable first choice is to consider the mean squared error between our estimated function values $f(\tilde{x}_i)$ and the values we have observed, $\tilde{y}_i$:

$$
\mathcal{L}(\theta_1, \theta_2) = \frac{1}{n}\sum_{i=1}^{n}\big(f(\tilde{x}_i) - \tilde{y}_i\big)^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\theta_1 + \theta_2\,\tilde{x}_i - \tilde{y}_i\big)^2
$$
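To make this concrete, here is a minimal NumPy sketch that evaluates this mean squared error for one candidate line. The dataset and the parameter values are made up purely for illustration:

```python
import numpy as np

# Toy one-dimensional dataset (made-up values, purely for illustration)
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def mse(theta1, theta2, x, y):
    """Mean squared error of the line f(x) = theta1 + theta2 * x on the data (x, y)."""
    predictions = theta1 + theta2 * x
    return np.mean((predictions - y) ** 2)

# Loss of one particular candidate line; smaller values mean a better fit
print(mse(1.0, 2.0, x_train, y_train))
```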

Let us rewrite this loss in a slightly fancier way: define $\theta = (\theta_1, \theta_2)^t$ and a feature map $\phi(x) = (1, x)^t$, so that $f(x) = \theta^t\phi(x)$ and the loss can be rewritten as:

$$
\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big(\theta^t \phi(\tilde{x}_i) - \tilde{y}_i\big)^2
$$
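In code, this feature-map view amounts to stacking the vectors $\phi(\tilde{x}_i)^t$ as rows of a design matrix. One standard way to find a good $\theta$ is then a least-squares solver, which minimises exactly this squared error. A sketch, again with made-up data:

```python
import numpy as np

# Same toy dataset as above (made-up values)
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix: row i is phi(x_i)^t = (1, x_i)
Phi = np.column_stack([np.ones_like(x_train), x_train])

# theta = (theta_1, theta_2) minimising the mean squared error, via least squares
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

loss = np.mean((Phi @ theta - y_train) ** 2)
print(theta, loss)
```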

In the more general case, where $\tilde{x}_i\in \mathbb{R}^d$, we instead fit a linear function $f:\mathbb{R}^d\to \mathbb{R}$ defined by a parameter vector $\theta=(\theta_1, \ldots, \theta_{k})^t\in\mathbb{R}^k$ with $k=d+1$, which takes the form $f(x) = \theta_1 + \sum_{i=1}^{d} \theta_{i+1} x_i$, where $x=(x_1, \ldots, x_d)^t\in \mathbb{R}^d$. In other words, we are fitting a $d$-dimensional hyperplane to the data.

In general we can define $\phi(x) = \phi((x_1, \ldots, x_d)) = (1, x_1, \ldots, x_d)^t$, so that $f(x) = \theta^t \phi(x)$, and we can rewrite the loss in the same form as before in any dimension:

$$
\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big(\theta^t \phi(\tilde{x}_i) - \tilde{y}_i\big)^2
$$
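The code pattern carries over unchanged to $d > 1$: each row of the design matrix becomes $\phi(\tilde{x}_i)^t = (1, \tilde{x}_{i,1}, \ldots, \tilde{x}_{i,d})$. A minimal sketch with synthetic, randomly generated data (purely illustrative):

```python
import numpy as np

# Synthetic d-dimensional data (made up): y is roughly linear in x plus a little noise
rng = np.random.default_rng(0)
n, d = 50, 3
X_train = rng.normal(size=(n, d))              # each row is an input x_i in R^d
true_theta = np.array([3.0, 2.0, -1.0, 0.5])   # intercept first, then the d slopes
y_train = true_theta[0] + X_train @ true_theta[1:] + 0.1 * rng.normal(size=n)

# Design matrix: row i is phi(x_i)^t = (1, x_i1, ..., x_id), so its shape is (n, d + 1)
Phi = np.column_stack([np.ones(n), X_train])

theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
print(theta)  # close to true_theta, since the data is nearly linear
```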