At the end of this lesson, you should be able to:
One of the simplest models for regression that you have encountered in this course so far is the linear model, where the goal is to find the linear function that best fits a dataset $\mathcal{X}=\{(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)\}$ of input–output pairs with inputs $\tilde{x}_i\in \mathbb{R}^d$ and outputs $\tilde{y}_i\in\mathbb{R}$.
<aside> 💡 Notation: In this section we have slightly shifted notation and write training data pairs as $(\tilde{x}_i, \tilde{y}_i)$ instead of $(x_i, y_i)$, because we will additionally use $x=(x_1, \ldots, x_d)\in \mathbb{R}^d$ to denote a free variable that is distinct from the training data.
</aside>
Let us start with the case $d=1$, where we can visualize the problem as finding the best line $f(x)=\theta_1 + \theta_2 x$ through our data.
What do we mean by “best fit”? Well, one reasonable first choice is to consider the mean squared error between our estimated function values $f(\tilde{x}_i)$ and the values we have observed, $\tilde{y}_i$:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(f(\tilde{x}_i) - \tilde{y}_i\right)^2.$$
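To make this concrete, here is a minimal sketch of evaluating this mean squared error for a candidate line. It assumes NumPy; the toy data and the particular choice of $(\theta_1, \theta_2)$ are illustrative assumptions, not part of the lesson.

```python
import numpy as np

# Illustrative 1D training data (assumed, not from the lesson).
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.5, 1.9, 4.1, 5.8])

def mse(theta1, theta2, x, y):
    """Mean squared error of the line f(x) = theta1 + theta2 * x on data (x, y)."""
    predictions = theta1 + theta2 * x
    return np.mean((predictions - y) ** 2)

# Loss of one arbitrary candidate line; different (theta1, theta2) give different losses.
print(mse(0.5, 1.8, x_train, y_train))
```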
Let us rewrite this loss in a slightly fancier way: define $\theta = (\theta_1, \theta_2)^t$ and a feature map $\phi(x) = (1, x)^t$, so that $f(x) = \theta^t \phi(x)$ and the loss can be rewritten as:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^t \phi(\tilde{x}_i) - \tilde{y}_i\right)^2.$$
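As a sketch of how the feature-map form is used in practice, we can stack the feature vectors $\phi(\tilde{x}_i)$ as rows of a matrix and minimize the loss by ordinary least squares. The solver call and the toy data (same as above) are assumptions for illustration; the lesson itself has not yet discussed how to minimize the loss.

```python
import numpy as np

# Same illustrative toy data as above.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.5, 1.9, 4.1, 5.8])

# Rows of Phi are the feature vectors phi(x_i) = (1, x_i)^t.
Phi = np.stack([np.ones_like(x_train), x_train], axis=1)

# theta minimizing (1/n) * sum_i (theta^t phi(x_i) - y_i)^2, via least squares.
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

loss = np.mean((Phi @ theta - y_train) ** 2)
print(theta, loss)
```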
In the more general case, where $\tilde{x}_i\in \mathbb{R}^d$, we instead fit a linear function $f:\mathbb{R}^d\to \mathbb{R}$ defined by a parameter vector $\theta=(\theta_1, \ldots, \theta_{k})^t\in\mathbb{R}^k$ with $k=d+1$, which takes the form $f(x) = \theta_1 + \sum_{i=2}^{k} \theta_i x_{i-1}$, where $x=(x_1, \ldots, x_d)^t\in \mathbb{R}^d$. In general, we are thus fitting a $d$-dimensional plane to the data.
In general we can define $\phi(x) = \phi((x_1, \ldots, x_d)) = (1, x_1, \ldots, x_d)^t$, so that $f(x) = \theta^t \phi(x)$, and we can rewrite the loss in the same form as before in any dimension:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^t \phi(\tilde{x}_i) - \tilde{y}_i\right)^2.$$
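The same recipe works in any dimension: build the feature matrix with a leading column of ones, corresponding to $\phi(x) = (1, x_1, \ldots, x_d)^t$, and fit $\theta\in\mathbb{R}^{d+1}$ by least squares. The following sketch again assumes NumPy; the randomly generated data and the "true" parameter vector are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n points in d dimensions (assumed, not from the lesson).
n, d = 50, 3
X_train = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])  # (theta_1, ..., theta_{d+1}), made up
y_train = np.column_stack([np.ones(n), X_train]) @ true_theta + 0.1 * rng.normal(size=n)

# Rows of Phi are phi(x_i) = (1, x_{i,1}, ..., x_{i,d})^t, so Phi has shape (n, d + 1).
Phi = np.column_stack([np.ones(n), X_train])

# Fit theta by minimizing the mean squared error (ordinary least squares).
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

loss = np.mean((Phi @ theta - y_train) ** 2)
print(theta, loss)
```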