raw machine learning

Regression Introduction

Robert Eisele

Definition

Given training data \(\mathcal{D}=\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N\) of two variables \(X\) and \(Y\), a regression is a model \(y = f(x; a, b, c, \dots)\) that best approximates the measurements.

The simplest regression model is linear regression.

Linear Regression

The problem of linear regression consists of finding a linear function

\[f(x) = mx + b\]

that best describes a given training data set. But what does that mean? A line is typically described by two points, and things get complicated once a third point is added. Laplace worked on this problem and decided to add a latent variable \(\epsilon\) to express the discrepancy between the model and the actual data points.

\[f(x) = mx + b+\epsilon\]

The problem can now be pictured intuitively as a rod held in balance by springs attached to each data point, which illustrates a one-dimensional linear regression.

The linear regression problem has been studied since the 18th century, and the best-known solution was proposed independently by Gauss and Legendre: choose the line that minimizes the sum of the squares of the distances from the training points. This technique is known as least squares and is optimal in the case of linear targets corrupted by Gaussian noise.

Choose the model

The model of linear regression is

\[y = f(x) = wx + b+\epsilon\]

with \(w\) being the weight, \(b\) the bias and \(y\) the prediction.

Generalize the model

In the multivariable case, \(x\) becomes a vector \(\mathbf{x} = (x_1, x_2, \dots, x_D)^T\), the weight \(w\) becomes a vector \(\mathbf{w} = (w_1, w_2, \dots, w_D)^T\) of \(D\) features, and the model becomes:

\[y = \sum\limits_{i=1}^D w_i x_i + b + \epsilon = \langle\mathbf{w}\cdot\mathbf{x}\rangle + b + \epsilon\]

which fits a hyperplane to the given points.
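
As a quick sketch (not part of the original derivation), the multivariate prediction is just an inner product plus the bias. The NumPy code below uses hypothetical example values for \(\mathbf{w}\), \(\mathbf{x}\) and \(b\):

```python
import numpy as np

def predict(x, w, b):
    """Multivariate linear model: y = <w, x> + b (the noise term is omitted at prediction time)."""
    return np.dot(w, x) + b

# Hypothetical example with D = 3 features
w = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 3.0, 0.5])
b = 0.7
print(predict(x, w, b))  # 0.5*1.0 - 1.2*3.0 + 2.0*0.5 + 0.7 = -1.4
```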

Choose a loss function

The error for each data point is \(\epsilon^{(i)}\). Since we want a distinct solution, we define the loss function as the sum of all squared errors and look for the parameters \((\mathbf{w}, b)\) that minimize the squared deviations from the data. Because the sum of squared errors is convex in the parameters, it has a distinct minimum (as long as the features are linearly independent).

\[\mathcal{L}(\mathbf{w}, b) = \sum\limits_{i=1}^N(\epsilon^{(i)})^2 = \sum\limits_{i=1}^N(\underbrace{y^{(i)} - \langle\mathbf{w}\cdot\mathbf{x}^{(i)}\rangle - b}_{\text{residual}})^2\]
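
As a minimal sketch of this loss in code (assuming NumPy, with `X` a matrix of shape \((N, D)\) holding the feature vectors as rows and `y` the vector of targets):

```python
import numpy as np

def squared_error_loss(w, b, X, y):
    """Sum of squared residuals: L(w, b) = sum_i (y_i - <w, x_i> - b)^2."""
    residuals = y - X @ w - b
    return np.sum(residuals ** 2)
```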

Vectorizing the problem

To simplify the upcoming equations, we augment the vector \(\mathbf{w}\) such that \(\hat{\mathbf{w}} = (\mathbf{w}^T, b)^T\) and the vector \(\mathbf{x}\) such that \(\hat{\mathbf{x}}=(\mathbf{x}^T, 1)^T\). Additionally, we stack the augmented feature vectors as the rows of a matrix \(\mathbf{X}\):

\[\mathbf{X} = \left(\begin{array}{c}(\hat{\mathbf{x}}^{(1)})^T\\(\hat{\mathbf{x}}^{(2)})^T\\\vdots\\(\hat{\mathbf{x}}^{(N)})^T\end{array}\right)\]

With this new notation, the vector of residuals between the target values and the weighted input data becomes

\[\mathbf{y} - {\mathbf{X}}\hat{\mathbf{w}}\]

Hence, the loss function can be written as

\[\mathcal{L}(\hat{\mathbf{w}}) = (\mathbf{y} - {\mathbf{X}}\hat{\mathbf{w}})^T(\mathbf{y} - {\mathbf{X}}\hat{\mathbf{w}})\]
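
The augmentation and the vectorized loss can be sketched as follows (a NumPy sketch; the helper names are my own):

```python
import numpy as np

def augment(X):
    """Append a column of ones so that the bias b becomes the last entry of w_hat."""
    return np.column_stack([X, np.ones(len(X))])

def vectorized_loss(w_hat, X, y):
    """L(w_hat) = (y - X_aug w_hat)^T (y - X_aug w_hat)."""
    r = y - augment(X) @ w_hat
    return r @ r
```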

Formulate the optimization problem

We can now minimize the loss function \(\mathcal{L}\) by differentiating with respect to the parameter \(\hat{\mathbf{w}}\) and setting the resulting equations to zero:

\[\frac{\partial \mathcal{L}}{\partial\hat{\mathbf{w}}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}}\overset{!}{=}0\]

Rearranging the terms and dividing by two yields the well-known normal equation:

\[\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}}=\mathbf{X}^T\mathbf{y}\]

and if the inverse of \(\mathbf{X}^T\mathbf{X}\) (called the Gramian matrix) exists, the solution of the least squares problem is

\[\hat{\mathbf{w}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

with \(\mathbf{X}^T\mathbf{y}\) called the moment matrix. The Gramian matrix is always positive semi-definite and is invertible exactly when it is positive definite. If it is singular, the pseudo-inverse can be used, or a regularizer can be introduced, which leads to ridge regression.
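
A sketch of the closed-form solution via the normal equation (assuming NumPy; for a singular Gramian one would switch to `np.linalg.pinv` or `np.linalg.lstsq`, or add a ridge penalty):

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equation (X_aug^T X_aug) w_hat = X_aug^T y."""
    X_aug = np.column_stack([X, np.ones(len(X))])  # augment with a ones column for the bias
    gram = X_aug.T @ X_aug                         # Gramian matrix
    moment = X_aug.T @ y                           # moment vector
    w_hat = np.linalg.solve(gram, moment)          # prefer solve() over forming the inverse
    return w_hat[:-1], w_hat[-1]                   # weights w and bias b

# Hypothetical usage on noisy data generated from y = 2x - 1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2 * X[:, 0] - 1 + rng.normal(scale=0.1, size=100)
w, b = fit_least_squares(X, y)  # w close to [2.0], b close to -1.0
```

Solving the linear system directly is numerically preferable to computing \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly, even though both express the same formula.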

Simple Linear Regression

When we do not generalize the problem to \(D\) features, we get simple linear regression, which determines how one variable of interest is affected by changes in another variable. With the model

\[y=f(x) = ax+b+\epsilon\]

we can again create the loss function:

\[\mathcal{L}(a, b) = \sum\limits_{i=1}^N(\epsilon^{(i)})^2 = \sum\limits_{i=1}^N(y^{(i)} - a x^{(i)} - b)^2\]

We then minimize the loss function by setting the partial derivatives to zero, which is the necessary condition for an optimum:

\[\frac{\partial \mathcal{L}(a, b)}{\partial a} = -2\sum\limits_{i=1}^N x^{(i)}(y^{(i)} - a x^{(i)} - b)\overset{!}{=}0\]

\[\frac{\partial \mathcal{L}(a, b)}{\partial b} = -2\sum\limits_{i=1}^N (y^{(i)} - a x^{(i)} - b)\overset{!}{=}0\]

We start with the second of the two equations and rearrange for \(b\):

\[\begin{array}{rrl} & -2\sum\limits_{i=1}^N(y^{(i)}-ax^{(i)}-b) &=0\\ \Leftrightarrow & \sum\limits_{i=1}^N y^{(i)} - a\sum\limits_{i=1}^N x^{(i)}&=Nb\\ \Leftrightarrow & \frac{1}{N}\sum\limits_{i=1}^N y^{(i)} - \frac{a}{N}\sum\limits_{i=1}^N x^{(i)}&=b\\ \Leftrightarrow&\overline{y} - a\overline{x} &= b \end{array}\]

This makes sense, since the y-intercept is chosen such that the line passes through the point \((\overline{x}, \overline{y})\), the center of the data.

We now take the first of the two equations and rearrange for \(a\):

\[\begin{array}{rrl} & -2\sum\limits_{i=1}^Nx^{(i)}(y^{(i)}-ax^{(i)}-b) &= 0\\ \Leftrightarrow & \sum\limits_{i=1}^Nx^{(i)}(y^{(i)}-ax^{(i)}-(\overline{y} - a\overline{x})) &= 0 \, (\text{substitute } b=\overline{y} - a\overline{x})\\ \Leftrightarrow & \sum\limits_{i=1}^N (x^{(i)}y^{(i)}-ax^{(i)}x^{(i)}-x^{(i)}\overline{y} + ax^{(i)}\overline{x}) &= 0\\ \Leftrightarrow & \sum\limits_{i=1}^N (x^{(i)}y^{(i)}-x^{(i)}\overline{y}) - a\sum\limits_{i=1}^N(x^{(i)}x^{(i)} - x^{(i)}\overline{x}) &= 0\\ \Leftrightarrow & \frac{\sum\limits_{i=1}^N (x^{(i)}y^{(i)}-x^{(i)}\overline{y})}{\sum\limits_{i=1}^N(x^{(i)}x^{(i)} - x^{(i)}\overline{x})} &= a\\ \Leftrightarrow & \frac{\sum\limits_{i=1}^N x^{(i)}y^{(i)}-N\overline{x}\,\overline{y}}{\sum\limits_{i=1}^N x^{(i)}x^{(i)} - N\overline{x}^2} &= a\\ \end{array}\]

By noticing that \(\sum\limits_{i=1}^N (\overline{x}^2-x^{(i)}\overline{x})=0\) and \(\sum\limits_{i=1}^N(\overline{x}\overline{y}-y^{(i)}\overline{x})=0\) we can simplify further:

\[\begin{array}{rrl} & a&=\frac{\sum\limits_{i=1}^N (x^{(i)}y^{(i)}-x^{(i)}\overline{y}) + \sum\limits_{i=1}^N(\overline{x}\overline{y}-y^{(i)}\overline{x})}{\sum\limits_{i=1}^N(x^{(i)}x^{(i)} - x^{(i)}\overline{x})+\sum\limits_{i=1}^N (\overline{x}^2-x^{(i)}\overline{x})}\\ & &= \frac{\sum\limits_{i=1}^N (x^{(i)}-\overline{x})(y^{(i)}-\overline{y})}{\sum\limits_{i=1}^N(x^{(i)} - \overline{x})^2}\\ &&=\frac{Cov(x, y)}{Var(x)} \end{array}\]
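
A minimal sketch of these closed-form estimates, assuming NumPy arrays `x` and `y` of equal length:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Slope a = Cov(x, y) / Var(x), intercept b = mean(y) - a * mean(x)."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - a * x_bar
    return a, b
```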

Feature Mapping

If the data shows no linear correlation, we can try to fit a low-degree polynomial instead, which is known as polynomial regression, such as

\[y=w_3x^3+w_2x^2+w_1x+w_0+\epsilon\]

Do we need a new algorithm for that? No, we can just create a feature map

\[\phi(x) = \left(\begin{array}{c}1\\x\\x^2\\x^3\end{array}\right)\]

and build the regression model

\[y = \langle\mathbf{w}\cdot\phi(x)\rangle + \epsilon\]
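
Since the constant feature 1 in \(\phi(x)\) absorbs the bias, ordinary least squares can be reused unchanged on the mapped inputs. A sketch (assuming NumPy; `np.vander` with `increasing=True` builds the powers \(1, x, x^2, x^3\)):

```python
import numpy as np

def poly_features(x, degree=3):
    """Feature map phi(x) = (1, x, x^2, ..., x^degree) for each sample in x."""
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, y, degree=3):
    """Least-squares fit of w in y = <w, phi(x)>; lstsq also copes with a singular Gramian."""
    Phi = poly_features(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```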

The question now is how well the model generalizes instead of merely memorizing the data points. A model that is too simple, such as a polynomial of degree one, might underfit, while a polynomial of high degree might be too complex and overfit.

The degree of the polynomial becomes a hyperparameter of the model. We need to optimize over the hyperparameters using cross-validation: split the data into a training, validation, and test set, tune the hyperparameter on the validation set, and finally report the performance on the test set.
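
A rough sketch of this selection procedure (the 60/20/20 split and the candidate degrees are hypothetical choices; `poly_features` and `fit_polynomial` are the sketches from above):

```python
import numpy as np

def select_degree(x, y, degrees=range(1, 10), seed=0):
    """Pick the polynomial degree with the lowest validation error, then report the test error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train, n_val = int(0.6 * len(x)), int(0.2 * len(x))
    train, val, test = np.split(idx, [n_train, n_train + n_val])

    def mse(w, degree, subset):
        return np.mean((poly_features(x[subset], degree) @ w - y[subset]) ** 2)

    best = min(degrees, key=lambda d: mse(fit_polynomial(x[train], y[train], d), d, val))
    w = fit_polynomial(x[train], y[train], best)
    return best, mse(w, best, test)  # performance reported on the held-out test set
```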

Ridge Regression