# Linear Regression

The problem of linear regression consists of finding a linear function

\[f(\mathbf{x}) = \langle\mathbf{w}\cdot\mathbf{x}\rangle + b\]

that best interpolates a given training data set. Geometrically this corresponds to a hyperplane fitting the given points; more practically, it can be pictured as a rod held in balance by springs attached to each data point. The following illustrates a one-dimensional linear regression:

*(Figure: a one-dimensional linear regression, showing the fitted line and the error \(\xi\) at each training point.)*

The distance \(\xi\) is the error for any particular training example. The linear regression problem has been studied since the 18th century, and the best-known solution is that proposed independently by Gauss and Legendre: choosing the line that minimises the sum of the squares of the distances from the training points. This technique is known as least squares, and it is optimal in the case of linear targets corrupted by Gaussian noise.

Numerical stability and generalisation considerations motivated the introduction of a variation of the least squares technique, which is analogous to the maximal margin hyperplane in the classification case: choosing a function that minimises a combination of the square loss and the norm of the vector \(\mathbf{w}\). This solution, proposed by Hoerl and Kennard, is known as *ridge regression*. Both of these algorithms require the inversion of a matrix, though a simple iterative procedure also exists, known as the Adaline algorithm developed by Widrow and Hoff. These regression techniques can also be used for classification problems with a careful choice of the target values associated with the classes.
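The text gives no code, but the ridge regression idea above admits a short closed-form sketch: minimising the square loss plus \(\lambda\lVert\mathbf{w}\rVert^2\) leads to solving \((\hat{\mathbf{X}}^T\hat{\mathbf{X}} + \lambda P)\hat{\mathbf{w}} = \hat{\mathbf{X}}^T\mathbf{y}\). The function name, the choice of \(\lambda\), and the convention of leaving the bias \(b\) unpenalised are illustrative assumptions, not from the text:

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Minimise sum of squared errors plus lam * ||w||^2.

    The input matrix X is augmented with a column of ones so the
    bias b is folded into the weight vector; the bias coordinate
    is (by the usual convention, assumed here) left unpenalised.
    """
    ell, n = X.shape
    X_hat = np.hstack([X, np.ones((ell, 1))])
    # Penalty matrix: lam on the weight coordinates, 0 on the bias.
    P = lam * np.eye(n + 1)
    P[n, n] = 0.0
    w_hat = np.linalg.solve(X_hat.T @ X_hat + P, X_hat.T @ y)
    return w_hat[:n], w_hat[n]  # (w, b)

# Noisy linear data around y = 2x + 1 (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=50)
w, b = ridge_regression(X, y, lam=0.1)
```

With a small \(\lambda\) the recovered parameters stay close to the generating line, while larger values shrink \(\mathbf{w}\) towards zero, trading bias for numerical stability.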

## Least Squares

Given a training set with \(\mathbf{x}_i\in\mathbf{X}\subseteq\mathbb{R}^n, y_i\in\mathbf{Y}\subseteq\mathbb{R}\), the problem of linear regression consists of finding a linear function \(f\) that models the data

\[y=f(\mathbf{x}) = \langle\mathbf{w}\cdot\mathbf{x}\rangle + b\]

The least squares approach prescribes choosing the parameters \((\mathbf{w}, b)\) to minimise the sum of squared deviations of the data,

\[L(\mathbf{w}, b) = \sum\limits_{i=1}^{\ell}(y_i - \langle\mathbf{w}\cdot\mathbf{x_i}\rangle - b)^2\]

The function \(L\) is called the square loss function, as it measures the amount of loss associated with a particular choice of parameters. We can minimise \(L\) by differentiating with respect to the parameters \((\mathbf{w}, b)\) and setting the resulting equations to zero. To simplify the notation, we define \(\hat{\mathbf{w}} = (\mathbf{w}^T, b)^T\) and

\[\hat{\mathbf{X}} = \left(\begin{array}{c}\hat{\mathbf{x}}_1^T\\\hat{\mathbf{x}}_2^T\\\vdots\\\hat{\mathbf{x}}_{\ell}^T\\\end{array}\right)\text{ where }\hat{\mathbf{x}}_i=(\mathbf{x}_i^T, 1)^T\]

With this notation, the vector of differences between the target values and the weighted input data becomes

\[\mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{w}}\]

Hence, the loss function can be written as

\[L(\hat{\mathbf{w}}) = (\mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{w}})^T(\mathbf{y} - \hat{\mathbf{X}}\hat{\mathbf{w}})\]
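As a quick numerical sanity check (not part of the original text), the summation form of \(L\) and the matrix form above agree for any choice of parameters; the data and parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))   # six training points in R^2
y = rng.normal(size=6)        # arbitrary targets
w, b = np.array([0.5, -1.0]), 0.3

# Summation form: sum_i (y_i - <w, x_i> - b)^2
L_sum = sum((y[i] - X[i] @ w - b) ** 2 for i in range(6))

# Matrix form: (y - X_hat w_hat)^T (y - X_hat w_hat)
X_hat = np.hstack([X, np.ones((6, 1))])   # append bias column
w_hat = np.append(w, b)
r = y - X_hat @ w_hat
L_mat = r @ r

assert np.isclose(L_sum, L_mat)
```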

Taking the derivative of the loss function with respect to \(\hat{\mathbf{w}}\) and setting it equal to zero yields

\[\frac{\partial L }{\partial\hat{\mathbf{w}}} = -2\hat{\mathbf{X}}^T\mathbf{y} + 2\hat{\mathbf{X}}^T\hat{\mathbf{X}}\hat{\mathbf{w}}=0\]

Rearranging the terms and dividing by two yields the well-known *normal equation*:

\[\hat{\mathbf{X}}^T\hat{\mathbf{X}}\hat{\mathbf{w}}=\hat{\mathbf{X}}^T\mathbf{y}\]
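The normal equation can be solved directly with a linear solver. The following NumPy sketch (data and variable names are illustrative, not from the text) fits a one-dimensional example whose points lie exactly on \(y = 3x - 2\), so least squares recovers the generating line:

```python
import numpy as np

# Training data lying exactly on y = 3x - 2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-2.0, 1.0, 4.0, 7.0])

# Augment each input with a 1 so the bias b is folded into w_hat.
X_hat = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the normal equation X_hat^T X_hat w_hat = X_hat^T y.
w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
w, b = w_hat[:-1], w_hat[-1]
print(w, b)  # w ≈ [3.], b ≈ -2.
```

In practice one would use `np.linalg.lstsq` (or a QR/SVD-based routine) rather than forming \(\hat{\mathbf{X}}^T\hat{\mathbf{X}}\) explicitly, since the explicit product squares the condition number; that numerical-stability concern is exactly what motivates ridge regression above.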