raw math

Imagine, you have a coordinate system with some points. Parallel to the x-axis is a line, which is connected to each of the points. Your goal is to figure out what the minimum accumulated distance of the points to the line is by moving the horizontal line up and down. Look at this illustration and use the mouse to get a feeling for it:

When you now think, that's an easy one, just take the sum of the squared distances, take the derivative and use it to optimize the problem, then think again. What happens when two points are on a line and another point is far away?

Obviously, the squared distance can't minimize this problem.

Solution to the optimization problem

We have a set of points, where only the y-coordinates are of interest. As such

\[ Y = \{y_1, y_2, ..., y_n\} \subset \mathbb{R} \]

The goal is to find a \(\hat{y}\) such that the sum of the distances is minimal. This means:

\[\underset{\hat{y}}{argmin} \sum\limits_{y\in Y} |y-\hat{y}|\]

If you would take the derivative, you would need to solve for

\[\sum\limits_{y\in Y} sgn(y-\hat{y})=0\]

Of course, you could try every imaginable \(\hat{y}\) and brute force the solution. But we can do much better. Since points can be re-arramged, sorted and separated, we can split the original set into two sets:

\[ P = \{p | p\in Y : p \leq \hat{y}\}\\ Q = \{q | q\in Y : q > \hat{y} \} \]

\[ \begin{array}{rrl} & f(\hat{y}) =& \sum\limits_{y\in Y} |y-\hat{y}| \\ & =& \sum\limits_{p\in P} |p-\hat{y}| + \sum\limits_{q\in Q} |q-\hat{y}| \\ & =& -\sum\limits_{p\in P} (p-\hat{y}) + \sum\limits_{q\in Q} (q-\hat{y}) \\ & =& -(\sum\limits_{p\in P} p - \sum\limits_{p\in P}\hat{y}) + (\sum\limits_{q\in Q} q - \sum\limits_{q\in Q}\hat{y}) \\ & =& -\sum\limits_{p\in P} p + |P|\hat{y} + \sum\limits_{q\in Q} q - |Q|\hat{y} \\ & \frac{\delta}{\delta \hat{y}}f(\hat{y}) =& \frac{\delta}{\delta \hat{y}} (-\sum\limits_{p\in P} p + |P|\hat{y} + \sum\limits_{q\in Q} q - |Q|\hat{y}) \\ & =& |P| - |Q| \\ \Rightarrow & 0 =& |P| - |Q| \\ \Leftrightarrow & |Q| =& |P| \\ \end{array} \]

That is saying, that we arive the optimum if we split the original set into two sets of the same cardinality. If you remember your stats class, you'll easily see what metric can be used:

\[\hat{y} = \widetilde{y}\]

Or spoken more clearly: The value that minimizes the sum of the absolute distance is the median. If you go back up to the plot and click on it, it will jump to the optimal median value.