raw database

Calculating the standard deviation in MySQL is a no-brainer by using the build-in aggregate function STDDEV(). If you don't need the original data and only want to save aggregated values in your database, the whole matter is getting more complicated - but is worth from a space and performance point of view.

Let's create a tet table and populate it with some random values:

CREATE TABLE t1(
  group_value INT UNSIGNED NOT NULL,
  value DOUBLE NOT NULL
);

INSERT INTO t1 (group_value, value) 
SELECT FLOOR(RAND() * 5), RAND() * 1000
FROM information_schema.tables
LIMIT 50;

A common query can then take place in order to retrieve the aggregated data:

SELECT group_value, COUNT(*), AVG(value), STD(value)
FROM t1
GROUP BY group_value;
+-------------+----------+--------------------+--------------------+
| group_value | COUNT(*) | AVG(value)         | STD(value)         |
+-------------+----------+--------------------+--------------------+
|           0 |       13 |  558.1130023798883 |    307.22234586301 |
|           1 |       10 |  560.2472776176849 |  287.3629941826582 |
|           2 |       11 | 426.06415503799786 | 219.53149074497404 |
|           3 |        7 |  486.4831761597233 | 315.31028664538627 |
|           4 |        9 |  540.8957763955889 |  221.4856395901883 |
+-------------+----------+--------------------+--------------------+
5 rows in set (0.00 sec)

The result that we've now generated with the query once should, however, be the content of the table at any given time. Of course, you could store the original data and update a summary table after each DML operation. You could also use a cron-job or other hacky things to keep the table in sync. But any of these operations is O(n)! So, let's create the table where we will store our data grouped by the group_value column:

CREATE TABLE t2(
  group_value INT UNSIGNED NOT NULL KEY,
  group_count INT UNSIGNED NOT NULL DEFAULT 1,
  group_avg DOUBLE NOT NULL,
  group_var DOUBLE NOT NULL DEFAULT 0
);

I got the idea for the following solution when I dealt with online algorithms for the calculation of statistical moments for the skewness and kurtosis functions of my MySQL Infusion UDF. The standard deviation is calculated quite simple in this context, because it's only based on the 2nd statistical moment, and can be produced with the following algorithm:

num = 0
avg = 0
m2 = 0

for i in data
  num = num + 1
  m2 = m2 + (i - avg) * (i - avg + (i - avg) / num)
  avg = avg + (i - avg) / num

std = sqrt(m2 / num)

Applied to MySQL, we formulate the following insert command

INSERT INTO t2 (group_value, group_avg)
SELECT group_value, value
FROM t1
ON DUPLICATE KEY UPDATE
   group_count = group_count + 1,
   group_var = group_var + (VALUES(group_avg) - group_avg) * (VALUES(group_avg) - (group_avg + (VALUES(group_avg) - group_avg) / group_count)),
   group_avg = group_avg + (VALUES(group_avg) - group_avg) / group_count;

SELECT group_value GVal, group_count GCnt, group_avg GAvg, SQRT(group_var / group_count) GStd
FROM t2;
+------+------+-------------------+--------------------+
| GVal | GCnt | GAvg              | GStd               |
+------+------+-------------------+--------------------+
|    0 |   13 | 558.1130023798883 |    307.22234586301 |
|    1 |   10 | 560.2472776176849 |  287.3629941826582 |
|    2 |   11 | 426.0641550379979 | 219.53149074497404 |
|    3 |    7 | 486.4831761597233 | 315.31028664538627 |
|    4 |    9 | 540.8957763955889 |  221.4856395901883 |
+------+------+-------------------+--------------------+
5 rows in set (0.00 sec)

Up to this point it was pretty easy to formulate the queries. However, if you want to group the agregated table again (for example because group_value is a combined index), it gets a little more complicated. In this example, I'll split the table into two halves by using the grouping criterion (g <= 2). Of course, it could also be grouped by any other column (or expression, if you're aware of the non-index usage):

SELECT group_value <= 2 GVal, GROUP_CONCAT(DISTINCT group_value) GIds, COUNT(*) GCnt, AVG(value) GAvg, STD(value) GStd, SUM(value) GSum
FROM t1
GROUP BY 1;
+------+-------+------+-------------------+--------------------+--------------------+
| GVal | GIds  | GCnt | GAvg              | GStd               | GSum               |
+------+-------+------+-------------------+--------------------+--------------------+
|    0 | 3,4   |   16 | 517.0902637923976 | 267.99077800353365 |  8273.444220678362 |
|    1 | 0,2,1 |   34 | 516.0190444862757 |  282.5581110287621 | 17544.647512533375 |
+------+-------+------+-------------------+--------------------+--------------------+
2 rows in set (0.00 sec)

The same result can be re-produced with our aggregate table:

SELECT group_value <= 2 GVal, GROUP_CONCAT(group_value) GIds, SUM(group_count) GCnt, SUM(group_avg * group_count) / SUM(group_count) GAvg, SUM(group_avg * group_count) GSum
FROM t2
GROUP BY 1;
+------+-------+------+-------------------+--------------------+
| GVal | GIds  | GCnt | GAvg              | GSum               |
+------+-------+------+-------------------+--------------------+
|    0 | 3,4   |   16 | 517.0902637923976 |  8273.444220678362 |
|    1 | 0,1,2 |   34 | 516.0190444862757 | 17544.647512533375 |
+------+-------+------+-------------------+--------------------+
2 rows in set (0.00 sec)

Okay, but how do we group the standard deviation? In the example above, we summarize the group 3 and 4 into 0 and 0, 1 and 2 into 1. But how does this affect the standard deviation? After some research for this problem I found the propagation of errors, but I didn't understood it so quickly, so let's do it by ourselves. The variance (which is the square of standard deviation) is defined by:

\[ \sigma_x^2 = \frac{1}{n-1} \sum_{i=0}^n (x_i - \bar{x})^2 \]

Let's play with the example \( G = {3, 4} \), which countains \( m=7 \) and \( n=9 \) elements. I call the original elements of the series \( x_1, x_2, ..., x_m \) and \( y_1, y_2, ..., y_n \), with their corresponding variances \( \sigma_x^2 \) and \( \sigma_y^2 \). I define the new variance for GVal=0 as

\[ \sigma_z^2 = \frac{1}{m+n-1}\sum_{i=0}^{n+m} (z_i-\bar z)^2 \]

which is the same as

\[ \sigma_z^2 = \frac{1}{m+n-1}\sum_{i=0}^m (x_i-\bar z)^2 + \sum_{i=0}^n (y_j-\bar z)^2 \]

and again:

\[ \sigma_z^2 = m * (\bar x - \bar z)^2 + m * \sigma_x^2 + 2 * (\bar x - \bar z) * \sum_{i=0}^m (x_i-\bar x) + n * (\bar y - \bar z)^2 + n * \sigma_y^2 + 2 * (\bar y - \bar z) * \sum_{j=0}^n (y_j-\bar y) \]

It all looks a bit messy, but because

\[ \sum_{i=0}^m (x_i-\bar x) = 0 \]

and

\[ \sum_{j=0}^n (y_j-\bar y) = 0, \]

the equation can be simplified to:

\[ \sigma_z^2 = m * (\bar x - \bar z)^2 + m * \sigma_x^2 + n * (\bar y - \bar z)^2 + n * \sigma_y^2 . \]

The only missing piece of the puzzle is \( \bar z \), which can be determined by a different group operation with the following equation:

\( \bar z = \frac{m\bar x + n\bar y}{m + n} \)

All the information we have collected now can be implemented in MySQL as follows:

SELECT group_value <= 2 GVal, SQRT(SUM(group_count * group_var / group_count + group_count * POW(group_avg - W, 2)) / SUM(group_count)) GStd
FROM t2 A
JOIN (
   SELECT group_value <= 2 GVal, SUM(group_avg * group_count) / SUM(group_count) W
   FROM t2
   group by 1
)B ON (group_value &lt= 2) = B.GVal
GROUP BY 1;
+------+--------------------+
| GVal | GStd               |
+------+--------------------+
|    0 |  267.9907780035337 |
|    1 | 282.55811102876214 |
+------+--------------------+
2 rows in set (0.00 sec)

I skip integrating the query into the top query for clarity and legibility. I wonder how I could save the join with a clever use of GROUP BY WITH ROLLUP - Does anyone have an idea?

Update of 21.04.2014

TL;DR: If you want to have a simple copy&paste example, have a look at:

SELECT AVG(value) AVG, STD(value) STD
FROM t1
GROUP BY group_value <= 2;

... which becomes this:

SELECT
	SUM(group_avg * group_count) / SUM(group_count) AVG,
	SQRT(SUM(group_sum2) / SUM(group_count) - POW(SUM(group_sum) / SUM(group_count), 2)) STD
FROM t2
GROUP BY group_value <= 2;