MySQL Infusion UDF for statistical analysis
Since I use MySQL for the statistical analysis on a project, I wanted to optimize the database queries and learned a lot about stuff like number theory, set theory and partial sums. I took my MySQL UDF, I've published two years ago, for this purpose and added new functions for a deeper statistical analysis. The project is around for a while, so it's time to share things with the public to start a discussion of how things could be further optimized. The source and a small documentation can be found on Github:
Jan Steemann has published a similar project for an older version of MySQL about 8 years ago. I hereby take on the project and will keep it up to date and extend it with functionality that is useful for a RDBMS. The latest version of Jan's UDF remains available for archiving purposes. I haven't taken over all functions and features from Jan and also some functions of my old extension were dropped. What functions are implemented now, can be determined via the readme on Github. The decision of which function would remain in the UDF was made on the following criteria: Is it possible to save traffic between app- and DB server? Is it already possible to formulate a query to get the required information? If so, how bulky would this query be? Would a native query might be even faster?
In the following I'd like to outline briefly the major new features and demonstrate my approaches to implement the algorithms. Not all algorithms are already optimal, so feedback is desirable.
Determine the median is quite simple if two things are given: A sorted list and the total number of elements. It's more difficult if both information are missing and you have an unsorted set of variable size. There is an interesting discussion about this topic on Stackoverflow. Unfortunately, all these approaches require the caching of data, which is impossible with huge amounts of data. The currently published implementation of the median function stores all data temporarily, too, but uses a very fast algorithm, which I found in "Numerical recipes in C". Well, the solution isn't scalable because of it's space complexity limitation. Another idea I've tested was an approximation by introducing a fixed size grid of, say, 10,000 buckets. With this solution, you could probably get fairly accurate results, even in a dimension of billions of records. But without knowledge about the data boundaries, this approach has also it's limitations.
I had another rather naive idea to find the median on a growing set with a very high accuracy. I would again take a buffer of about 10,000 elements, where we start filling buckets at the middle in order to produce a sorted list step by step. Sorting would be done direclty in place by creating slots by moving all interfering elements to the right or to the left with memmove(). The memmove()-offset can be determined via a binary search. It must be noted that the median remains in the middle at all times! If there should be no room on the left or the right, we throw away the data and start counting the displacement. This way we have the chance to shift around and have a buffer of 5.000 elements on each side.
For the above-mentioned project, I've a lot of reports of the form "How many elements are smaller than average?" (LESSAVG), "The sum of how many elements are less than X?" (LESSPART) and "The partial sum of how many elements are smaller than 50% of the total sum?" (LESSPARTPCT).
These are also quite interesting mathematical problems if you assume an unordered set. With native SQL, the problem might be solved with a self-join, like in the example implementation of LESSPARTPCT(x, 0.5):
SELECT COUNT( c ) AS LESSPARTPCT FROM( SELECT x, @x:= @x + x, IF(@x < @sum * 0.5, 1, NULL) AS c FROM t1 JOIN( SELECT @x:= 0, @count:= 0, @sum:= SUM(rnum) FROM t1 )x ORDER BY x )x
On my problem, a self-join wouldn't be possible without creating a temporary table. But that's even more complicated when you have to GROUP your data.
The current implementation uses a binary search, and memmove() to always have a sorted list, as I have mentioned it above for improving the median calculation. Then an operation runs on the sorted data, which leads to the final result of either the average or the sum.
Generally one could say that the UDF-API is quite limited for that type of functions. For performance reasons, we couldn't justify it, but going through a set twice if the UDF init()-function asks for it, a storage of the data in memory would only be necessary if sorting is involved. But even this could solve a flag, that MySQL sends the data sorted to the UDF (using an index if possible).
Other new functions
For the higher statistical moments (skewness and kurtosis) exists a solution that doesn't require caching of the data. Also the covariance has a quite simple online algorithm that I implemented in addition to the other two functions.
Another little function is the Binomeal-Coefficient, so that you can use it directly in calculations in MySQL. The implementation of the noverk()-function is very fast and based on the algorithm on Wikipedia:
if k < 0 or k > n: return 0 if k > n - k: k = n - k c = 1 for i in range(k): c = c * (n - (k - (i + 1))) c = c // (i + 1) print c
What I have further optimized to:
c = 0 if k > 0 and n > 0 and n > k: n = n - k c = 1 for i in range(1, k + 1): c = c * (n + i) // i print c
You might also be interested in the following
- Statistical functions in MySQL
- Running Standard Deviation in MySQL
- Optimal index size for variable text in MySQL
- Fast circular buffer in MySQL
Sorry, comments are closed for this article. Contact me if you want to leave a note.