raw database

This article is a bit MySQL-centric but the idea is smoothly adaptable on just about every programming language which offers the ability to access bit's on numbers. Sure, if you don't program, you could also use the mathematical background of this approach.

Of course , the usage of such a method on a normalized database contradicts some design patterns of a RDBMS. But let's bring the idea into effect and don't care about such limitations. I'll write an article about denormalization in a different post. To beforehand a bit, I use this method in a denormalized environment to reduce the index size, have a cleaner index usage (no playing with FORCE() and USE()) and to reduce the database size at all.

Okay, let's dig a little deeper. I started at anytime to optimize as much as possible with binary algorithms. And so I got the really simple idea to store small numbers in a larger big range. A normal integer consumes about 32 bits, with the difference of usable bits in the signed and unsigned version. In some applications you have many predestined numbers, that will never grow over a certain maximum, of let's say 10 for example. You could use the data type TINYINT in MySQL, or short in a C/C++ application, but in every case you have a waste of disk space or memory. The binary representation of 1010 is 10102, which uses 4 digits. Wouldn't it be nice to fill a 32 bit integer with as much as possible small numbers to reduce the storage size? In this example, we could store 8 numbers up to a max value of 24 - 1 in a single unsigned integer. I think you got it :-)

Let's come further with a little example. Think about the binary representation of a small number packed in a big as the marked-out range in the example:

11.1011.0010.00012

Now we need a mathematical way of getting and setting our 102 in this gap, which I've highlighted.

So let's think about, how we could describe a getting of a number in the range from place 4 to 7. To speak a bit more abstract, we call this range x to y on the number n. The problem is, our number is surrounded by other digits on the left and the right (except for the boundaries). The bits on the right hand, are eliminated very simple by a right shift by x places. After that, we create a mask of ones exactly in the size of our gap. You could think about such a mask as follows:

00.0000.11112

The rest is really straightforward. We unify our mask with the right shifted value and have the desired number as result. A simple implementation of a MySQL function, I use on production systems, looks like this:

CREATE FUNCTION getint(n INT UNSIGNED, x INT UNSIGNED, y INT UNSIGNED)
  RETURNS INT UNSIGNED
    DETERMINISTIC
RETURN (n >> x) & ((2 << (y - x)) - 1);

I optimized the matter of 2n+1 by left-shifting 2 by the originated delta. So I've saved the increase by 1. I think most programmers know that 2n is the same as 1 << n. Our last problem is to store or set a small number in another number. Let's draw a little example to get a faster understanding of the problem with a random number:

1.0111.0111.0010.00012

Now we create a mask and use it to reset our gap to zero:

1.1111.1111.0000.11112

...then OR the result together with our left shifted 10102:

0.0000.0000.1010.00002

Creating the mask is a optional step to get a clean initial situation for the rest, but I really recommend the usage to prevent wrong results. Creating the mask is the complicated step here. So how would we describe it? Let's use the same approach as we already used for the getting. Then invert the mask and Ta-dah! A simple implementation as MySQL function to set the number m in the range x to y could look like this:

CREATE FUNCTION setint(n INT UNSIGNED, x INT UNSIGNED, y INT UNSIGNED, m INT UNSIGNED)
  RETURNS INT UNSIGNED
    DETERMINISTIC
RETURN n & (~(((2 << (y - x)) - 1) << x)) | (m << x);

If you don't want to understand the details of this stuff, and only want to use it, here is a little example that is not so abstract. We use a random number at first - let's say 4283942 and want to know, what number is stored in the bit range between 4 and 8:

mysql> SELECT GETINT(4283942, 4, 8);
+-----------------------+
| getint(4283942, 4, 8) |
+-----------------------+
|                     2 |
+-----------------------+
1 row in set (0.00 sec)

The binary representation of our random number looks like this:

100.0001.0101.1110.0010.01102

The highlighted range is 000102, or better 210. Let's store the decimal number 1010 in the range and overwrite the initial value 210.

mysql> SELECT SETINT(4283942, 4, 8, 10);
+---------------------------+
| setint(4283942, 4, 8, 10) |
+---------------------------+
|                   4284070 |
+---------------------------+
1 row in set (0.00 sec)

We could translate the number to binary to check the result, but we have a better way to do the check with getint():

mysql> SELECT GETINT(4284070, 4, 8);
+-----------------------+
| getint(4284070, 4, 8) |
+-----------------------+
|                    10 |
+-----------------------+
1 row in set (0.00 sec)