raw database

Flag based COUNT using MySQL

Robert Eisele

I was asked for help in optimizing a MySQL query where flags are stored in a database and references should be counted based on the flag value. Sounds not complicated at first, but there are several flags that should be counted and also just once per reference. A lot of food for GROUP BY you may think. Having said this, search and group for flags in this table would be really slow due to a very poor cardinality. But let's start with the actual problem. The example is fictitious, but I did my best to find a general use case for this problem.

Let's imagine we have a table with translators. Each of these translators is fluent in several languages. Whether he speaks a language is kept in the user/translator table as a TINYINT column which only gets the values 0 or 1. So, each of these translators now receives a document, which he translates into all languages he speaks. Translators with the same language can help each other. The question is what is the total number of translations made, if all translators finished the work on their documents. The number of languages is relatively constant and will not be changed. A little illustration would look something like this:

...
Stephen speaks fr, de, es
      document:1      -> + 3 translations
      document:2      -> + 3 translations
      document:3      -> + 3 translations

Martin speaks de, es
      document:2      -> + 0 translations
      document:4      -> + 2 translations

Ronald speaks fr, pl, it
      document:1      -> + 2 translations
      document:5      -> + 3 translations
=========================================
                  result: 16

This means Stephen made 9 translations on 3 documents. Martin made 4 translations on 2 documents, where 2 were already counted with Stephen. The same for Ronald, respectively. At the end we get a sum of 16 documents in 5 different languages. Let's bring this information into a database:

CREATE TABLE user (
   id int unsigned AUTO_INCREMENT KEY,
   name varchar(64) NOT NULL,

   lang_de tinyint unsigned not null,
   lang_en tinyint unsigned not null,
   lang_fr tinyint unsigned not null,
   lang_es tinyint unsigned not null,
   lang_be tinyint unsigned not null,
   lang_pl tinyint unsigned not null,
   lang_cz tinyint unsigned not null,
   lang_nl tinyint unsigned not null,
   lang_it tinyint unsigned not null
);

CREATE TABLE document_user (
   user_id INT UNSIGNED NOT NULL,
   document_id INT UNSIGNED NOT NULL,
   KEY(user_id),
   KEY(document_id)
);

INSERT INTO user (id, name, lang_fr, lang_de, lang_es, lang_pl, lang_it) VALUES
(1, 'Stephen', 1, 1, 1, 0, 0),
(2, 'Martin',  0, 1, 1, 0, 0),
(3, 'Ronald',  1, 0, 0, 1, 1);

INSERT INTO document_user (user_id, document_id) VALUES
(1, 1),(1, 2),(1, 3),
(2, 1),(2, 4),
(3, 1),(3, 5);

As I don't like several columns for flags because of it's fixedness, I'll add a new column which unifies all the flags as bit-flags into one integer column, which in turn makes the further steps really simple:

ALTER TABLE user ADD lang INT UNSIGNED NOT NULL;

The following Horner scheme-like query is sufficient in order to combine the old flags within the new column:

UPDATE user SET lang = (((((((
(
  (
    (
      (
        (
          (
            (
              (lang_it << 1) | lang_nl
            ) << 1) | lang_cz
          ) << 1) | lang_pl
        ) << 1) | lang_be
      ) << 1) | lang_es
    ) << 1) | lang_fr
  ) << 1) | lang_en
) << 1) | lang_de;

After droppping the old columns our user table looks like this:

+----+---------+------+
| id | name    | lang |
+----+---------+------+
|  1 | Stephen |   13 |
|  2 | Martin  |    9 |
|  3 | Ronald  |  292 |
+----+---------+------+
3 rows in set (0.00 sec)

Based on our new database schema, the task is relatively easy to solve, just ORing all elements together and count the number of bit's set:

SELECT document_id, BIT_COUNT(BIT_OR(lang))
FROM document_user
JOIN user ON user.id=user_id
GROUP BY document_id;

Which gives us this output, means the number of translations per document:

+-------------+-------------------------+
| document_id | BIT_COUNT(BIT_OR(lang)) |
+-------------+-------------------------+
|           1 |                       5 |
|           2 |                       3 |
|           3 |                       3 |
|           4 |                       2 |
|           5 |                       3 |
+-------------+-------------------------+
5 rows in set (0.00 sec)

The last step is summing up the number of documents. This could be done in the script language of your choice or better:

SELECT SUM(tmp) documents
FROM(
   SELECT document_id, BIT_COUNT(BIT_OR(lang)) tmp
   FROM document_user
   JOIN user ON user.id=user_id
   GROUP BY document_id
)X;

And we get our expected result:

+-----------+
| documents |
+-----------+
|        16 |
+-----------+
1 row in set (0.00 sec)

The final query looks pretty simple and makes good usage of the underlying index structure. A more obvious solution would be adding a new table which normalizes the problem. However, this way you'll keep your database busy with joins and group-by's. That's why I don't understand why Drizzle abolished binary functions and you only can go the official way. Finding the easist solution to a given problem is one of the key factors to operate an economic system. Nuff said :)