raw database

You often see databases with huge dynamic text fields, such as VARCHAR(255), TEXT, or as I recently was allowed to see the blanket use of LONGTEXT (max 4GiB) in order to be invulnerable from all contingencies. Things getting even worse when an index is used over such columns, because hey, there is an index. It makes things fast :-) Okay, jokes aside. Often you can save a lot of space and time, MySQL spends traversing the index when using a proper column type and index size.

Suppose we have a table like the following:

CREATE TABLE post(
  PID int unsigned auto_increment key,
  PTitle varchar(64) not null,
  PText mediumtext not null,
  key(PTitle)
);

What would be the optimal index size of PTitle if we want to make the column quickly locatable? Well, it depends on the case of application. I built a small auxiliary query to determine the distinct character frequency of a column. Is it an already existent table with a good amount of data, you can use it directly as a basis of your analyzis. If not, you only can rely on experiences or use a table with a similar content.

SELECT chars, 
   count( DISTINCT substr( PTitle, 1, chars ) ) number, 
   round( count( DISTINCT substr( PTitle, 1, chars ) ) / total * 100, 2 ) cover
FROM post
JOIN (
   SELECT @x:= @x + 1 chars
   FROM information_schema.tables
   JOIN (
      SELECT @x:= 0
   )X
   LIMIT 32
)Y
JOIN (
   SELECT COUNT(*) total FROM post
)Z
GROUP BY chars;

I use the same trick again here to generate a virtual table with information_schema.tables, I already mentioned in the article about efficient pagination with MySQL. For my weblog, the result looks like this:

+-------+--------+--------+
| chars | number | cover  |
+-------+--------+--------+
|     1 |     18 |  31.58 |
|     2 |     36 |  63.16 |
|     3 |     45 |  78.95 |
|     4 |     48 |  84.21 |
|     5 |     51 |  89.47 |
|     6 |     51 |  89.47 |
|     7 |     54 |  94.74 |
|     8 |     55 |  96.49 |
|     9 |     55 |  96.49 |
|    10 |     55 |  96.49 |
|    11 |     57 | 100.00 |
|    12 |     57 | 100.00 |
|    13 |     57 | 100.00 |
|    14 |     57 | 100.00 |
|    15 |     57 | 100.00 |
|    16 |     57 | 100.00 |
|    17 |     57 | 100.00 |
|    18 |     57 | 100.00 |
|    19 |     57 | 100.00 |
|    20 |     57 | 100.00 |
|    21 |     57 | 100.00 |
|    22 |     57 | 100.00 |
|    23 |     57 | 100.00 |
|    24 |     57 | 100.00 |
|    25 |     57 | 100.00 |
|    26 |     57 | 100.00 |
|    27 |     57 | 100.00 |
|    28 |     57 | 100.00 |
|    29 |     57 | 100.00 |
|    30 |     57 | 100.00 |
|    31 |     57 | 100.00 |
|    32 |     57 | 100.00 |
+-------+--------+--------+
32 rows in set (0.02 sec)

The inner query could be simplified even more by using the new ROW_NUMBER() function, I've introduced with my Infusion UDF:

SELECT ROW_NUMBER() chars
FROM information_schema.tables
LIMIT 32

Anyway, as you can see in the result, an index with the size of 11 would be entirely sufficient. In order to leave a little room for growth, we take 13:

ALTER TABLE post DROP INDEX PTitle, ADD INDEX(PTitle(13));

However, an index with 13 bytes is clearly too much, considering the total number of 57 articles. I recently had the problem of having to take a 19-byte index for a JOIN, with each table of several million records - which was extremely slow in the end. Another problem are also columns that are not as generic as a title. What if the diversity of a column is only expressible by the length or if the string only differs in a few characters? We use a hash function. MySQL actually has only two hash functions, SHA and MD5. The output of MD5 is a 128-bit string, that are 32 characters in hex, and 16 bytes as binary. So we add a new column to the table to store the hash value:

ALTER TABLE post ADD PHash BINARY(16) NOT NULL, ADD INDEX(PHash);

And write future posts like this into the table:

INSERT INTO post SET PTitle='...', PText='...', PHash=unhex( md5( PTitle ) );

With a small data set as my blog post table collisions, however, are quite unlikely, so the result of the checksum function CRC32() would also be sufficient in order to reduce the column and index size ones more. Unfortunately, MySQL doesn't have a 64 bit hash function, but Maatkit's attached UDF implements a FNV algorithm, which I used as the basic idea for the FNV() function of my Infusion UDF.