raw database

Fulltext GEO Search with MySQL

Robert Eisele

Probably the most debated topic when it comes to geo calculations is determination of distance between two points and to find locations within a certain radius. There are many other beautiful things that you can do with coordinates. I hope, that I can work towards the trend of focusing on only a small range of calculations if the name of the project is not quite google maps. But I have to do research for new calculations so that I will start with a more easy thing namely a fulltext search for user friendly location submission here. The big aim should be to provide a little search engine that can easily handle zip codes, citys and regions - conditional on what the user knows. It should work as, the more information the user enters, the more precise the search will be. In the following I will concentrate on the use of MySQL fulltext indexes, what does not mean, that you couldn't use Sphinx or Lucene for that job.

For simplicity, we only take two, in German speaking countries usual, hierarchies: cities / towns and federal states / cantons. These can be stored in the following normalized database structure:

CREATE TABLE federal (
  GFID smallint unsigned AUTO_INCREMENT KEY,
  GC_ID tinyint unsigned NOT NULL,
  GFLabel varchar(48) NOT NULL,
  KEY(GC_ID)
);

CREATE TABLE town (
  GTID smallint unsigned AUTO_INCREMENT KEY,
  GF_ID smallint unsigned NOT NULL,
  GTLabel varchar(48) NOT NULL,
  GTLat double NOT NULL,
  GTLon double NOT NULL,
  KEY GF_ID (GF_ID)
);

CREATE TABLE zip (
  GZID smallint unsigned AUTO_INCREMENT KEY,
  GT_ID smallint unsigned NOT NULL,
  GZLabel varchar(8) NOT NULL,
  KEY GT_ID (GT_ID)
);

Okay, I'm nice, I quickly drew an ERD for a faster understanding of the relationship between the relations:

To build a search index, we create an additional table in which we will write a optimized version of the search content.

CREATE TABLE lookup (
  GT_ID int(10) unsigned KEY,
  GTLabel varchar(96) NOT NULL,
  GTNDX text NOT NULL,
  FULLTEXT KEY GTNDX (GTNDX)
);

Now it is just to write the query to build up the table content from our raw data. It doesn't matter how the query looks like or how fast it is. The operation must be performed only once.

INSERT INTO lookup
SELECT GTID, IF(GTLabel <> GFLabel, CONCAT(GTLabel, ', ', GFLabel), GTLabel), 
CONCAT(
  IF(GTLabel <> GFLabel, CONCAT(GROUP_CONCAT(DISTINCT GTLabel SEPARATOR ' '), ' ', GROUP_CONCAT(DISTINCT GFLabel SEPARATOR ' 

')), GROUP_CONCAT(DISTINCT GTLabel SEPARATOR ' ')), ' ',
  GROUP_CONCAT(GZLabel SEPARATOR ' ')
)

FROM town
JOIN federal ON GFID=GF_ID
JOIN zip ON GTID=GT_ID
GROUP BY GTID

Actually, that was it. We are able to perform searches because the rest is done inside of MySQL using the fulltext extension of MyISAM. Let's write a little query:

SELECT GT_ID, GTLabel, MATCH (GTNDX) AGAINST ('hogwarts*') GTScore FROM lookup ORDER BY GTScore DESC

If you try a couple of search phrases it will state out, that the bigger the city the imprecise the result. But what is the reason, that a search for "Berlin" or "München" only finds towns around these major citys, which names are "XY bei München", "ABC bei Berlin" and so on? This is not a negligible problem, because most peoples live in larger towns or cities, so that the usability for the most of our users will be unacceptable. That's why we should try to optimize the ranking. The problem with big cities is, that they have a lot of zip codes which inflate the index and when it comes to the calculation of the overall rank by dividing by a length indicator, the value of larger cities shrinks down. But exactly what makes the problem can be used as an indicator for the population, because larger towns and citys have more than one zip. Hence we add a new column to our index table and delete the contents to become the same initial situation as above:

TRUNCATE lookup;
ALTER TABLE lookup ADD GTPop float NOT NULL;

Now we fill the index table again with the same query as above, only extended by the natural logarithm of the number of zip codes.

INSERT INTO lookup
SELECT GTID, IF(GTLabel <> GFLabel, CONCAT(GTLabel, ', ', GFLabel), GTLabel), 
CONCAT(
  IF(GTLabel <> GFLabel, CONCAT(GROUP_CONCAT(DISTINCT GTLabel SEPARATOR ' '), ' ', GROUP_CONCAT(DISTINCT GFLabel SEPARATOR ' ')), GROUP_CONCAT(DISTINCT GTLabel SEPARATOR ' ')), ' ',
  GROUP_CONCAT(GZLabel SEPARATOR ' ')
),
<strong>LN(COUNT(*) + 2)</strong>

FROM town
JOIN federal ON GFID=GF_ID
JOIN zip ON GTID=GT_ID
GROUP BY GTID

Now we have to extend the SELECT statemanent as well:

SELECT GT_ID, GTLabel, MATCH (GTNDX) AGAINST ('hogwarts*') <strong>* GTPop</strong> GTScore FROM lookup ORDER BY GTScore DESC

With the product of both, our population indicator and the match weight we give bigger cities their deserved weight to rank better than some one-horse towns. A nice spin-off for our new ranking is, that the ranking has also a positive impact for searches on federal states only. A search for "Bayern" returns the biggest citiys in that federal state: München, Nürnberg, Augsburg, ...