Wikipedia is the greatest encyclopedia which has ever existed, because everyone can contribute to the massive knowledge corpus. Analyzing this data with computers is becoming more and more indispensable, as nobody can survey the information by hand anymore. In order to work with the data, we have to import it into MySQL and here is how it works.
I'll show how to do this on Debian Jessie, but it should be easily adaptable to other distributions.
Installing Percona Server
If you already have a proper MySQL set-up, you can skip this section. I use the Percona Server - a MySQL Fork with performance in mind - almost everywhere as it has nice features built in, like Handlersocket or TokuDB and a well tuned InnoDB engine. But, let's get started.
Install Percona release information
wget https://repo.percona.com/apt/percona-release_0.1-3.jessie_all.deb dpkg -i percona-release_0.1-3.jessie_all.deb
Install MySQL server
apt-get update apt-get install percona-server-server-5.7
Add the following lines into the
[mysqld] section of
innodb_file_per_table collation_server = utf8_general_ci character_set_server = utf8 skip-character-set-client-handshake
Download the german Wikipedia corpus:
or similarly the english:
MWDumper, which we use to translate the XML files to actual SQL statements:
Log in into MySQL and create a new database:
CREATE DATABASE wiki;
Now download and install the table schema from my gist. There is an official mediawiki table layout, which however had errors, like
ERROR 1054 (42S22) at line 84: Unknown column 'page_counter' in 'field list':
wget https://gist.githubusercontent.com/infusion/3c5007c73410b3fea3de76a10628c31e/raw/8fef537d997088f16a12296e2db64588cbb5c571/create-mediawiki.sql mysql wiki < create-mediawiki.sql
And finally import the actual Wikipedia data:
bunzip2 -c enwiki-latest-pages-articles-multistream.xml.bz2 | java -jar mwdumper.jar --format=sql:1.5 | mysql wiki
Finally you can remove enwiki-latest-pages-articles-multistream.xml.bz2 to clean up:
You might also be interested in the following
- SQL-modes right to exist
- MySQL Infusion UDF for statistical analysis
- MySQL - My new playground
- Optimized Pagination using MySQL
Sorry, comments are closed for this article. Contact me if you have an inventive contribution.