Importing entire Wikipedia into MySQL

Wikipedia is the greatest encyclopedia that has ever existed, because everyone can contribute to its massive knowledge corpus. Analyzing this data with computers is becoming more and more indispensable, as nobody can survey the information by hand anymore. To work with the data, we first have to import it into MySQL, and here is how that works.

I'll show how to do this on Debian Jessie, but it should be easily adaptable to other distributions.

Installing Percona Server

If you already have a proper MySQL set-up, you can skip this section. I use Percona Server - a MySQL fork built with performance in mind - almost everywhere, as it has nice features built in, like HandlerSocket or TokuDB, and a well-tuned InnoDB engine. But let's get started.

Install Percona release information

wget https://repo.percona.com/apt/percona-release_0.1-3.jessie_all.deb
dpkg -i percona-release_0.1-3.jessie_all.deb

Install MySQL server

apt-get update
apt-get install percona-server-server-5.7
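
A quick sanity check that the server is installed and running - assuming the Debian packages registered the service as mysql, which they usually do:

# check the installed server version and that the service is up
mysqld --version
service mysql status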

Add the following lines to the [mysqld] section of /etc/mysql/my.cnf:

innodb_file_per_table
collation_server        = utf8_general_ci
character_set_server    = utf8
skip-character-set-client-handshake
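
Restart MySQL afterwards so the new settings take effect, and verify them - a minimal check, assuming you can connect as root like in the rest of this article:

# restart and confirm the character set defaults are active
service mysql restart
mysql -e "SHOW VARIABLES WHERE Variable_name IN ('character_set_server', 'collation_server');"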

Downloading Wikipedia

Download the German Wikipedia corpus:

wget https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles-multistream.xml.bz2

or, similarly, the English one:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
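
The download is huge, so it is worth testing the archive integrity before spending hours on the import:

# verify the bzip2 archive is not corrupted (this can take a while)
bunzip2 -t enwiki-latest-pages-articles-multistream.xml.bz2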

Now download MWDumper, which we use to translate the XML files to actual SQL statements:

wget https://dumps.wikimedia.org/tools/mwdumper.jar
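
MWDumper needs a Java runtime; if none is installed yet, the default JRE from the Debian repositories should be enough:

# install a headless Java runtime if java -version reports nothing
apt-get install default-jre-headless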

Log in to MySQL and create a new database:

CREATE DATABASE wiki;
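
Alternatively, a shell one-liner that pins the character set explicitly instead of relying on the server defaults configured above:

# create the database with an explicit character set and collation
mysql -e "CREATE DATABASE wiki DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;"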

Now download the table schema from the official MediaWiki repository and load it into the new database:

wget <raw_link> -O create-mediawiki.sql

mysql wiki < create-mediawiki.sql
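
Before starting the long import, a quick check that the schema is actually in place:

# page, revision and text are the tables mwdumper will fill
mysql wiki -e "SHOW TABLES;" | grep -E '^(page|revision|text)$'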

And finally import the actual Wikipedia data:

bunzip2 -c enwiki-latest-pages-articles-multistream.xml.bz2 | \
      java -jar mwdumper.jar --format=sql:1.25 | mysql wiki
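
The import can take quite a while; if you want a progress indicator, pv (apt-get install pv) can be put in front of the same pipeline:

# identical pipeline, but with a progress bar based on the compressed file size
pv enwiki-latest-pages-articles-multistream.xml.bz2 | bunzip2 -c | \
      java -jar mwdumper.jar --format=sql:1.25 | mysql wiki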

Once the import has finished, you can remove enwiki-latest-pages-articles-multistream.xml.bz2 to clean up:

rm enwiki-latest-pages-articles-multistream.xml.bz2

If you see an error like ERROR 1054 (42S22) at line 84: Unknown column 'page_counter' in 'field list' or similar, you should double-check whether the --format=... parameter still matches the schema you installed.
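
To compare the columns of your installed schema with the INSERT statements mwdumper generates, you can inspect the table directly:

# page_counter, for example, no longer exists in newer MediaWiki schemas
mysql wiki -e "SHOW COLUMNS FROM page;"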

2 Comments on „Importing entire Wikipedia into MySQL”

Robert

Hi Ronald,

Interesting idea. Maybe I'll post a follow-up article with some queries I ran on the dataset. So far I have only imported the German version, as the English Wikipedia wouldn't fit on my in-house servers. I have a test cluster here, but it is running a lot of other jobs, and for my purpose this is enough. Doing the same with the English Wikipedia is only a matter of scaling.

Some stats if you're curious anyway:

The compressed file I downloaded was 4 GB. The queries generated from the XML dump came to about 19 GB. The size of the database folder created by MySQL is about 25 GB, with the "text" table accounting for 23 GB and holding 3,832,251 rows. The "revision" table is about 1.9 GB and the rest is the "page" table (3,832,251 rows as well). As far as I know, the German Wikipedia is the second largest after the English one, but still a factor of 3 smaller. So multiplying the numbers above by 3 should give a good estimate of the expected amount of data.
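
(For reference, numbers like these can be pulled straight out of information_schema with something like:)

# per-table size and row estimates for the imported data
mysql wiki -e "SELECT table_name, table_rows,
                      ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
               FROM information_schema.tables
               WHERE table_schema = 'wiki'
               ORDER BY size_gb DESC;"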

Quite a lot of the tables that exist after running the MediaWiki create statements are empty. If you're interested in filling those other tables, you would need to import additional dump files. What I described in the article is the data for the actual articles, which is the biggest and most interesting portion.

Ronald Bradford

Could you share some of your findings after loading the wikipedia data?

How much disk space was needed, what are the varying table sizes?
What interesting queries can you run?

 
