Jazzy Coding

The Art of Structured Improvisation. By Carsten Nielsen

Tag Archives: big data

When Giants Step On Your Shoulder

Recently I had a job to migrate a MySQL-database of about 20M entries and a guessed size of about 10GB. At this project I stumbled over some issues regarding the speed, memory and disk-space on the servers. This article will hopefully give you some hints about tweaking a (MySQL-) migration process to make it faster and easier.

The database was used by a forum and some of the tables had to be converted and remapped to make them work with the new forum-application. Remapping means in this case for example, that we had to feed three tables of the new database with the composed and manipulated content of five tables of the old database.

One of the biggest quests was to keep the downtime (or read-only-time) as short as possible – so here are some experiences with MySQL in this case.

The migration-process itself had the following dead-simple-structure:

1. Set the production-system to read-only-mode
2. Copy the actual database to the new server
3. Run the migrations for the database

My first idea was to use PDO. I created a simple class and some functions to read the old data, convert it and store it in the new database.
Here is a sketch of the class:

class migrateJob
  public function __construct(array $source, array $target){}

  public function convertToHtml($text){}

  public function migrateUsers(){}
    echo "Importing Users:\n";

    //Setup some default-values for the user in the new db
    $user = array(
      //some default values for the user-dataset

    //Load the user info from two tables of the old database
    $old_data = $this->source->query("SELECT * FROM old_user_table, old_profile_table
                                      WHERE old_user_table.id = old_profile_table.user_id");

    //Prepare the storing statement
    $user_query = $this->target->prepare("REPLACE INTO new_user_table (".implode(",", array_keys($user)).")
                                          VALUES (:".implode(",:", array_keys($user)).");");

    //Iterate through the results and store them
    echo "\nStarting user-migration.\n";
    while($old_user = $old_data->fetch(PDO::FETCH_ASSOC))
      //Create a new user-array from the defaults.
      $new_user = $user;

      //Do the conversions
      $new_user['bio'] = $this->convertToHtml($old_user['profile']);

      //Copy the old values to the new user
      $new_user['id'] = $old_user['id'];
      $new_user['name'] = $user['login'];
      //... some more re-mappings

      //Store the new user
        //maybe nice to have a counter and some status output here.
        $err = $user_query->errorInfo();
        echo "\nError: ".$err[2]."\n";
    echo "\nDone.\n";

With this class the different migrations should run quite easy – I was partly wrong. When running this on the user-tables (about 50k entries) everything went fine. but when I tried to run the same on the forum-posts-table with about 20M entries, the process silently got slower and slower.
Hmmm. I got nagged by a memory leak in my PDO-implementation? I tried different things to get rid of it, but didn’t find any solution for this.

Next try was to create a script which used standard php-mysql-module. Okay, this should be simple. Connect to the database, read the data convert and save it.
I started old-school-style:

$old_db_connection = mysql_connect("localhost", $source_db['username'], $source_db['password'], TRUE);
$result = mysql_query("SELECT id,user_id,user_name,body,title,topic_id FROM posts", $old_db_connection);

$new_db_connection = mysql_connect("localhost", $target_db['username'], $target_db['password'], TRUE);

while($old_data = mysql_fetch_assoc($result))
  //do some conversions to the old _data and store it
  mysql_query("INSERT INTO posts ...", $new_db_connection);

Dammit! Same problem here: The queries got slower and slower.

A closer look at the problem shows that the reading process of the database tries to load the complete select-result-data into the memory and blocks the process sooner or later. An iterating “LIMIT 0,1000” reduces this a bit but then every new SELECT has to search the index to get the next position, which takes more and more time too.

The fix for this is mysql_unbuffered_query(). This fetches the result without buffering the large result-set automatically and saves a considerable amount of memory.

To speed up the query even a bit more we write-lock the table before sending any query, which prevents MySQL to check the integrety of the indexes while I’m processing the tables:

mysql_query('LOCK TABLES posts WRITE',$old_db_connection);

That seems to be everything  we can do for the read-process. Another try shows that the process – even with this leightweight optimization – takes hours. Then let’s check the write-actions too.

The writing process has nearly the same issues as the reading-queries. The used memory is rising query after query and to add new rows requires more and more time when MySQL tries to ensure the integrity of the indexes.

After some testing I found a fix for this – but that’s not that simple. The fastest option to add a large amount of data seems to be to pass a SQL-file with SOURCE directly to the database. This has the positive effect, that I can put several INSERT-queries into one statement – making the insert-process even faster.

INSERT INTO posts (id, user_id, body, title, date) VALUES
(1, 12, "Here the body", "Here the title.", "2011-01-05"),
(2, 14, "Here the next body", "Here the next title.", "2011-04-15"),

I splitted the INSERT-queries into chunks of 25 queries per file, every query inserting a maximum of 2.000 rows = 50.000 inserted rows per file.

I wanted to launch the import-process directly from the command-line so that I’m able to raise the max-allowed-package size so that MySQL don’t tries swap to the disk during the operations.
I stored the commands in a file pointing to the generated files:

SOURCE /path/to/generated-file-00.sql
SOURCE /path/to/generated-file-01.sql

Now we can import our stuff in the script

system("mysql --max_allowed_packet=256M -u 'username' -p'password' 'new_db' < import.sql");

As those are MyISAM we can even speed up the things a bit more by switching off the indexing. This option is only available if you have root access to the data-files of the MySQL-server.

stop mysql
cd /var/lib/mysql/new_db
myisamchk --keys-used=0 -rq posts.MYI
start mysql

After the migration process we have to switch it on again, which can take some time as the indexes are completely rebuild:

stop mysql
cd /var/lib/mysql/new_db
myisamchk -rq posts.MYI --sort_buffer_size=2G
start mysql

As you can see the process brought some special problems up – I think the mixed up parts of setting up this process by tweaking the queries, the database-connection and the server itself is somewhat picky.
I really wonder how this would look like with PostgreSQL! (@Magnus Hagander!)

After this the complete migration looks like this:

1. Lock the source table
2. Select the old data by using mysql_unbuffered_query().
3. Store the converted data as multiple-inserts into sql-files.
4. Create a sql-file that uses SOURCE to import the sql-files.
5. Deactivate the indexing of the MyISAM-tables
6. Pass the this sql-file with a raised –max_allowed_packet=256M to the mysql-client
7. Activate the indexing of the MyISAM tables again.

Good Luck with your next migration!