Thursday, March 08, 2007

"Streaming" MySQL slave deployment

In August 2006 I made a post about the MySQL blackhole engine and its possible use for replication scenarios.

Back then the primary focus was to reduce the amount of data being transferred between the master and possibly many slaves during normal operation. This speeds up processing on the slaves, because they have considerably less data to first store to their relay logs and then immediately discard again.

What still takes a lot of time is the initial setup of a slave system. This is because there is a potentially large SQL based dump file to be executed on maybe relatively low-end systems (e. g. Celeron CPUs and single IDE hard drives). While this can be made to work reliably with a set of scripts to automate the process of downloading a zipped SQL file from a local server and running it against the clients' databases, it still is a lengthy process. One setup I know about needs somewhere in the vicinity of 30 minutes to set up a slave. So I was looking for a way to speed it up.

The solution is based on the fact that you can quite easily copy MySQL's data files (including InnoDB table space files) from one machine to another. So my idea was to first create a ready-to-go set of slave data files and transfer them to multiple slaves. Usually those files will be bigger than the SQL dump used previously, but no parsing or processing is required on the target system. This makes the whole thing mostly bound by network bandwidth and linear disk I/O.

For the impatient: I was able to get the setup time down to around 3 minutes on the same hardware. Read on if you want to know how.

The basic idea behind this concept is to relieve the individual slaves from importing SQL files themselves. As their hardware is rather slow, this can be very time consuming. So why not have a single "power-slave" - one with more capable hardware - handle the import and then provide any number of identical slaves with its data files? The remaining work on each slave is then reduced to unpacking an archive.

While this does not really save anything in terms of bytes that need to be written to each slave's disk, the access pattern is much more sensible. The following table compares the disk transfers of a "regular" SQL import from local disk with the proposed alternative, for a 60MB gzipped SQL file that results in approximately 2GB of InnoDB table space files:

Regular import                                                                        | Prepared data files
linear write of 60MB (download of the SQL.gz to local disk)                           | download of 60MB, streamed directly into linearly written 2GB data files
linear write of 2GB (initial creation of the InnoDB data files)                       | n/a
linear read of 60MB SQL.gz, interleaved with 2GB of random writes to the data files   | n/a
roughly 4120MB read and written in total, mostly randomly                             | roughly 2GB, written linearly

Importing an SQL file from the local hard disk means constant seeking between the current position in the SQL text file and the server's data files. Moreover, as the database schema may define lots of indexes, even simple INSERTs cause plenty of additional random write activity.

In contrast, unpacking ready-made InnoDB table spaces (or MyISAM table files, for that matter) is basically just linear writing. Even simple IDE drives can handle that pretty fast these days.

How to...

As the process of creating a suitable SQL dump is rather simple, I will start from there. These are the steps, taken in sequence:

  1. Set up a fresh temporary MySQL daemon with a configuration similar to the slaves'

  2. Dump the data from the master with the --master-data option and feed it into the temporary server

  3. Shutdown the temporary server and compress its data

  4. Transfer the archive to the slaves and unpack

  5. Adapt the slaves' config files and make some minor adjustments

  6. Run the slaves and let them connect and catch up with the master

I will go into a little more detail on the points above now.

Temporary Daemon

While we do not want every slave to process the SQL dump individually, it of course still needs to be executed at least once. So we set up a MySQL daemon with a stripped-down configuration that is close to that of the actual slaves. Close in this case means that all parameters affecting the storage files must match the slaves' settings, because the files must be usable there later.

Anything concerning buffer sizes etc. can be different, of course.

Every time you want to prepare a new slave installation "image", the temporary daemon should be started with an empty data directory. This is particularly important for InnoDB data files, as those are a little more sensitive to being moved around. While it is not strictly necessary to delete the table space and transaction log files every time, I prefer to do so, because it allows for better compression rates later if you do not need all the available space.
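Just to illustrate, such a throw-away instance could be started roughly like this (the paths, the socket and the port are merely placeholders for this sketch):

# start from a clean, empty data directory
rm -rf /tmp/slave-image/data
mkdir -p /tmp/slave-image/data
mysql_install_db --datadir=/tmp/slave-image/data
# run the temporary daemon on its own socket and port
mysqld_safe --defaults-file=/tmp/slave-image/my.cnf \
            --datadir=/tmp/slave-image/data \
            --socket=/tmp/slave-image/mysql.sock \
            --port=3307 &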

When using InnoDB (as I usually do) there are two strategies for configuring the size of the table space:

  • Create the files with a size that is definitely sufficient to hold all the data you import from the master, plus a little reserve. This approach requires you to know fairly precisely how much data the files will have to hold, so as not to oversize them too much. This may (but need not) include data you know will be added later by replication.

  • Create the files close to the size that will actually be needed - maybe a little more to avoid the need for autoextension. In this case you should nevertheless declare the last data file as autoextending; otherwise the import of the SQL data may fill the table space prematurely, which is especially undesirable in an automated process.

I usually use the second approach. The table space I define is normally enough, but sometimes - due to fluctuations in the live data - it needs to be autoextended somewhat. I recommend configuring InnoDB to add larger chunks than the default 8MB to the autoextending file when needed. As extending the files involves some overhead, I normally use 50MB or 100MB; the bigger this value, the less often InnoDB will have to extend the file. See the manual section on InnoDB configuration for more on this.
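For illustration, the storage-related part of such a stripped-down my.cnf might look roughly like this (the file names and sizes are only examples, not values from any particular setup):

[mysqld]
# table space layout - must match what the slaves are going to receive
innodb_data_file_path = ibdata1:1024M;ibdata2:512M:autoextend
# extend the last file in 100MB steps instead of the default 8MB
innodb_autoextend_increment = 100
# keep this identical on the slaves, too, since the ib_logfiles travel with the archive
innodb_log_file_size = 64M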

Dump master data

Once the temporary daemon is running, use the mysqldump tool with the --master-data and --single-transaction options to create a dump of the database(s) you need replicated. To save time and disk space you may find it useful to pipe the output directly into the mysql command line client and feed it into the target server, instead of writing a dump file first. As I usually filter out tables that are not needed on the slaves, the dump would not be good for backup purposes anyway.
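As a sketch, the whole pipe could look like this (host name, credentials, database and table names as well as the socket path are placeholders):

mysqldump --host=the.master.server --user=dump_user --password=the_password \
          --master-data --single-transaction \
          --ignore-table=mydb.some_unneeded_table \
          --databases mydb \
  | mysql --socket=/tmp/slave-image/mysql.sock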

Shutdown & Compress

Now shut down the temporary server and compress its data directory. Depending on how you want to handle permissions you may include or exclude the "mysql" schema. I usually set up the temporary server with as few privileges as possible and do not move the "mysql" schema along.

For this approach the ZIP format is unsuitable: to save disk I/O I do not want to first download the archive to the slave and uncompress it in a second step, and a ZIP archive stores the list of contained files at its end, so it cannot be piped through a decompression program on the fly. Instead I use a gzipped tarball, which allows the download to be piped through gunzip before the data is directed to disk.
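Shutting down and creating the archive might then look similar to this (the paths are again placeholders; the mysql schema is left out as described above):

mysqladmin --socket=/tmp/slave-image/mysql.sock shutdown
cd /tmp/slave-image/data
tar czf /var/www/html/mysql_data.tgz --exclude=mysql .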

Transfer to a slave and uncompress

This is where it becomes interesting. Up to now we have merely set up a mysql daemon and imported a dump into it. Tarring up the data dir was not very special either.

On the slave I suggest curl as a download tool. I have not tried wget, but that might work as well. However, it is important that the tool you choose can write the downloaded file directly to standard output; otherwise it could not be piped into the tar command. With curl that is quite simple, because it is the default behavior. Curl also handles files larger than 2GB, which wget had some problems with, if I remember correctly. The command line should look similar to this:

curl http://the.server/mysql_data.tgz | ( cd /the/target/datadir ;  tar xzf - )

Curl downloads the file and pipes it into tar for decompression. Because tar unpacks into the current directory (unless told otherwise via its -C option), you may either switch to the designated target directory before downloading or - as I do - invoke tar in a sub-shell that first changes into it. The latter has the advantage that the main script handling the slave setup does not have to jump around the filesystem.

You will find that on a local area network downloading and unpacking is considerably faster than having MySQL first create empty data files and then import the SQL, for the reasons stated above.

Adjust slave configuration

When the data files have reached their destination on the slave, you may need to adjust the slave settings a little. This depends especially on whether you copied fixed-size data files - in which case you can prepare the config file in advance - or used the autoextend option on the last table space file. As the latter applies to me, I wrote a little script that takes a template my.cnf file and replaces some placeholders via sed. One of those is the size of the last InnoDB data file from the archive, which becomes a fixed-size file on the slave. Another file is then added at the first slave start.

E. g. if the archive from the temporary daemon contains two files ibdata1:1024M;ibdata2:1024M:autoextend I use a template line like ibdata1:1024M;ibdata2:#IBD2SIZE#M;ibdata3:512M:autoextend and replace the #IBD2SIZE# part with the actual size of the file in the archive. This could be done even more dynamically by building the whole line via script, so you do not have to know in advance how many files of which sizes you will get from the master.
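A rough sketch of that dynamic variant, assuming the script runs inside the freshly unpacked data directory and the template carries an #IBDATA_LINE# placeholder for the complete innodb_data_file_path value (that placeholder and the extra file name are made up for this example; the size calculation uses the same wc -c trick explained below):

# build the innodb_data_file_path value from the unpacked files
line=""
for f in ibdata*; do
  size=$( wc -c < "$f" | awk '{ print $1 / 1024 / 1024 }' )
  line="${line}${line:+;}${f}:${size}M"
done
# append one more file that is allowed to grow
line="${line};ibdata_extra:512M:autoextend"
sed -i -e "s/#IBDATA_LINE#/${line}/" /etc/my.cnf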

Be careful not to use the du command to determine the file size for the config file: du reports the amount of space allocated on disk, which will usually differ from the file's logical size. This is what I use instead:

cp my.cnf.tmpl /etc/my.cnf
# size of ibdata2 in megabytes (wc -c reports the logical file size in bytes)
ibd2size=$( wc -c ibdata2 | awk '{ print $1 / 1024 / 1024 }' )
sed -i -e "s/#IBD2SIZE#/${ibd2size}/" /etc/my.cnf

First the template gets copied to the place where the daemon startup scripts expect it (be sure the slave server is not running).

Second, wc -c reports the number of bytes the ibdata2 file really occupies. awk then divides this number to arrive at megabytes, the unit needed in the config file.

Finally the placeholder is substituted with the number just stored in the ibd2size variable.

See 14.2.7. Adding and Removing InnoDB Data and Log Files in the manual for more details on this process.

One more thing needs to be addressed only if the machines involved run on different platforms, e. g. the data files were prepared on Windows and the slaves run on some flavor of Unix. A look at the relay-log.index and relay-log.info files reveals that, when created on Windows, they contain backslashes in their path names. This prevents a Unix mysqld from initializing replication properly. So if you are dealing with such a situation you will need to correct those, e. g. using sed again:

# turn Windows backslashes into forward slashes
sed -i -e 's|\\|/|g' relay-log.in*

If master and slaves are running on the same platform this is not required.

Once all that is done, have a final look at the file permissions, especially on Unix systems, and make sure the daemon will be able to access the data files.

You can now start the slave daemon. If everything is configured appropriately it should come up and await connections.
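On my systems this boils down to something like the following (the user, group and name of the init script vary between distributions):

chown -R mysql:mysql /the/target/datadir
/etc/init.d/mysql start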

Connect to master

One last thing remains: reading the master's current binlog file name and position from the master.info file. This is necessary because, once the slave server has been started, you still need to provide the correct credentials for the replication user and tell the slave which master host to connect to. Unfortunately, issuing a CHANGE MASTER TO statement that specifies a master hostname discards all information about previous master binlogs and the position up to which they have already been applied. From the manual:

If you specify MASTER_HOST or MASTER_PORT, the slave assumes that the master server is different from before (even if you specify a host or port value that is the same as the current value.) In this case, the old values for the master binary log name and position are considered no longer applicable, so if you do not specify MASTER_LOG_FILE and MASTER_LOG_POS in the statement, MASTER_LOG_FILE='' and MASTER_LOG_POS=4 are silently appended to it.

(see manual chapter 13.6.2.1. CHANGE MASTER TO Syntax)

So you will need to tell the slave again where to begin replicating.

One possible solution is to read the contents of the master.info file that was brought along with the data files into an array and inject the values into the statement:

# master.info (MySQL 4.1 and later): the first line holds the number of lines,
# followed by the master log file name and the read position
arr=( $(cat master.info) )
mysql -e "CHANGE MASTER TO master_host='the.master.server', master_user='replication_user', master_password='the_password',
master_log_file='${arr[1]}', master_log_pos=${arr[2]}"

The format of the master.info file is described in the MySQL manual.

Final checks

This should be it. As soon as you issue a START SLAVE statement, the slave should connect to the master and begin to catch up with whatever has happened since the dump was taken. If SHOW SLAVE STATUS tells you that replication is not running, have a look at the daemon's error log for a hint about what is going on.
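For a quick check from the shell something like this will do (assuming the mysql client picks up its credentials from ~/.my.cnf; the interesting columns are Slave_IO_Running and Slave_SQL_Running):

mysql -e "START SLAVE"
mysql -e "SHOW SLAVE STATUS\G" \
  | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error"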