Sunday, April 09, 2006

Disaster Recovery

What's the difference between a backup and a restore? Right:

  • A backup is made to preserve a known-good state for later use, after something has gone wrong (a so-called "disaster").

  • A restore is that backup being re-installed after something has gone wrong (so-called "disaster recovery").

So far so good. But what about a restore that is applied when nothing has gone wrong and the server is just running nicely, handling lots of requests... Call that a "home-made disaster". Yes: THAT'S REALLY BAD!

Last week we suddenly got lots of failure reports from one of our production systems (decentralized structure). People complained that all clients were failing with seemingly random error messages. We first took the stack traces and looked at the code in question to find out why it might suddenly start failing in places that had been working fine for months. We quickly realized that data corruption must be the cause. So the next step was to run some integrity checks on the database. They revealed that several hundred thousand records had vanished from a very important table (and, as we only found out later, from other places as well).
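
To give an idea of what such a check can look like, here is a minimal sketch; the table and column names are made up for illustration and are not our real schema. It simply counts the rows of the important table and looks for child rows whose parent record no longer exists:

    # count the rows in the important table
    mysql -u root -p mydb -e "SELECT COUNT(*) FROM important_table;"

    # look for orphaned child rows whose parent has vanished
    mysql -u root -p mydb -e "
      SELECT COUNT(*) AS orphans
      FROM child_table c
      LEFT JOIN important_table p ON p.id = c.important_id
      WHERE p.id IS NULL;"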

As our application does not provide any means of deleting records from that table, apart from deleting something else that has delete-cascade rules attached, we suspected either a hardware malfunction (RAID controller, RAM, etc.) or a severe bug in the application that had triggered a lot of cascading deletes.

In the end our theories turned out to be wrong, but let me elaborate a little further.

The first thing we did was take the application server and the database offline and create a full copy of the MySQL data directory and the binlogs (the binary representation of every transaction that changed data in any way), just to be sure we could analyze them later on. Then we began restoring the last full backup of the database from the night before. Importing this SQL dump took about two hours. After that we tried to apply the binlogs in the correct order, following the instructions from the MySQL documentation. Basically all you do is pipe the binlog files through the mysqlbinlog tool and from there into the mysql command line client: the first step produces regular SQL statements, which then get executed against the database.
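
In shell terms the whole procedure boils down to something like the following; database name, credentials and binlog file names are placeholders, not our real setup:

    # restore the full SQL dump from the night before (about two hours for us)
    mysql -u root -p mydb < full_backup.sql

    # then replay the binlogs in order: mysqlbinlog decodes each binary log
    # into plain SQL, which the mysql client executes against the database
    mysqlbinlog binlog.000042 | mysql -u root -p mydb
    mysqlbinlog binlog.000043 | mysql -u root -p mydb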

Unfortunately this did not work very well. After a few seconds the database started to complain about lots of foreign key violations, because it could not find the rows being referenced. It took us (and MySQL support) some time to find out that the mysqlbinlog tool did not emit a vital "set names utf8;" statement at the beginning of its output. That led to the data being interpreted as some variant of latin1, which in turn caused the broken references.

At that point we had to restore the full backup again (another two hours of waiting), because we could not know how much of the binlogs had already been applied before the first visible errors occurred. In the meantime we created one SQL file per binlog and prepended each with the "set names utf8;" statement, to make them easier to replay later.
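
Roughly like this (file and database names again being placeholders):

    # decode every binlog into its own SQL file, with the missing
    # charset statement prepended, so it can be replayed on its own later
    for f in binlog.0000*; do
        { echo "SET NAMES utf8;"; mysqlbinlog "$f"; } > "$f.sql"
    done

    # replaying a single prepared file is then simply
    mysql -u root -p mydb < binlog.000042.sql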

When the full backup had finally been restored (again), we began applying the incremental binlogs (again...), this time without any problems. We restored the database to a point in time about half an hour before the first error reports. Just to be sure we took another full backup after that; in the meantime production had been resumed using some stand-alone systems that could perform the most critical tasks and bought us some time.
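
For the last binlog before the incident, mysqlbinlog's --stop-datetime option lets you cut off the replay at a given point in time; a sketch of that step (the timestamp and file name are purely illustrative):

    # apply the final binlog only up to roughly half an hour
    # before the first error reports came in
    { echo "SET NAMES utf8;"; \
      mysqlbinlog --stop-datetime="2006-04-03 12:00:00" binlog.000050; } \
      | mysql -u root -p mydb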

Still not knowing what had caused the problem (and still suspecting a cascading delete), we started rolling further forward, running the consistency checks again after each single binlog-SQL file. We were very confident the error had to be contained in the binlogs, because several read-only replication slaves had obediently recreated the corrupted database state.
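
The roll-forward itself is just a loop; consistency_check.sh here is a hypothetical stand-in for whatever integrity checks you run, and the credentials are assumed to come from ~/.my.cnf so the loop does not prompt every time:

    # apply one prepared binlog-SQL file at a time and re-check after each
    for f in binlog.0000*.sql; do
        mysql mydb < "$f"
        if ! ./consistency_check.sh; then
            echo "corruption first appears after $f"
            break
        fi
    done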

We finally found the offending binlog; after it had been applied, the consistency check failed. We took a closer look at the SQL generated from that binlog, hoping to find the delete statement that had triggered the cascade. What we found, however, was not a simple delete statement, but a much worse "set foreign_key_checks=0; drop table ..." statement. This really shocked us, because nothing like it appears anywhere in the application code.
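
Finding the statement in the decoded SQL was then simply a matter of searching the text (the file name is again a placeholder):

    # look for suspicious statements in the decoded binlog
    grep -n -i -E 'drop table|delete from|foreign_key_checks' binlog.000051.sql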

It turned out that somehow a restore script had been triggered by accident (we are still investigating the cause, but suspect some sort of scheduled task that called the wrong script) and had begun restoring a full backup in the middle of the day, knocking the application server off its feet.

In the end we were able to restore the data to a point in time about three minutes before the crash and to identify what needed to be recreated manually.

Lesson learned: Do not place the restore scripts on your production servers by default; only bring them there when you really need them.
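
A complementary safeguard, purely as an illustrative sketch and not something we had in place, would be a guard at the top of the restore script itself; the hostname is of course hypothetical:

    #!/bin/sh
    # refuse to restore on a production host unless explicitly forced
    if [ "$(hostname)" = "db-prod-01" ] && [ "$1" != "--force" ]; then
        echo "This looks like a production host; re-run with --force if you really mean it." >&2
        exit 1
    fi
    # ... actual restore steps would follow here ...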
