Monday, October 02, 2006

MySQL replication timeout trap

Today I spent several hours hunting for a bug in our application until I found out the problem was actually on the MySQL side. In our setup we have several slave machines that replicate data from a master server. The slaves are configured with a series of replicate-do-table directives in the my.cnf file so that only parts of the schema get replicated. The remaining tables are modified locally, so to avoid conflicts they are not updated with data coming from the master.
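A slave's my.cnf then contains filter directives roughly like the following (a minimal sketch; the database and table names are invented for illustration):

# replication filters on the slave: only the listed tables
# are replicated, everything else is maintained locally
replicate-do-table          = appdb.customers
replicate-do-table          = appdb.orders
replicate-wild-ignore-table = appdb.local_%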

We do, however, need to replicate data from the master for some special-case tables. To handle this we usually have a column that indicates whether a record was created on the master or on a slave machine, and use an appropriate WHERE clause in all updates. This avoids collisions between concurrent INSERTs on the master and the slave, and the application knows which of the two records to use.
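As a sketch (the table, column, and value names are invented for illustration), the approach looks like this:

-- hypothetical table with an "origin" distinction column
CREATE TABLE t_example (
    id      INT NOT NULL,
    origin  ENUM('master', 'slave') NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id, origin)
);

-- the slave only ever touches rows it created itself, so
-- statements replicated from the master cannot collide
UPDATE t_example
   SET payload = 'changed locally'
 WHERE id = 42 AND origin = 'slave';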

For historical reasons I do not want to elaborate on, we did not use such a distinction column for one table (let's call it T1). Instead we created a second table, T2, that stores data generated on the slave. T2 is only written to while the slave is disconnected from the network. As soon as it is reconnected, the data from T2 is sent to an application server, which merges it into the master's T1 table.

Through replication, this usually ensures that T1 is up to date on the slave, too (with a few seconds of lag, of course).
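Conceptually, the merge step boils down to something like the following (a simplified sketch; the column names and values are invented, and T2 is assumed to mirror T1's relevant columns):

-- 1) the application server reads the offline rows from the SLAVE:
SELECT id, payload FROM T2;

-- 2) ... and inserts them into T1 on the MASTER, from where they
--    replicate back into the slave's copy of T1:
INSERT INTO T1 (id, payload) VALUES (4711, 'created offline');

-- 3) finally, T2 is cleared on the slave:
DELETE FROM T2;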

However, a customer noticed that records inserted into T2 and later sent to the application server did not show up in the slave's T1 table even after several minutes, leaving the application with badly out-of-date information.

Assuming some sort of replication error, I connected to the slave and issued a SHOW SLAVE STATUS command (I dropped the uninteresting fields from the output):

*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: MASTER001
            Master_Log_File: MASTER001.009268
        Read_Master_Log_Pos: 68660485
             Relay_Log_File: SLAVE005-relay-bin.000071
              Relay_Log_Pos: 27920448
      Relay_Master_Log_File: MASTER001.009268
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
         Replicate_Do_Table: ...
Replicate_Wild_Ignore_Table: ...
                 Last_Errno: 0
               Skip_Counter: 0
        Exec_Master_Log_Pos: 68660485
            Relay_Log_Space: 162138564
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)

As you can see, both the Slave_IO and Slave_SQL threads are running. So no replication problem here, everything is fine. Or so it seems.

Suspecting a problem with our application code, I went through it line by line, because at certain points it issues STOP SLAVE and START SLAVE commands. However, I could not find anything unexpected.

However, what I found when re-running SHOW SLAVE STATUS several times over the course of a few minutes made me curious: the positions in the master and relay log files did not change at all. This effectively means that no updates were being replicated from the master, even though everything appeared to be all right.

Even then I did not have the right idea, but suspected some sort of bug in the old MySQL 4.1.12 release we were using. However, upgrading the slave to the latest 4.1.21 release did not solve the problem either. It could still easily be reproduced by unplugging the network cable, creating some data in T2, reconnecting, and watching the data get inserted into T1 on the master but not on the slave.

I only got it when I saw that a simple STOP SLAVE; START SLAVE fixed the problem. (I know, I should have had this idea earlier...)
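For the record, that is all it took on the slave:

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)

mysql> START SLAVE;
Query OK, 0 rows affected (0.00 sec)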

The reason for this strange behaviour is the default setting of the slave-net-timeout variable (3600 seconds). Once a slave has connected to its master, it waits for data to arrive (Slave_IO_State: Waiting for master to send event). If it does not receive any data within slave-net-timeout seconds, it considers the connection broken and tries to reconnect. To prevent frequent reconnects when there is little activity on the master, this setting defaults to one hour.
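You can check the current value on the slave like this:

mysql> SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| slave_net_timeout | 3600  |
+-------------------+-------+
1 row in set (0.00 sec)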

However, the slave does not reliably notice a disconnect that occurs less than one hour after the last replicated statement. That is exactly what happened: unplugging the network cable broke the master-slave connection, but the slave did not notice and therefore still displayed a "no problems" status. Had I waited for an hour, I would have seen the data arrive all right...

So to get quicker updates I just had to decrease the timeout value in the slave's config file:

slave-net-timeout = 30
master-connect-retry = 30

Because in our case the activity on the master side is quite heavy, 30 seconds without any updates is plenty, and almost certainly an indication of a dropped network connection. With these settings the slaves notice a broken connection within half a minute and immediately try to reconnect. If the network is still down, they keep trying at 30-second intervals (master-connect-retry).
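If you would rather not restart the server to try this out, slave_net_timeout can also be changed at runtime, and the retry interval can be set with CHANGE MASTER TO (a sketch, to be verified against your MySQL version):

SET GLOBAL slave_net_timeout = 30;

STOP SLAVE;
CHANGE MASTER TO MASTER_CONNECT_RETRY = 30;
START SLAVE;  -- restarting the threads makes sure the new timeout is picked up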

So lesson learned: do not trust the output of SHOW SLAVE STATUS unconditionally!

5 comments:

Anonymous said...

Dear, thanks a lot for this consideration, I found myself in the exact same situation. Helped a lot.

Thanks from Brazil.

Victor Diago
vdiago@gmail.com

Anonymous said...

Thanks a lot, it ended a headache of 2 days.

Anonymous said...

That is exactly what I searched for. Thanks a lot, Daniel!

Tuan Le said...

Hi,

Thank you for pointing out this issue. I am having it while deploying MASTER-SLAVE MariaDB in different regions on Amazon EC2.

Anonymous said...

Amazing! It works now... THANKS!