MySQL replication timeout trap
Today I spent several hours trying to find a problem in our application until I found out there was a problem on the MySQL side. In our setup we have several slave machines that replicate data from a master server. The slaves are configured with a series of replicate-do-table directives in the my.cnf file so that only parts of the schema get replicated. The remaining tables are modified locally, so to avoid conflicts they are not updated with data coming from the master.
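A minimal my.cnf fragment for such a partial-replication setup might look like the following; the database and table names are made up for illustration:

```ini
[mysqld]
# Only apply replicated statements for the shared tables;
# all other tables remain local to the slave.
replicate-do-table = mydb.customers
replicate-do-table = mydb.orders
```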
We do however have the need to replicate data from the master for some special-case tables. To solve this we usually have a column that indicates whether a record was created on a master or a slave machine and use an appropriate WHERE clause in all updates. This avoids collisions in concurrent INSERTs on the master and the slave. The application knows which of the two records to use.
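The distinction-column idea can be sketched like this; the table and column names are hypothetical, not taken from our actual schema:

```sql
-- Hypothetical sketch: each row records where it was created.
CREATE TABLE orders (
    id     INT NOT NULL,
    origin ENUM('master', 'slave') NOT NULL,
    data   VARCHAR(255),
    PRIMARY KEY (id, origin)
);

-- Statements running on the slave only touch rows created there,
-- so replicated statements from the master cannot collide with them.
UPDATE orders SET data = 'changed'
 WHERE id = 42 AND origin = 'slave';
```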
Due to historical reasons I do not want to elaborate on, we did not use such a distinction column for one table (let's call it T1). Instead we created a new table T2 that stores data generated on the slave. T2 is only written to when the slave is separated from the network. As soon as it gets reconnected, the data from T2 is sent to an application server which merges it into the master's T1. This usually ensures that T1 is up to date on the slave, too (with some seconds lag, of course), through replication.
However a customer noticed that records that were inserted into T2 and later sent to the application server did not show up in the slave's T1 table, even after several minutes, leaving the application with very out-of-date information.
Assuming some sort of replication error, I connected to the slave and issued a SHOW SLAVE STATUS command (I dropped the uninteresting fields from the output):
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: MASTER001
            Master_Log_File: MASTER001.009268
        Read_Master_Log_Pos: 68660485
             Relay_Log_File: SLAVE005-relay-bin.000071
              Relay_Log_Pos: 27920448
      Relay_Master_Log_File: MASTER001.009268
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
         Replicate_Do_Table: ...
Replicate_Wild_Ignore_Table: ...
                 Last_Errno: 0
               Skip_Counter: 0
        Exec_Master_Log_Pos: 68660485
            Relay_Log_Space: 162138564
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)
As you can see, both the Slave_IO and Slave_SQL threads are running. So no replication problem here, everything is fine. Or so it seems.
Suspecting a problem with our application code I went through it, line by line, because at certain points it issues STOP SLAVE and START SLAVE commands. However I could not find anything I would not have expected.
However, what I found when re-running SHOW SLAVE STATUS several times over the course of a few minutes made me curious: it did not show any changes to the positions in the master or relay log files. This effectively means that no updates are replicated from the master, even though everything appears to be alright.
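This "running but not advancing" state can be detected mechanically by sampling the status twice and comparing the replication coordinates. A minimal sketch in Python, assuming the vertical (\G) output format; how you actually obtain the output (client library, shell, etc.) is left open:

```python
def parse_slave_status(output):
    """Parse the vertical SHOW SLAVE STATUS\\G output into a dict."""
    status = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status

def is_stalled(first, second):
    """A slave that claims both threads are running but whose executed
    master log position never advances between two samples taken while
    the master is busy is likely stuck."""
    running = (second.get("Slave_IO_Running") == "Yes"
               and second.get("Slave_SQL_Running") == "Yes")
    advanced = ((first.get("Master_Log_File"),
                 first.get("Exec_Master_Log_Pos"))
                != (second.get("Master_Log_File"),
                    second.get("Exec_Master_Log_Pos")))
    return running and not advanced
```

Note that this heuristic only makes sense when you know the master is receiving writes; on an idle master an unchanged position is perfectly normal.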
Even then I did not have the right idea but suspected some sort of bug in the old 4.1.12 release of MySQL we are using. But upgrading the slave to the latest 4.1.21 release did not solve the problem either. It could still be easily reproduced by unplugging the cable, creating some data in T2, reconnecting and waiting for the data to be inserted into T1 on the master but not the slave.
I only got it when I saw that a simple STOP SLAVE; START SLAVE fixed the problem. (I know, I should have had this idea earlier...)
The reason for this strange behaviour is the default setting of the slave-net-timeout variable (3600s). Once a slave has connected to its master it waits for data to arrive (Slave_IO_State: Waiting for master to send event). If it does not receive any data within slave-net-timeout seconds it considers the connection broken and tries to reconnect. To prevent frequent reconnects when there is little activity on the master, this setting defaults to one hour.

However, the slave does not reliably notice a disconnect that occurs less than one hour after the last replicated statement. That is exactly what happened: unplugging the network cable broke the master-slave connection, but the slave did not notice and therefore still displayed a "no problems" status. Had I waited for an hour I would have seen the data arrive alright...
So to get quicker updates I just had to decrease the timeout value in the slave's config file:
slave-net-timeout = 30
master-connect-retry = 30
Because in our case the activity on the master side is quite heavy, 30s without any updates is plenty and most certainly an indication of a dropped network connection. With these settings the slaves notice a broken connection within half a minute and immediately try to reconnect. If the network is still down they will keep trying in 30-second intervals (master-connect-retry).
So lesson learned: do not trust the output of SHOW SLAVE STATUS unconditionally!