Sunday, October 29, 2006

Upgrade to Edgy Eft

Yesterday I read about the final release of Ubuntu 6.10, Edgy Eft. Given my good experiences with Dapper Drake I decided to upgrade. Having a bit of a Debian history I was quite confident that a dist-upgrade would work flawlessly, especially since I had not made any deep modifications to my system (definitely a point for Ubuntu here! :)).

Preparing

I went to the Ubuntu homepage and read the Upgrade Notes. I had always wondered - though never really bothered to find out - what this alternate install CD was for. Now I know: it is used to save bandwidth during the upgrade process, because it can serve as a local repository source. I do wonder, however, why this cannot be done with the regular install media. But anyways...
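Presumably the upgrade tool just registers the CD as a local apt source, with a sources.list entry along these lines (the exact disc label here is my guess, not copied from the tool):

```
deb cdrom:[Ubuntu 6.10 _Edgy Eft_ - Alternate i386]/ edgy main restricted
```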

Upgrading

I ran the CD based upgrade as described in the upgrade notes. A graphical tool came up and asked whether I wanted to download more recent versions of packages from the net. I said yes, suspecting there would not be many of them, as the whole distribution had just come out. However, after analyzing my system for a while it told me that about 250MB of newer packages would have to be downloaded. I aborted at this point, because I had a bad feeling about being one of several thousand users hitting the repository servers.

I re-ran the tool and this time said "no" when it asked whether I had a cheap/fast internet connection. The updater still claimed it needed about 250MB of data, but I suspected this was just a badly formulated message that appeared no matter what. So this time I let it go ahead (acknowledging the "point of no return" warning).

A progress bar showed up with a label claiming 1117 packages needed to be fetched. The first 900 or so went very fast, seemingly from the mounted CD image. But then things started to get ugly. A look at the process list revealed that apt was happily downloading packages from the net at an astounding 7 to 15 kB/s(!)... Nothing fancy of course... Just OpenOffice, some GTK libraries, several dictionaries, the GIMP help files in German and English and so forth. All in all the upgrade from Dapper to Edgy took me around 9 hours, 8.5 of which were spent on downloads I had tried to avoid in the first place.

Afterwards it occurred to me that the alternate install CD simply did not contain everything I had installed, so I guess the tool had no choice but to download those packages. In that case, however, I would have liked a DVD image to download via BitTorrent first.

Manual cleanup

Once the upgrade tool was done, it rebooted the machine. When the Grub menu came up I had to choose the new kernel manually, because of my earlier manual modifications to the boot configuration. That was OK. However, for some reason I got a text-mode boot process where I would have expected a nice usplash screen. Looking at the log later revealed a complaint about a missing configuration for 640x480. I have two DFPs, both 1280x1024, so I do not know why it would try the lower resolution. I added vga=794 to the kernel line in menu.lst to resolve this. Once it worked I could see some nice artwork :)
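For reference, the relevant entry in /boot/grub/menu.lst ended up looking roughly like this (the kernel version and root device are placeholders; the actual change was only appending vga=794, which selects a 1280x1024 framebuffer mode):

```
title  Ubuntu, kernel 2.6.17-10-generic
root   (hd0,0)
kernel /boot/vmlinuz-2.6.17-10-generic root=/dev/hda1 ro quiet splash vga=794
initrd /boot/initrd.img-2.6.17-10-generic
```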

Next I logged in and found myself unable to start a simple gnome-terminal. Choosing it from the Applications menu made a "Starting terminal..." entry appear in the task bar for a few seconds, but no terminal opened. Running xterm worked, however. A quick web search almost immediately turned up this entry in the Ubuntu Forums. Apparently it has something to do with the X11 configuration. Strangely enough this had always worked with Dapper.

What did work, however (and without any further ado), was video playback, even with correct colors. I had not found the time to dig deeper into my wrong-video-colors problem, and it seems that will not be necessary any more :)

Another thing I found to be not working was Azureus. It came up with its splash screen and almost immediately terminated again. Starting it from a terminal brought this up:

ds@yavin:~$ azureus 
changeLocale: *Default Language* != English (United States). Searching without country..
changeLocale: Searching for language English in *any* country..
changeLocale: no message properties for Locale 'English (United States)' (en_US), using 'English (default)'
#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGSEGV (0xb) at pc=0xb0527d02, pid=8613, tid=3085334192
#
# Java VM: Java HotSpot(TM) Client VM (1.5.0_08-b03 mixed mode, sharing)
# Problematic frame:
# C  [libglibjni-0.4.so+0x8d02]
#
# An error report file with more information is saved as hs_err_pid8613.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
Aborted (core dumped)
ds@yavin:~$

I found some people on the net with the same problem (see here and here), though they at least managed to get a stack trace. I decided to try the "official" Azureus version from Sourceforge instead: I backed up /usr/share/java/Azureus2.jar and replaced it with Azureus2.5.0.0.jar. This solved the problem for me, but it is not something I would have expected from a final release. This is not some obscure feature misbehaving, but the whole app not coming up...

Next steps

Next thing I'll try is setting up a 3D desktop environment. I will probably go with the description in this forum entry. I will keep updating as I go...

Friday, October 27, 2006

Flash 9 beta in Ubuntu Dapper

Maybe I shouldn't take vacations anymore... This time I caught a nasty flu two days after my return. Well, I am slowly feeling better and thought I might just tell you that the installation of the Flash Player 9 beta for Linux worked like a charm on my Ubuntu Dapper Drake (6.06) machine.

I just downloaded the archive from Adobe Labs, uninstalled the previous version using apt-get and put the new libflashplayer.so file into my private plugin directory. Apart from the

ds@yavin:~$ sudo apt-get remove flashplugin-nonfree

everything is described in the readme.txt file that comes included in the archive.

It is really great to have synchronous audio/video playback for the first time under Linux. About time, but hey, thanks anyway :)

Saturday, October 14, 2006

MySQL 5.0: DECIMALs queried with Strings

We are currently preparing a MySQL 4.1 to MySQL 5.0 migration. First tests showed a very nasty problem, however.

One of our test cases runs queries against DECIMAL columns using strings as the queried values. In MySQL 4.1 this works flawlessly; in 5.0 it does not. The reason is that, in contrast to 4.1, the newer server version does a (in my opinion very stupid) conversion from string to double, which in many cases cannot precisely represent the value.

This may lead to very subtle bugs, especially when using an optimistic locking approach as we do. We only noticed the problem because we got a ConcurrentModificationException: an update query containing a stringified BigDecimal did not match any rows.
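The effect can be reproduced outside MySQL. Here is a minimal Python sketch of what happens when a decimal string is routed through a binary double (the value is made up; the mechanism is the one described in the bug reports below):

```python
from decimal import Decimal

# A DECIMAL(17,7) value as our application would send it: as a string.
sent = "1234567890.1234567"  # made-up value for illustration

exact = Decimal(sent)              # what MySQL 4.1 effectively compared
via_double = Decimal(float(sent))  # 5.0 routes the string through a double

# A double only carries ~15-16 significant decimal digits, so the
# round-trip is lossy and an equality comparison misses the row.
print(exact == via_double)  # False
print(via_double)           # shows the extra garbage digits
```

The same comparison against the exact decimal value matches fine; it is only the detour through a double that breaks it.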

See MySQL bug reports 23260 and 22290 for more details.

Right now this leaves us with few options but to postpone the migration to 5.0: our application has several hundred thousand lines of code, and while most of the database access is handled by an OR mapping layer, there are also numerous cases of hand-crafted SQL which would be hard to identify and analyse individually.

What I absolutely do not get is that with the introduction of precision math they also started using floats and doubles in these comparisons, whereas most people asked for precision math precisely because it should make monetary calculations more reliable. Instead, all of a sudden existing applications are likely to exhibit all sorts of weird problems, from calculation errors to completely different behaviour (see above). Answering complaints about this with a simple "this is documented behaviour" is a bad excuse if you ask me.

I do really like MySQL, it is a great product and it has served me well for years. I have always appreciated the involvement of the community very much, however cases like this may be what (at least partially at this time) makes the difference between the "really big players" and MySQL.

Monday, October 02, 2006

MySQL replication timeout trap

Today I spent several hours trying to find a problem in our application until I found out there was a problem on the MySQL side. In our setup we have several slave machines that replicate data from a master server. The slaves are configured with a series of replicate-do-table directives in the my.cnf file so that only parts of the schema get replicated. The remaining tables are modified locally, so to avoid conflicts they are not updated with data coming from the master.

We do however have the need to replicate data from the master for some special-case tables. To solve this we usually have a column that indicates whether a record was created on a master or a slave machine and use an appropriate WHERE clause in all updates. This avoids collisions in concurrent INSERTs on the master and the slave. The application knows which of the two records to use.
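Sketched as SQL (table and column names are made up for illustration), the pattern looks like this:

```
-- origin marks where a row was created: 'M' on the master, 'S' on a slave
UPDATE orders
   SET status = 'shipped'
 WHERE order_id = 4711
   AND origin = 'S';   -- a slave only ever touches its own rows
```

Since the master and the slaves each update only rows carrying "their" origin value, concurrent INSERTs and UPDATEs on both sides cannot collide.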

Due to historical reasons I do not want to elaborate on we did not use such a distinction column for one table (let's call it T1). Instead we created a new table T2 that stores data generated on the slave. T2 is only written to when the slave is separated from the network. As soon as it gets reconnected, the data from T2 is sent to an application server which merges it with the master's T1 table.

This usually ensures that T1 is up to date on the slave, too (with some seconds lag, of course), through replication.

However a customer noticed that records that were inserted into T2 and later sent to the application server did not show up in the slave's T1 table, even after several minutes, leaving the application with very out-of-date information.

Assuming some sort of replication error I connected to the slave and issued a SHOW SLAVE STATUS command (uninteresting lines dropped from the output):

*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: MASTER001
            Master_Log_File: MASTER001.009268
        Read_Master_Log_Pos: 68660485
             Relay_Log_File: SLAVE005-relay-bin.000071
              Relay_Log_Pos: 27920448
      Relay_Master_Log_File: MASTER001.009268
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
         Replicate_Do_Table: ...
Replicate_Wild_Ignore_Table: ...
                 Last_Errno: 0
               Skip_Counter: 0
        Exec_Master_Log_Pos: 68660485
            Relay_Log_Space: 162138564
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)

As you can see, both the Slave_IO and Slave_SQL threads are running. So no replication problem here, everything is fine. Or so it seems.

Suspecting a problem with our application code I went through it line by line, because at certain points it issues STOP SLAVE and START SLAVE commands. However, I could not find anything unexpected.

What made me curious, though, was re-running SHOW SLAVE STATUS several times over the course of a few minutes: it did not show any change in the master or relay log positions. This effectively means that no updates were being replicated from the master, even though everything appeared to be alright.

Even then I did not have the right idea but suspected some sort of bug in the old 4.1.12 release of MySQL we were using. But upgrading the slave to the latest 4.1.21 release did not solve the problem either. It could still easily be reproduced by unplugging the network cable, creating some data in T2, reconnecting and waiting: the data showed up in T1 on the master, but not on the slave.

I only got it when I saw that a simple STOP SLAVE; START SLAVE fixed the problem. (I know, I should have had this idea earlier...)

The reason for this strange behaviour is the default setting of the slave-net-timeout variable (3600s). Once a slave has connected to its master it waits for data to arrive (Slave_IO_State: Waiting for master to send event). If it does not receive any data within slave-net-timeout seconds it considers the connection broken and tries to reconnect. To prevent frequent reconnects when there is little activity on the master, this setting defaults to one hour.

The slave does not, however, reliably notice a disconnect that occurs less than one hour after the last replicated statement. That is exactly what happened: unplugging the network cable broke the master-slave connection, but the slave did not notice and therefore still displayed a "no problems" status. Had I waited for an hour, I would have seen the data arrive alright...

So to get quicker updates I just had to decrease the timeout value in the slave's config file:

slave-net-timeout = 30
master-connect-retry = 30

Because in our case the activity on the master side is quite heavy, 30 seconds without any updates is most certainly an indication of a dropped network connection. With these settings the slaves notice a broken connection within half a minute and immediately try to reconnect. If the network is still down they will keep trying at 30 second intervals (master-connect-retry).
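The failure mode is easy to model: a slave that only infers a broken link from silence cannot distinguish a pulled cable from an idle master until slave-net-timeout expires. A small Python sketch of the logic as I understand it (this is illustrative, not MySQL code):

```python
SLAVE_NET_TIMEOUT = 30  # seconds, as configured above

def link_suspect(last_event_at, now, timeout=SLAVE_NET_TIMEOUT):
    """The slave only assumes the connection is dead after `timeout`
    seconds of silence; before that, a silently dropped link looks
    exactly like an idle master."""
    return now - last_event_at > timeout

# Cable pulled right after an event arrived at t=100:
print(link_suspect(100, 115))  # False - 15s of silence, status still "fine"
print(link_suspect(100, 135))  # True  - 35s of silence, the slave reconnects

# With the old default of 3600s, the same outage goes unnoticed
# for up to an hour:
print(link_suspect(100, 2000, timeout=3600))  # False
```

With the lower timeout the window in which SHOW SLAVE STATUS lies about a dead link shrinks from an hour to half a minute.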

So lesson learned: do not trust the output of SHOW SLAVE STATUS unconditionally!