Thursday, January 26, 2006

CVS: Encoding mixup

Having completed the migration of our CVS server, I am now slowly getting around to tying up the loose ends. One very annoying thing was the now mixed encoding of log messages in the CVS histories. Comments that were saved before the migration now show up with little squares instead of the German umlaut characters, whereas only the new ones (UTF-8) are displayed correctly.

This would not be too much of a problem, one might say, because the older the comments, the less they are needed as development goes on. That's what I thought, too, until I tried to get the MySQL-based commit database of ViewVC running. Setting up the schema was no problem; however, I had to make several columns wider than the default, because they were too small for our needs.

While fixing that I upgraded from MySQL 4.0 to 4.1, not expecting too much trouble. Already anticipating problems with older libraries and/or Python bindings, I disabled the password for the user updating the database. So far it worked, but as soon as ViewVC tried to insert anything into the database, lots of warnings concerning unknown character sets were issued. I had set the default character set of the MySQL server to UTF-8, because that's what we use on all our servers.

After some googling around I found myself building a new version of the Python MySQLdb module. That at least got rid of the annoying warnings all over the place.

However, I also had to adapt the script, because when I tried to import all the existing commits into the database, it would insert the raw UTF-8 byte sequences of the newer comments directly into the database, making a mess of those. While I was willing to accept the distorted look of the old comments, I certainly was not willing to accept it for the more current ones. So after some more googling and fiddling with the Python code, I finally managed to get all of them right by first trying to UTF-8-decode the messages that come from CVS's file history output. If they only contain ASCII characters, nothing happens. If I have old comments that contain German umlauts, I get a UnicodeDecodeError, which I catch, and I just insert the comment into the database as is. And, finally, the new comments containing UTF-8 sequences get cleanly decoded and inserted correctly into the database. Although I do not like the idea of flow control by exceptions very much, I accept it, because I guess it will just be a helper construct to get the old comments into the DB correctly.
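The decode-and-fall-back idea described above can be sketched roughly like this. This is not the actual change to the ViewVC script, just a minimal illustration in modern Python 3 (the original work was done with the Python of 2006). The function name `decode_log_message` is made up for this example, and the fallback encoding `latin-1` is only an assumption about what the old comments were stored in. Instead of inserting the old bytes as is, the sketch decodes them with that assumed legacy encoding, which is a small variation on what I did.

```python
def decode_log_message(raw: bytes, fallback: str = "latin-1") -> str:
    """Return a log message as text, trying UTF-8 first.

    `decode_log_message` is a hypothetical helper, not part of ViewVC.
    `fallback` is an assumed legacy encoding for pre-migration comments.
    """
    try:
        # Pure ASCII and valid UTF-8 messages both decode cleanly here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Old comments with umlauts in a single-byte encoding end up here.
        return raw.decode(fallback)


if __name__ == "__main__":
    samples = [
        b"plain ASCII comment",
        "neue \u00c4nderung".encode("utf-8"),    # new comment, UTF-8
        "alte \u00c4nderung".encode("latin-1"),  # old comment, assumed Latin-1
    ]
    for raw in samples:
        print(decode_log_message(raw))
```

One caveat of this approach: a legacy byte sequence that happens to be valid UTF-8 would be decoded as UTF-8 without raising an error. For typical German text with single umlauts that is unlikely, but it is not impossible.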

Now the only thing I have to do is generate some nice looking changelogs from the database :)

If anyone's interested in the change I made to the ViewVC script, just post a comment and I will provide a few lines.
