Tuesday, September 25, 2007

The value of a CVS commit database

Due to some discrepancies between the Eclipse 3.2.2 compiler and Sun's javac we needed to upgrade our development environments to Eclipse 3.3. Otherwise we could not tell for sure that something that looked ok in Eclipse would compile in the daily build process.

Even though I had been using 3.3 privately for some time, there is always some tension involved when switching a whole bunch of developers in a really large project.

At first everything seemed fine, apart from some minor issues that could be easily worked around.

However, I ran into a nasty little bug in the CVS integration when I had to switch a workspace that had been checked out on HEAD over to another branch. That branch had been created to keep the helpers and utilities refactoring I wrote about before separate from HEAD until it is complete.

Within Eclipse you can just select "Switch to another branch or version" from the context menu on any project (or file, for that matter) and pick the branch you would like. I had done this with 3.2.2 several times without any problems. So, suspecting nothing, I switched my local working copy to the branch and, after some more refactoring, started checking in modified files.

However, shortly after I had done that, a colleague complained that there were now compile errors on HEAD. It turns out that Eclipse 3.3.0 has a well-hidden bug in that feature: the branch switch involves some requests to the CVS server. This works fine for files that have no changes in the working copy; for files with outgoing changes, however, the server response is not handled correctly and those files remain on HEAD. Because I had already made some changes before the switch, part of my changes went to the branch and the rest went to HEAD, leaving both in an incomplete state. For details on the bug see Eclipse Bug #192392.

The files I had checked in spanned several projects and were of course committed in little chunks with different comments. At that point I was very glad that I had my ViewVC commit database to query for anything I had done over the last few hours that had gone to HEAD. Without it, it would probably have taken me hours just to find out which files I had checked in on the wrong branch. While it was still tedious work to actually restore everything to the state I wanted, identifying the affected files took no more than a rather simple SQL query.
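The query was essentially along these lines. Table and column names are made up for this example and do not necessarily match the actual ViewVC database schema:

select file, type, commit_time, description
  from commits
 where author = 'myuser'                          -- hypothetical user name
   and branch = 'HEAD'
   and commit_time > now() - interval 6 hour      -- everything from the last few hours
 order by commit_time;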

I can only advise anyone working on a project with more than just a few files to set up a database that stores all commits by type (addition, change, deletion), file, branch, date and author. Not only was it a lifesaver in this case; in combination with a full text index on the commit comment field it is also a very good basis for change logs - using simple SQL they can be generated very flexibly and within just the blink of an eye.
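Just to illustrate the change log idea: with a full text index on the comment column, queries like the following become possible (again using the made-up schema from above, with a made-up branch name):

select commit_time, file, description
  from commits
 where branch = 'RB_2_5'                          -- hypothetical release branch
   and match(description) against('refactoring')  -- requires a full text index on description
 order by commit_time;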

The version of ViewVC we use is rather old and contains some custom changes that probably would not be required with a more recent release. So I recommend taking a look at the current version the project offers.

Thursday, September 20, 2007

Helpless Helpers and Useless Utilities

With any code base of a reasonable size there are lots of issues you would normally take care of immediately when you come across them; often, however, there is just no time for it. In the end you have to live with the knowledge that you left some ugly hacks in just to meet the deadline.

Because we have recently finished development of the next major release of our software product, there is now some time to do code cleanup and get some more automated tests on the way. Since one of the bugs that almost prevented us from keeping our schedule was a particularly nasty - but well hidden - one, there has (again) been some discussion about coding guidelines and quality.

People always seem to agree that you need to talk to each other, think in larger terms than just your specific problem at hand, and strive for code readability and re-usability. For starters I personally would sometimes settle for even just a little more of the first one...

The bug I mentioned above was in a method to determine whether a given date is a holiday or a business day. Because this is not a trivial problem to solve generically for several countries, we rely on a database table that is pre-populated with the holidays for the next few years in a given country. Telling whether or not a date is a holiday is then just a database lookup. Nothing can really go wrong here, can it? The method in question was implemented like this:

public static boolean isHoliday(Date aDate) {
   if (getRecordFromHolidayTableFor(aDate) == null) {
      return true;
   }
   return false;
}

Of course there was a corresponding method for business days, too:

public static boolean isWorkday(Date aDate) {
   Calendar cal = Calendar.getInstance();
   cal.setTime(aDate);
   int day = cal.get(Calendar.DAY_OF_WEEK);
   if (day == Calendar.SATURDAY || day == Calendar.SUNDAY) {
      return false;
   }
   return isHoliday(aDate);
}

Just in case you do not know: Saturday is usually considered a business day in Germany, and for our customer it definitely is. So this part of the latter method alone is at the very least somewhat dubious.

By now you will probably have stopped looking at the code, trying to figure out what you are missing that would make it work correctly. You are right to do so, because it is of course just plain wrong. The isHoliday method returns the exact opposite of what you would expect. And the isWorkday code - apart from the wrong DAY_OF_WEEK condition - even leverages this and in fact works (almost) correctly.

One might ask how this could go unnoticed until the last moment of development, and I can tell you the answer is not pleasant: it didn't!

There were about 20 calls to the holiday method. 19 of these always inverted the result! The author of the 20th call had relied on this seemingly simple method to just work and only found out about the bug when he accidentally connected to a database with an empty holidays table and still got true out of it.

Both of the above methods were part of a class called "DateUtil". Apparently someone had had the idea of providing common date functions with this class, to spare everyone from writing their own methods to manipulate and use dates.

While this, of course, is generally a good idea, it would have been even better to provide some thorough testing code along with it, because even in the seemingly simplest of methods there is a sufficiently high chance of making a mistake while implementing it. One could point a finger at the author of that class, because he introduced the bug in the first place. But I would rather have 19 more fingers to point at each and every one who noticed the problem and just worked around it in his or her private piece of code, telling no one about the problem and letting everyone find out for him- or herself again. I will not even try to estimate how many hours were spent uselessly because of this...

But it did not end there. Alerted by this incident, I decided to have a look at the project source tree. In the course of development I had already seen classes like StringUtil, CalendarUtil and others, so I just ran find workspace/ -regex ".*(Util|Utility|Utilities|Helper)\.java" to look for classes matching that pattern. I ended up with over 160 helpers and utilities. Judging by their names alone, many of them obviously duplicated functionality: several PersistenceUtilitys, StringHelpers and the like. Looking through them showed lots of very similar methods, re-implementing the same task over and over again.

Because practically none of these were covered by (J)Unit tests, but had only been tested empirically in the precise context of their callers, you could not simply go ahead and redirect calls to a similar-sounding method in a more central place - or vice versa - which made almost all of them virtually useless from a more general perspective.

So the whole idea of creating utility and helper classes had in fact been reduced to absurdity. Right now we are in the process of looking through all of them and consolidating them into a handful of really helpful helpers and useful utilities - each of them accompanied by a set of unit tests to prevent another "holiday is a business day" problem in the future. We will see what other gems bubble up and see the light of day in the process...

Tuesday, September 11, 2007

Hard disks are too big

I remember when I got my first hard drive. It was part of the first PC machine I ever had, a 386DX40. Before that I owned an Atari 800XL and an Amiga 500. While there were hard drives for the Amiga I couldn't afford one at the time. The 386 came with a 52MB Quantum. Back then I had to decide whether I wanted to have Windows 3.0 on the disk or Wing Commander 2.

While I'm not really too eager to go back to those times, I recently noticed (again) that the vast amounts of disk space we have today are, in a manner of speaking, not always a good thing.

Over the years I owned several computers, with each new one came a new, bigger hard drive. As I usually sold the old machine - and the smaller hard drive with it - I just copied (almost) everything into some sort of "old_disk" folder on the new machine. When the next upgrade came, I had (of course) not reviewed everything in that folder and so, just to be on the safe side, just put it into the next "layer of archive" directory.

Currently I am using a Pentium IV machine with three hard drives: an older IDE drive with 80GB, one SATA drive with 250GB and a third one with 400GB, also SATA. When I first installed Ubuntu in the summer of 2006 (Dapper Drake) I decided to give it the whole 80GB drive. To make space I just moved all the Windows data on that disk to the 250GB one. (Guess what the folder was called... "80gb_disk".)

Now, a little more than a year later, Ubuntu has replaced Windows as my primary, everyday operating system. I still keep Windows around for the occasional game of Far Cry, but apart from that I do not really need it anymore. The last piece of hardware I cannot get to work with Linux is the CanoScan LiDE 70 scanner, but that will be replaced soon. Because the 80GB drive was filling up steadily, I decided to do it properly this time and dedicate the whole 400GB disk to the /home partition.

Of course I needed to get the data off there, but as it was more than the already crowded 250GB disk could take, I finally had no choice but to wade through the data and see what I could throw away.

It was just unbelievable. I found things I had completely forgotten about; letters, pictures, stuff from school and university I never thought I had kept in the first place. It was like going through those boxes that somehow seem to magically appear in the attic and the basement every time you move to a new place. Some of the things were really great to find - like meeting someone you haven't seen for some time.

However most of the space was just taken up by old Windows installations, "temporary" download folders and rotting user profiles. Do you know that feeling, when doing the occasional re-install of Windows, that you had better keep the old installation around somewhere, in case you need that one special settings file or have to take a look at the old registry again? Or that you should let that ISO image lie around and burn it to a CD or DVD later - and then never do it anyway? I found at least four Outlook PST files - apart from the last one I actually used, of course.

It took me more than 6 hours to go through it all and this time really delete stuff. Most of it was complete trash and absolutely useless. I still kept all of my documents, music files and other "handmade" stuff. In total I removed around 290GB of old cruft...

Imagine yourself back in the 386 days: would you have believed anyone, had they told you that one day you would have roughly six thousand(!) of those 52MB disks, filled up with nothing but useless junk? :-)

Sunday, September 02, 2007

ERROR 1033 (HY000) on InnoDB configuration error

One of the key features MySQL often uses to advertise its database is the modular architecture that allows different storage engines below the same SQL layer. In practice the application and/or database designer can choose from a variety of low-level data storage implementations, each offering different characteristics, on a per-table basis. (Even though I personally believe most designs will use one engine for all tables of a particular schema.)

The idea behind this is that, for example, people who do not need transactions should not have to worry about them at all - maybe there is a performance impact involved which they cannot afford to take. Moreover, some specialized types of index or column might not be available on all engines. Basically the concept is very interesting and can be really useful for developers.
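As a quick illustration, the engine is simply part of the table definition, so the choice can indeed be made per table (the tables here are of course made up):

create table accounts (
   id int primary key,
   balance decimal(10,2)
) engine=InnoDB;     -- transactional, supports foreign keys

create table page_hits (
   page varchar(255),
   hits int
) engine=MyISAM;     -- no transactions, but offers full text indexes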

However, there is a weakness that in my opinion urgently needs some work: the interface between the common SQL layer and the storage engines seems to be rather limited with respect to what storage engines can do to inform the layer above about status and error conditions.

For example, there is no (elegant) way to find out about the details of a constraint violation when using the InnoDB storage engine. While you will get an error message telling you that some statement failed due to a violation of referential integrity constraints, you have to use the generic "show engine innodb status" command to get any details. However, this will not only tell you about the error you care about at that particular moment, but will also give you lots of information on lots of other stuff inside InnoDB. That is necessary, because you do not have any other means of finding out about those internals - e.g. when you are investigating a performance problem.

From what I learned from a consultant some time ago, this is due to the limited interface specification through which MySQL itself (the upper layer) and the storage engines talk to each other. Because this protocol has to be somewhat generic, messages from the bottom to the upper levels have to be wrapped into some form of special result set which you then have to parse and understand on your own. Moreover, if memory serves me right, there is a limitation on how much data can be transferred at a time (it could be a limitation of the client as well). Because of this you will not even always get a full InnoDB status output: it will be truncated if it gets bigger than 64k.

While this is not particularly nice, it is a limitation I believe is acceptable, especially in the case of InnoDB, because the innodb_monitor feature allows you to get the full output into a log file.
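If I remember correctly, on the MySQL versions we are running the standard InnoDB monitor is switched on by creating a table with a magic name; InnoDB then periodically writes the full status output to the error log until the table is dropped again:

create table innodb_monitor (a int) engine=InnoDB;  -- starts periodic status output; the column definition itself does not matter
-- ...reproduce the problem, then...
drop table innodb_monitor;                          -- stops the output again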

What I consider much worse, however, is that error messages from the underlying storage engine are often mapped to more generic MySQL messages in an unpredictable way.

Time and again I have run into problems that present you with an error message that has nothing to do with the actual problem. For example, in a replication scenario you might get error message 1236, claiming that there is something wrong with replication position counters, when it turns out that this message can also mean a full disk on the master. If you know enough about the implementation details you might see how this message comes to pass, but when you are troubleshooting a production system this is not what you want to spend your time on. Moreover, I tend to forget these peculiarities if they occur seldom enough. Just recently I found a machine spitting out strange errors on each and every query I issued (InnoDB):

mysql> select count(*) from child;

ERROR 1033 (HY000): Incorrect information in file: './test/child.frm'
mysql> 

Now, just having this message, what would you expect is wrong here? File corruption? Broken SCSI controller? Faulty memory?

When you use Google to search for "ERROR 1033 (HY000)" you will get all sorts of results, most of them suggesting to try myisamchk (not very useful for InnoDB) or the REPAIR TABLE statement. Often you will find someone who claims that restoring from the latest backup might be the only option.

While all of this is certainly sound advice for the problems this error message was originally intended to report, in my case it was all just leading in the wrong direction.

It turned out that something was wrong with the my.cnf configuration file. This was on a machine set up using the "Streaming Slave Deployment" mechanism I described in an earlier article. For some reason the script that usually adapts the config file automatically after downloading the data files had not been started, so a default my.cnf was still in place.

Unfortunately the InnoDB data file sizes configured there did not match the files downloaded from the master server. This is what my.cnf contained:

innodb_data_file_path=ibdata1:512M;ibdata2:100M:autoextend

This is a listing of the data directory:

-rw-rw---- 1 mysql mysql 555745280 2007-08-31 20:26 ibdata1
-rw-rw---- 1 mysql mysql 104857600 2007-08-31 20:26 ibdata2
-rw-rw---- 1 mysql mysql   5242880 2007-08-31 20:26 ib_logfile0
-rw-rw---- 1 mysql mysql   5242880 2007-08-31 20:24 ib_logfile1
drwxr-xr-x 2 mysql mysql      4096 2007-08-31 20:27 mysql
drwx------ 2 mysql mysql      4096 2007-05-12 12:24 test 

It may not be obvious, but 555745280 bytes for ibdata1 is not 512MB, but 530MB (530 x 1024 x 1024 = 555745280, whereas 512MB would be 536870912 bytes). Nevertheless the MySQL server started even with this wrong configuration. However, every statement would fail with the message above.

Shutting down the server and correcting the line above to

innodb_data_file_path=ibdata1:530M;ibdata2:100M:autoextend

restored everything to a working state:

mysql> select count(*) from child;
+----------+
| count(*) |
+----------+
|        9 |
+----------+
1 row in set (0.03 sec) 

While I really like MySQL and generally find it rather easy to configure and to get very high performance out of, this is definitely a major weakness I would like to see improved in future versions. For the time being I will try to post about anything that strikes me as odd enough for someone else to be interested in, too :)