Wednesday, April 26, 2006

Java Source File Encoding

Some time ago we switched our CVS server from Debian to RedHat and ran into issues with file name encodings. Now we have switched to a new build machine as well, changing platforms from Windows to RedHat.

The Ant build process had run in a Cygwin environment on Windows, so most scripts could be used on the new machine without changes. So far the move had gone pretty smoothly.

However, our application suddenly failed to resolve some of the I18N text values that are stored in property files. Not all of them, but some were messed up in the GUI, as if the resource's key could not be found in the property file.

It turned out to affect only keys that contained German umlauts. While this is not exactly good style, it had worked up to now, so we wondered what might have caused it to fail. After a while we compared the .class files of an affected dialog (one produced on the Windows build server, one on the Linux build server) and indeed found a difference in the binary representation of the string constant that holds the resource key.

As most of our developers use Eclipse on Windows, the source files were saved in Win1252 encoding (Eclipse's default on that platform). In 1252 the German umlauts are single bytes in the range between 128 and 255. When compiling on Windows, javac assumed the platform's default encoding, which is also 1252, so the bytes were decoded correctly and the intended characters ended up in the class file.

However, on RedHat the compiler did not interpret the .java files as Win1252 encoded (who would have guessed) but as something else, probably UTF-8, as this is RedHat's default for all locale settings. So in the resulting .class files the umlauts were represented differently, while the property files went into the final JAR unchanged. At runtime the program looked for the key as found in the class and of course did not find a matching entry in the .properties file, so the GUI was not displayed as expected.
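To make the mismatch concrete, here is a small sketch of what happens when the same bytes are decoded with two different charsets. The key "grösse" and the class name are made up for illustration, and UTF-8 is only our guess at what the compiler used on RedHat.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class EncodingMismatchDemo {

    // Decodes raw source bytes the way a compiler would when it assumes the given charset.
    static String decode(byte[] sourceBytes, Charset charset) {
        return new String(sourceBytes, charset);
    }

    public static void main(String[] args) {
        Charset win1252 = Charset.forName("windows-1252");

        // Hypothetical resource key with an umlaut, written as a Unicode escape
        // so that this demo does not depend on its own file encoding.
        String key = "gr\u00f6sse";
        byte[] onDisk = key.getBytes(win1252); // roughly what Eclipse on Windows saved

        Properties texts = new Properties();
        texts.setProperty(key, "Size");

        String asWindows = decode(onDisk, win1252);              // Windows build
        String asUtf8 = decode(onDisk, StandardCharsets.UTF_8);  // assumed RedHat build

        System.out.println(texts.getProperty(asWindows)); // prints "Size"
        System.out.println(texts.getProperty(asUtf8));    // prints "null"
    }
}
```

The single byte for the umlaut is not valid UTF-8, so the decoder substitutes a replacement character and the lookup key no longer matches the one in the property file.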

Luckily, javac offers a command-line switch to specify the encoding of the source files. We added the "-encoding Win1252" option, which solved the problem.
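The effect of the switch can be tried out in isolation with the compiler API that ships with the JDK. The following is only a sketch: the class UmlautKeys and its key are invented, it needs a JDK rather than a plain JRE, and it passes "windows-1252" (the charset's standard name) instead of the "Win1252" spelling from our build.

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

final class SourceEncodingDemo {

    // Writes a tiny class to a temp directory as windows-1252 bytes, compiles it
    // with the given -encoding value and returns the constant read back from the
    // compiled class. The temp directory is left behind for inspection.
    static String compileAndReadKey(String encodingOption) {
        try {
            Path dir = Files.createTempDirectory("encoding-demo");
            Path source = dir.resolve("UmlautKeys.java");

            // Hypothetical class; the umlaut becomes the single byte 0xF6 on disk.
            String code = "public class UmlautKeys {"
                    + " public static final String KEY = \"gr\u00f6sse\"; }";
            Files.write(source, code.getBytes(Charset.forName("windows-1252")));

            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            if (javac == null) {
                throw new IllegalStateException("No system compiler found (running on a JRE?)");
            }
            // Same arguments as on the command line.
            int exitCode = javac.run(null, null, null,
                    "-encoding", encodingOption, "-d", dir.toString(), source.toString());
            if (exitCode != 0) {
                throw new IllegalStateException("Compilation failed, exit code " + exitCode);
            }

            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Field key = loader.loadClass("UmlautKeys").getField("KEY");
                return (String) key.get(null);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String key = compileAndReadKey("windows-1252");
        System.out.println(key.equals("gr\u00f6sse")); // prints "true"
    }
}
```

With the encoding stated explicitly, the result no longer depends on the locale of the machine that runs the build.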

Even though this works, it shows the bug-producing potential that non-ASCII characters have in software development. We will now go through the source and change the occurrences of umlauts to their ASCII replacements, just to be sure.
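Finding those occurrences by hand is tedious, so a small helper along the following lines might do the searching. It is a rough sketch, not the tool we actually used: the class and method names are invented, and it expects the file content to be already decoded into a String (reading the file with the right charset is left to the caller).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NonAsciiFinder {

    // Returns one "line:column U+XXXX" entry for every character above 127.
    static List<String> findNonAscii(String text) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int column = 1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                line++;
                column = 1;
                continue;
            }
            if (c > 127) {
                hits.add(String.format(Locale.ROOT, "%d:%d U+%04X", line, column, (int) c));
            }
            column++;
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two lines of made-up source text; only the second contains an umlaut.
        String sample = "String a = \"name\";\nString b = \"gr\u00f6sse\";\n";
        for (String hit : findNonAscii(sample)) {
            System.out.println(hit); // prints "2:15 U+00F6"
        }
    }
}
```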
