Sunday, April 20, 2008

Stale /etc/group.lock & /etc/gshadow.lock on RH9

Last week, for the first time in quite a long while, an automated update process failed me. Even though it had been in use for several months and had seen numerous successful test runs, one machine was (almost) completely screwed up after the process. It may be worth knowing that all of this takes place on rather "antique" Red Hat 9 systems. I have not had the time to double-check whether the following still holds true for more modern systems; any comments are appreciated.

The update logs contained a strange line:

useradd: unable to lock group file

Well, while the message itself is one of the more explicit ones, I could not understand what would be causing a lock on the group file. There is indeed a useradd command in the upgrade script; however, it is the only one, and the update takes place right after a reboot. Hence no other process should have been holding a lock on /etc/group.

Looking at the /etc directory I found these two files: /etc/group.lock and /etc/gshadow.lock.

I was quite surprised when I could readily add a new group manually on the affected machine with the exact same command line from the update script.

After that, a quick sweep over all production systems that are to be upgraded revealed that each and every one of them contained the two files named above, usually with a creation date at about the time of their initial setup. So how could any of those ever have been upgraded successfully? And my manual useradd had succeeded as well...

Turns out that the .lock files are not just simple flags. At first I thought that their mere presence acted as a signal to any command trying to change the contents of the group files. Apparently that's not the case; their content is relevant as well. Both contain the process ID (PID) of the command that last acted upon the /etc/group and /etc/gshadow files. Unfortunately, due to a bug described in a Red Hat errata entry (for RH Desktop 3, not specifically RH9), those lock files are not cleaned up when the useradd command exits, even when it exits normally.

So you get a stale lock file lying around. The next time useradd tries to modify the group files, it will not complain as long as no running process has the PID recorded in the .lock file! This is really nasty, because it will work almost every time you try. There is, however, a realistic chance that some unrelated process happens to get the same PID the first useradd once had, and is running at the moment you call useradd the second time. In that case, useradd mistakenly believes that another process is currently modifying the group or gshadow file and aborts with the error message above!
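To see which of these situations a given machine is in, a small check along the following lines might help. This is only a sketch: the function name lock_state is made up for illustration, and it assumes that the lock file stores the PID as decimal text (anything that is not a digit is stripped before use). It should be run as root, because kill -0 can also fail for a live process that belongs to another user.

```shell
#!/bin/bash
# Classify a lock file as "absent", "stale" or "held".
# Assumption: the file contains the locking PID as decimal text.
lock_state() {
    local file=$1 pid
    [ -e "$file" ] || { echo absent; return; }
    # Keep digits only, in case the PID is followed by a newline or NUL byte.
    pid=$(tr -cd '0-9' < "$file")
    if [ "${pid:-0}" -gt 0 ] && kill -0 "$pid" 2>/dev/null; then
        echo held     # a process with that PID is alive right now
    else
        echo stale    # no such process, so the lock would be ignored
    fi
}
```

Calling lock_state /etc/group.lock and lock_state /etc/gshadow.lock right before an update would show whether the next useradd is about to run into a "held" lock that in reality belongs to some unrelated process.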

I have not found an updated shadow-utils RPM for RH9, so right now I need to scan all our scripts for group modifications and replace the groupadd calls with a function or script that takes care of removing the lock files once the command has finished.

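#!/bin/bash
# Run groupadd with the caller's arguments, preserving their quoting.
groupadd "$@"
status=$?
# Remove the lock files that groupadd leaves behind.
rm -f /etc/group.lock /etc/gshadow.lock
# Report groupadd's own exit status, not that of rm.
exit $status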
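
For scripts that would rather call a function than a separate wrapper script, the same idea could be sketched roughly like this. The names run_and_unlock and LOCK_FILES are hypothetical and only serve the illustration. Like the script above, the function removes the lock files unconditionally, so it assumes that nothing else is modifying the group files at the same time.

```shell
#!/bin/bash
# Lock files to clear after each call (hypothetical variable, space-separated).
LOCK_FILES=${LOCK_FILES:-"/etc/group.lock /etc/gshadow.lock"}

# Run the given command, remove the lock files, and return the command's status.
run_and_unlock() {
    local status
    "$@"
    status=$?
    # Intentionally unquoted so that the list splits into separate paths.
    rm -f $LOCK_FILES
    return $status
}
```

A call in an update script would then read run_and_unlock groupadd followed by the usual groupadd arguments, and the script can still test the return code as before.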

Great that this hit us on a test system before rolling out into production.... Maybe sometimes Murphy is a nice guy after all ;-)
