off by one for 2009 June 24 (entry 0)

< Remedial Coding: Never. Assume. Anything.
Highway Throbbery >

[Trackback URL for this entry] Wed Jun 24 10:46:57 IT'S NOT A RACE CONDITION:

Here's another in an embarrassing series of stupid programming mistakes:

I don't know about you, but I have a bad habit: when I encounter a strange bug in my code and I'm not sure how it could have happened, or it involves some kind of event that happened or didn't happen when I thought it was supposed to, and especially if it involves data corruption, I start thinking it's a race condition. Which, of course reminds me of a certain House, M.D. meme.

It COULD be a race condition. It's possible. Especially if I'm testing on a multi-core machine. (And these days, aren't they all?) It's also more likely to be a race if it locks the machine, or if it's unpredictable. But every time -- every single time -- I have thought that an elusive bug was a race condition, it wasn't. It was an ordinary, mundane, bone-headed move on my part.

Allow me to digress for a moment. In religious circles, you hear people talk about "sins of commission" versus "sins of omission." In the former, it's something you've done, such as stealing, murdering, or coveting your neighbor's gorgeous donkey. Sins of omission are the things you haven't done, but should have, such as failing to honor your father and mother or not loving your neighbor as much as you love yourself. It's easier to recognize the wrong things we do than to recognize the right things we fail to do. It's similar in programming. Most mundane bugs are "wrong things we do" -- erroneous code we wrote into the program. Most race conditions (in my experience) are the result of necessary things we failed to put into the program (locking and/or synchronization).

It's fundamentally hard to pore over your own work to find mistakes. You kind of have to pretend you didn't write it -- or assume that it's wrong -- because if you'd known it was broken, you wouldn't have written it that way to begin with. It can also be hard to swallow your pride and admit that your work is also the most likely source of error. Subconsciously choosing to assume that the problem is a race somehow saves face (even while it dooms you to hours of adding specious locks and fruitless poking around).

In my experience at least, you're only fooling yourself. Next time, unless you're absolutely positive it's a race -- assume that it's not, and start looking for assumptions you made in your own code. Maybe you calculated that pointer incorrectly. Or maybe your loop is exiting earlier than it should. That innocuous helper function that you skimmed may be stabbing you in the back.

Filed under: technical:coding, technical:kernel


Unless otherwise noted, all content licensed by Peter A. H. Peterson
under a Creative Commons License.