April 05, 2003

Technical creativity: when things go wrong

Back in the day when I was in school, one of the questions on the Operating Systems midterm involved multiple threads writing to a number of variables without any attempt to take turns. The question was basically: "List all the things that can happen in this situation." At the time this seemed like a stupid question, since all the things that could happen were bad, and the fix was to wrap the block of code that wrote the values in some sort of critical section so that only one thread could write at a time. I was not alone in this belief. A guy in my class wrote an answer similar to the following:

While many horrible things can happen in the situation above, I feel that these things can be easily prevented and will not list them here. I intend to devote my life to squashing these kinds of bugs.

The professor gave the guy one point for his creativity, and it eased the tension later when it was read aloud to a class full of people who had just failed the midterm. Anyway, looking back on this question, I can now see just how important it was to be able to think that way: "How can things go wrong?"
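For what it's worth, here is roughly what a real answer to that midterm question might look like in miniature. This is only a sketch under my own assumptions, not the actual exam problem: two hypothetical threads, A and B, each read a shared counter and then write back the value plus one. The helper names (interleavings, run, possible_outcomes) are made up for illustration. The code lists every order the steps can run in and collects the final values, first without a critical section and then with one.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (thread, op) steps against a shared counter."""
    shared = 0   # the variable both threads write to
    local = {}   # each thread's private copy of what it last read
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared
        else:  # "write": store the stale-or-fresh local value plus one
            shared = local[thread] + 1
    return shared


def possible_outcomes(critical_section):
    """Return the sorted list of final counter values over all allowed schedules."""
    steps_a = [("A", "read"), ("A", "write")]
    steps_b = [("B", "read"), ("B", "write")]
    if critical_section:
        # With a critical section, each thread's read+write runs as one unit,
        # so the only choice left is which thread goes first.
        schedules = [steps_a + steps_b, steps_b + steps_a]
    else:
        schedules = interleavings(steps_a, steps_b)
    return sorted({run(s) for s in schedules})


if __name__ == "__main__":
    print("no critical section:", possible_outcomes(False))
    print("critical section:   ", possible_outcomes(True))
```

Without the critical section, one of the increments can get lost whenever both threads read before either writes. With it, both increments always land. Even with only four steps there are six possible schedules, which gives a feel for why the professor wanted us to practice listing them.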

When you are working on some code, you like to think that you know every path through it and exactly what the program will do when it runs. But when you take your little part, integrate it with your neighbor's part, and run it in the lab or in the real world, you see unexpected behavior. A typical initial reaction to this is simply surprise. You don't understand what is going on, so you are just sort of confused by it. Something that you didn't think of is happening.

This is when what I like to call technical creativity comes into play. It is not the same creativity you use when you apply your sense of order, simplicity, and beauty to designing components; it is the creativity of brainstorming crazy situations that could explain unexpected behavior. This skill is important because, as a developer, you live in a world in which the machine knows more than you do most of the time: it can't cut corners, and it sees every path at runtime. For this reason you have to be very systematic in your approach, but also willing to be unconventional and question subsystems and components you previously trusted.

An example of technical creativity at play is the following story about a strange problem at Bell Labs back in the day. An engineer was working in a lab where he would sometimes stand up to use a certain test machine and other times pull up a chair and sit for a while. When he stood, he had no problem logging in to the machine, but when he sat down he could never log in correctly. After this happened a few times he grabbed some other engineers, and they all sat there in amazement as he showed them the full cycle of the problem three times. They all sort of laughed at it and scratched their heads while talking out loud: the machine doesn't care if you are standing or sitting, it doesn't know that, and you are typing in the same thing every time, you just did it three times, etc. After a little while one of the engineers went over and looked at the keyboard. He noticed that two of the keycaps were in the wrong place: the j and k were switched. When the first engineer stood up, he looked at his hands while he typed since he was bent over, so he chose the k by finding it with his eyes. When he sat down, he typed by feel, and so typed what should have been a k but was a j instead.

Now, the problem ended up being that a trusted part of the machine was wrong, but the way they tracked it down was by reasoning that the machine does not change its behavior unless you change yours, so the engineer must have been doing something different each time. From there they were able to see the problem. Sounds crazy, but the keyboard was broken.

As I continue working on problems I see that the experienced guys around here have a really good sense of what to question and what to trust. They always assume that the problem is with their application and not MFC or the OS. They are good at tracking what has changed in their code from build to build so that they can see if problems have been introduced, or if they have tickled just the right spot in another system by making a change in theirs.

April 5, 2003 02:54 AM