Thursday, March 16, 2006

One simple line of code

Today's subject: a little parable about the complexities and obtuseness of our software-dependent world.

Last night during our launch we encountered a problem that prompted the decision to roll back. It wasn't that the code was horribly broken, but discovering the issue with only 15 minutes left before the "point-of-no-return" gave us no time to try for a fix.

Ten minutes after the rollback decision was made, the dev team announced that they'd tracked down the bug and that the fix was a simple one-line code change. After weighing the risk against the need to come back and try again the next day, we decided to go for it. Everything turned out well: although it took a little longer and caused some stomach churn, we were able to deploy successfully.

The rub is that, right at the start of our night, we'd discovered that a feature that had been working in production had suddenly stopped working. So during the course of the launch, our talented project manager was working that problem at the same time. All of the different network elements seemed to be working, and we were having no luck tracking down why the feature had failed. Early in the evening we found out that one of the gateway teams had been updating some code on the load balancers, but they didn't think that it would have caused the problem. So even after we'd successfully launched our code at 4:45, the team had to stay to work on this new issue. After six hours of troubleshooting with an ever-growing team of folks on the conference bridge, after waking folks up at 3:00 AM, and through a shift change on some teams, it was discovered that the "one line of code" implemented on the load balancers earlier was indeed the culprit.

One simple line of code...
