As a sort of coda to yesterday’s post on the errors in the C code running the Toyota acceleration system, here’s another example of a spectacular failure in C code. This one resulted in huge losses for AT&T, caused loss of service for many phone users, and even delayed hundreds of airline flights.
What’s different about this incident is that it happened at AT&T, where C was invented and where there were serious testing protocols in place. Nevertheless, a single misplaced break in a switch statement caused a cascade of errors that brought down a significant portion of the AT&T switching system. The first level of remediation was to fall back to the previous version of the switching software. That happened relatively quickly (about 9 hours), but it took much longer, about 2 weeks, for the engineers to figure out what happened and where the error was.
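To see how a misplaced break can bite, here is a minimal sketch in C. It is not the AT&T code: the names (msg_kind, state, handle_buggy, handle_fixed) and the logic are made up for illustration. The only thing it relies on is the language rule that a break inside an if exits the enclosing switch (or loop), not the if itself.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; nothing here comes from the AT&T source. */
enum msg_kind { MSG_STATUS, MSG_OTHER };

struct state {
    bool buffer_full;  /* set when the (imaginary) buffer needs clearing */
    int  handled;      /* count of messages fully processed */
};

/* Buggy version: the author wants to leave the if early, but the break
   exits the whole switch, so the bookkeeping below it is skipped. */
void handle_buggy(struct state *s, enum msg_kind kind)
{
    switch (kind) {
    case MSG_STATUS:
        if (s->buffer_full) {
            s->buffer_full = false;
            break;              /* leaves the switch, not just the if */
        }
        s->handled++;           /* never reached when the buffer was full */
        break;
    default:
        break;
    }
}

/* Fixed version: without the stray break, control falls out of the if
   and the message is counted as intended. */
void handle_fixed(struct state *s, enum msg_kind kind)
{
    switch (kind) {
    case MSG_STATUS:
        if (s->buffer_full) {
            s->buffer_full = false;
        }
        s->handled++;
        break;
    default:
        break;
    }
}
```

Both versions compile cleanly and behave identically until the buffer happens to be full when a status message arrives, which is exactly the kind of corner case that is hard to hit in testing.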
The point here is not to feed the anti-C trolls or to stump for Rust, Ada, or any other supposedly “safe” language. Rather, it’s to underline the well-known fact that even with first-class engineers, any large program, regardless of what language it’s written in, will inevitably have errors, and some of them will be corner cases that are hard to test for.
The real surprise here is how few such incidents AT&T experienced. Their testing and review processes really did make a difference, but nothing can prevent all errors.