Tuesday, 4 June 1996 will forever be remembered as a dark day for the European Space Agency (Esa). The first flight of the crewless Ariane 5 rocket, carrying with it four very expensive scientific satellites, ended after 39 seconds in an unholy ball of smoke and fire. It’s estimated that the explosion resulted in a loss of $370m (£240m).
What happened? It wasn’t a mechanical failure or an act of sabotage. No, the launch ended in disaster thanks to a simple software bug. A computer getting its maths wrong – essentially getting overwhelmed by a number bigger than it expected.
How is it possible that computers get befuddled by numbers in this way? It turns out such errors are responsible for a series of disasters and mishaps in recent years, destroying rockets, making space probes go missing, and sending missiles off-target. So what are these bugs, and why do they happen?
Imagine trying to represent a value of, say, 105,350 miles on an odometer that has a maximum value of 99,999. The counter would “roll over” to 00,000 and then count up to 5,350, the remaining value. This is the same species of inaccuracy that doomed the 1996 Ariane 5 launch. More technically, it’s called “integer overflow”, meaning that a number is too big for the storage space a computer system has set aside for it, and sometimes this can cause a malfunction.
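The odometer analogy can be written out as a few lines of Python. This is only an illustrative sketch: the function name `odometer_reading` and the five-digit default are assumptions made for the example, not anything taken from a real system.

```python
def odometer_reading(miles_travelled: int, digits: int = 5) -> int:
    """Return what an odometer with `digits` wheels would display."""
    # A five-digit odometer has 100,000 distinct readings: 00,000 to 99,999.
    capacity = 10 ** digits
    # Anything beyond the capacity is lost; only the remainder is shown.
    return miles_travelled % capacity


print(f"{odometer_reading(99_999):05d}")   # the last reading before rollover
print(f"{odometer_reading(100_000):05d}")  # the counter rolls over
print(f"{odometer_reading(105_350):05d}")  # the true distance is no longer visible
```

The display alone cannot distinguish 5,350 miles from 105,350 miles, which is the essence of the problem.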
Failure to launch
A full investigation of the Ariane incident found that a process left over from software in the previous generation of rockets, Ariane 4, had captured an unexpectedly high reading for the sideways velocity of the newer, faster vehicle – and the Ariane 5 rocket’s software couldn’t handle this high figure. A self-destruct sequence was initiated. A couple of seconds later, the rocket was history.
Such glitches emerge with surprising frequency. It’s suspected that the reason why Nasa lost contact with the Deep Impact space probe in 2013 was an integer limit being reached.
And just last week it was reported that Boeing 787 aircraft may suffer from a similar issue. The control unit managing the delivery of power to the plane’s engines will automatically enter a failsafe mode – and shut down the engines – if it has been left on for over 248 days. Hypothetically, the engines could suddenly halt even in mid-flight. The Federal Aviation Administration’s directive on the matter states that a counter in the control unit’s software will “overflow” after this specific period of time, causing an error. Although scant details have been released – the FAA and Boeing declined to comment for this article – some amateur observers have pointed out that 248 days, when counted in 100ths of a second, comes to roughly 2,147,483,647 (the exact figure corresponds to about 248.55 days) – which is significant.
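The observers' arithmetic is easy to check. The sketch below assumes, as they do, that the counter ticks 100 times a second and is held in a signed 32-bit value; neither detail has been confirmed by the FAA or Boeing, and the names used are made up for the example.

```python
INT32_MAX = 2**31 - 1     # 2,147,483,647
TICKS_PER_SECOND = 100    # assumption: the counter counts hundredths of a second
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_overflow(max_value: int = INT32_MAX,
                        ticks_per_second: int = TICKS_PER_SECOND) -> float:
    """Days a counter can run before it exceeds max_value."""
    return max_value / (ticks_per_second * SECONDS_PER_DAY)


print(f"{days_until_overflow():.2f} days")
```

Under those assumptions the counter runs out of room a little more than halfway through day 249, which matches the 248-day figure in the directive.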
How so? It just so happens that 2,147,483,647 is the maximum positive value that can be stored by a “32-bit signed register”, commonly installed on many computer systems. On Ariane, by comparison, the software was using a “16-bit” space, which is much smaller and only capable of storing a maximum value of 32,767.
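Both limits follow from the number of bits available: one bit records the sign, and the rest hold the value. The Python sketch below derives the two maximums and shows two common ways a system can react when a value is too large, either wrapping around silently or raising an error. What real flight software does depends on its language and hardware, so the helper names and behaviours here are illustrative assumptions only.

```python
def signed_max(bits: int) -> int:
    """Largest value a signed integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


def wrap_signed(value: int, bits: int) -> int:
    """Mimic silent wraparound: keep only the low `bits` bits of value."""
    modulus = 1 << bits
    value &= modulus - 1                 # discard everything above the width
    if value >= 1 << (bits - 1):         # top bit set means a negative number
        value -= modulus
    return value


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, refusing values that do not fit."""
    result = int(value)
    if not -signed_max(16) - 1 <= result <= signed_max(16):
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result


print(signed_max(16), signed_max(32))
print(wrap_signed(32_767, 16))   # fits, so it is unchanged
print(wrap_signed(32_768, 16))   # one too many: wraps to a negative number
```

Calling `to_int16_checked(40_000.0)` raises `OverflowError` instead of returning a wrong answer, which is safer only if something is in place to handle the error.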
Numbers are infinite, so why choose such limited storage spaces for them? The answer is that computers have traditionally demanded efficiency in all things. Storage space used to be much more costly than it is today and processing larger values took longer. If you kept to certain limits, software was expected to run more smoothly. Rocket guidance systems do a lot of critical number crunching very quickly, so these overheads certainly matter. The problem with that, as the Ariane 5 proved, is that such limitations aren’t always foreseen as problematic.
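To get a feel for the storage side of that trade-off, the sketch below compares how many bytes a batch of readings would occupy at different widths. The figure of one million readings is a hypothetical chosen for illustration, and `bytes_needed` is a made-up helper built on Python's standard `struct` module.

```python
import struct

READINGS = 1_000_000  # hypothetical number of stored sensor readings


def bytes_needed(format_code: str, count: int = READINGS) -> int:
    """Bytes required to store `count` integers of the given struct format."""
    # "<" selects struct's standard sizes: h = 2 bytes, i = 4, q = 8.
    return struct.calcsize("<" + format_code) * count


for label, code in (("16-bit", "h"), ("32-bit", "i"), ("64-bit", "q")):
    print(f"{label}: {bytes_needed(code):,} bytes")
```

The 64-bit version needs four times the space of the 16-bit one, a difference that mattered far more when memory was scarce.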
“We have to recognise that in software we are always approximating reality,” explains Bill Scherlis, a software expert at Carnegie Mellon University. “There’s always an engineering trade-off between the cost of having a more precise representation and the benefit of the efficiency.”