[cap-talk] Ariane 5 meme (was: Notes from Butler's 2006 Usenix)
jed at nersc.gov
Fri Apr 11 13:08:59 CDT 2008
On 4/11/2008 8:30 AM, David-Sarah Hopwood wrote:
> Jed Donnelley wrote:
>> Ariane 5 story (floating overflow and how it caused rocket
>> to fail).
> Just to avoid spreading inaccurate memes:
> The arithmetic was fixed point (that kind of system rarely uses
> floating point). The overflow *was* detected, but not handled
> correctly: the resulting error was reported in-band and confused
> for legitimate data (not that detecting it as an error would
> necessarily have been sufficient to save the mission at that point).
> The problem was not discovered in advance because it was incorrectly
> assumed that the testing of that part of the system done for Ariane 4
> was sufficient, even though the range of the parameter that overflowed
> was known to be greater in Ariane 5.
Butler was discussing how there is well understood technology
for increasing dependability through redundancy. He was noting,
however, that you need independence of the failure mechanisms
which is often difficult to achieve with software. Since I was
just taking notes, here is what Butler said literally about Ariane 5:
The reason for that <destruction of Ariane 5> was that
there was an overflow in the floating point to integer
conversion inside a module that was actually not being
used for anything. The same overflow occurred in both
copies of the machine. For a variety of complicated
reasons that caused the entire guidance system to shut
down, which meant that the rocket couldn't function.
It's a fascinating story. You can find the exemplary
report of the commission of inquiry on the Web I think.
Now I hope I'm out of the middle and any disagreement
is between Butler and others (interpretations of words, etc.).
I think my brief note fit well enough with what he said,
and actually with what you wrote David. I'm not sure why
you considered the note an "inaccurate meme". Neither Butler
nor I said anything about whether or not the floating overflow
was detected or whether the floating point arithmetic
was handled by hardware or software. Was he wrong that
a floating point to integer conversion was involved? The
above Wikipedia article says:
"Because of the different flight path, a data conversion
from a 64-bit floating point to 16-bit signed integer value
caused a hardware exception (more specifically, an arithmetic
overflow, as the floating point number had a value too
large to be represented by a 16-bit signed integer)."
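
To make the conversion in that quote concrete, here is a minimal Python
sketch of turning a 64-bit float into a 16-bit signed integer. The function
names and sample values are my own illustrative assumptions, not anything
taken from the Ariane flight software (which was not written in Python). It
only shows the difference between reporting an out-of-range value as an
error out-of-band and clamping it to the representable range.

```python
INT16_MIN = -2**15      # -32768
INT16_MAX = 2**15 - 1   #  32767


def to_int16_checked(x: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit.

    NaN and infinity are not handled specially here; int() itself raises
    on those inputs.
    """
    n = int(x)  # truncates toward zero
    if n < INT16_MIN or n > INT16_MAX:
        raise OverflowError(f"{x!r} does not fit in a 16-bit signed integer")
    return n


def to_int16_saturating(x: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))
```

With these definitions, to_int16_checked(40000.0) raises OverflowError,
while to_int16_saturating(40000.0) returns 32767. Which behavior is right
depends on what the caller does with the result; neither choice by itself
decides how the rest of the system reacts to the error.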
I guess sometimes people are particularly sensitive about
From the viewpoint of cap-talk, however, I think perhaps
there isn't much value further discussing Ariane 5 or its
memes or even such software failures, unless perhaps we'd
like to discuss how the capability model might be used to
make such systems more reliable. ??
Just briefly on that topic (if others are interested):
Butler said a bit more about avoiding catastrophes in
later discussion (e.g. mentioning the USS Yorktown
whose engines were shut down due to a software failure,
and the well-known Therac 25 problem (Butler misspelled
). Heh. I was in Germany in 1995 when the whole vaunted
German rail scheduling system shut down for the better part
of a day, apparently because of a software flaw in a system
upgrade that they couldn't quickly back out.
I do believe capability technology can help improve
the reliability of software, but I think it could make
a rather minor contribution. The main issue seems to
be in separating what is more important from what
is less important and simplifying the processing for
what is the most important so that it can be carefully
reviewed for correctness. Capabilities can help provide
the needed separation of the different functions, but
of course they can't help architect the separation to
begin with or make the code in any section correct/safe.
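
As a rough illustration of the kind of separation I mean, here is a small
Python sketch in an object-capability style. Everything in it
(make_guidance, alignment_task, run_isolated, the sample values) is
hypothetical and invented for this example. Python does not enforce
capability discipline, so the closures only approximate it: the less
important task is handed just the two references it needs and never
receives one that could change the steering state.

```python
from typing import Callable, List, Tuple


def make_guidance() -> Tuple[Callable[[float], None], Callable[[], float]]:
    """Most important function: the steering state is reachable only
    through the two closures returned here."""
    state = {"steering": 0.0}

    def set_steering(value: float) -> None:
        state["steering"] = value

    def read_steering() -> float:
        return state["steering"]

    return set_steering, read_steering


def alignment_task(read_sensor: Callable[[], float],
                   report: Callable[[str], None]) -> None:
    """Less important function: receives only a sensor reader and a
    report sink, and has no reference to the steering state."""
    value = read_sensor()
    if abs(value) > 32767:
        raise OverflowError("sensor value out of range")
    report(f"alignment ok: {value}")


def run_isolated(task: Callable[[], None], log: List[str]) -> None:
    """Run a less important task; a failure is logged, not propagated."""
    try:
        task()
    except Exception as exc:
        log.append(f"non-critical task failed: {exc}")


set_steering, read_steering = make_guidance()
set_steering(1.5)
log: List[str] = []
# The task overflows on a hypothetical out-of-range reading of 40000.0.
run_isolated(lambda: alignment_task(lambda: 40000.0, log.append), log)
```

After this runs, read_steering() still returns 1.5 and the log holds one
failure entry. Deciding that the alignment task is the less important one,
and that its failure may safely be ignored, is still the designer's job;
the capabilities only make that decision enforceable.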