Here's some information about availability on Tandem NonStop S-series systems.
One has to distinguish unplanned outages from planned outages; certain configuration changes or software updates require a planned outage (I believe the EROS design has an advantage over Tandem here). One also has to distinguish outages over which Tandem has some control, such as system software failures, from those over which it has none, such as a power failure without a UPS. Also, the customer considers it an outage if their application is down, even if parts of the operating system are still running.
It's difficult to get good numbers. What's interesting is the availability of actual customer systems doing real work in real environments; unfortunately customers don't always inform Tandem every time their system goes down, and they may not keep good records of how long it was down. We do hear about most of the outages that are relevant to us, and some of the irrelevant ones, and even some of the planned outages.
That said, we find that customer systems have an MTBF of about two years per system. A system has two to 16 processors, and failure of one processor doesn't usually cause an outage.
I don't have availability numbers (as opposed to MTBF) readily at hand.
shapj@us.ibm.com wrote:
> - Do checkpoints help in 24/7 uptime?
>
> Not directly; if you fail you still lose connections. They help indirectly in
> several ways:
>
> 1. System restart time is dramatically reduced. In the case of KeyKOS on older
> machines it was about 30 seconds. It's roughly the same on modern machines
> because the BIOS programs are getting more complex. KeyKOS had a software MTBF
> exceeding 18 months. EROS, with some further debugging, should certainly reach
> the same target. I believe that this is better than any system currently
> shipping -- probably including the Tandem NonStop systems -- Charlie Landau will
> know the numbers.
>
> If we translate the KeyKOS recovery times into availability metrics (which are
> meaningless, but currently popular; the poor person at the ATM machine doesn't
> care that it will be back in 5 minutes), KeyKOS (and therefore presumably EROS)
> had uptimes of 99.99993658042% (six nines) ON A UNIPROCESSOR. If memory serves
> me, the very top of the line IBM machines are now providing "five nines" (i.e.
> 99.99999% [should be 99.999 - CRL] uptimes), but to do it are relying on the
> ability to fail processors as independent units within a multiprocessor. Note,
> by the way, that EROS implements a couple of checks that KeyKOS did not --
> possibly enough to buy the marginal uptime.
>
> By the way, the above numbers reflect uptime measured in the field under real
> applications. UNIX uptime numbers are generally published based on synthetic
> benchmarks, and should therefore be treated with a certain amount of suspicion
> -- your mileage will vary. The best systems are generally now claiming 5
> minutes of downtime a year, or 99.99904870624% (five nines). Note that all systems
> actually achieving this are multiprocessors, and that if partial failures are
> considered the real number is rather lower.
>
> I don't know of any other system that has come anywhere close to this on a
> uniprocessor. Charlie?
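
For what it's worth, the percentages quoted above follow from simple arithmetic: one minus downtime divided by the period over which that downtime occurs. Here is a minimal sketch in Python. It assumes a 365-day year, takes "18 months" as 1.5 such years, and charges one 30-second restart per MTBF interval; those are my assumptions, and the function names are only illustrative.

```python
import math

# Assumption: a 365-day year, with no leap-day correction.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability_percent(downtime_seconds, period_seconds):
    """Percentage of the period during which the system was up."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)


def count_nines(downtime_seconds, period_seconds):
    """Number of leading nines in the availability figure,
    e.g. 5 for 99.9990...% and 6 for 99.99993...%."""
    unavailable_fraction = downtime_seconds / period_seconds
    return math.floor(-math.log10(unavailable_fraction))


# KeyKOS figure: one 30-second restart per 18 months (taken as 1.5 years).
keykos_period = 1.5 * SECONDS_PER_YEAR
keykos = availability_percent(30, keykos_period)

# "Best systems" figure: 5 minutes of downtime per year.
best = availability_percent(5 * 60, SECONDS_PER_YEAR)

print(f"KeyKOS: {keykos:.11f}%  nines: {count_nines(30, keykos_period)}")
print(f"Best:   {best:.11f}%  nines: {count_nines(5 * 60, SECONDS_PER_YEAR)}")
```

The same arithmetic shows why an MTBF by itself does not give an availability number: you also need the downtime per outage, which is the part customers often don't record.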