on GUIs and such things
Bill Frantz
frantz@communities.com
Tue, 25 Jul 2000 10:35:00 -0700
At 11:37 AM 7/25/00 -0400, Jonathan S. Shapiro wrote:
>> > ECC RAM supports reasonable error detection and only modest error
>> > recovery. The problem is that as memory densities improve the particle
>> > hits knock out more bits, and you start to need more correction bits
>> > in the RAM.
>>
>> Do we have stats on how often this happens?
>
>In 1987, when we first installed parity memory in an early 3b2, we observed
>four crashes in the course of a year that were ultimately attributed to
>particle hits. This is already too many, but as memory density has gone up
>so has the hit rate.
I will bow to your superior knowledge in this area, Jonathan, but there is
an observation that bothers me. If we build a memory card (~= SIMM/DIMM) out
of multiple chips, then it seems unlikely that a single alpha particle
would affect more than one chip. (We hope that the normal DRAM refresh
logic runs the ECC and recovers from transient damage to any of the bits.)
The remaining problem is getting memory cards with enough chips (or the
bits of a word spread out "enough" on single chips) to let the ECC work.
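
To make the chip-spreading argument concrete, here is a minimal sketch in
Python. It assumes a toy Hamming(7,4) code, which corrects one flipped bit
per code word, in place of the wider codes a real ECC memory card would use.
All of the names (encode, decode, store_spread, read_spread, hit) are
hypothetical and exist only for this illustration.

```python
def encode(nibble):
    """Hamming(7,4): return the 7 code bits (positions 1..7) for a 4-bit value."""
    d = [(nibble >> i) & 1 for i in range(4)]    # d[0] is the low-order bit
    code = [0] * 8                               # index 0 unused
    code[3], code[5], code[6], code[7] = d       # data bit positions
    code[1] = code[3] ^ code[5] ^ code[7]        # parity over 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]        # parity over 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]        # parity over 4, 5, 6, 7
    return code[1:]


def decode(bits):
    """Correct at most one flipped bit, then return the 4-bit value."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                      # XOR of the set positions
    if syndrome:                                 # nonzero: points at the bad bit
        code[syndrome] ^= 1
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d))


def store_spread(words):
    """Assumed layout: chip c holds bit c of every word (7 chips in all)."""
    codes = [encode(w) for w in words]
    return [[codes[w][c] for w in range(len(words))] for c in range(7)]


def read_spread(chips):
    """Gather one bit from each chip per word and run it through the ECC."""
    count = len(chips[0])
    return [decode([chips[c][w] for c in range(7)]) for w in range(count)]


def hit(chip, cells):
    """Model a particle hit that flips several cells on a single chip."""
    for i in cells:
        chip[i] ^= 1


words = [0x3, 0xA, 0xF, 0x0]
chips = store_spread(words)
hit(chips[2], [0, 1, 2])             # three cells damaged, all on one chip
assert read_spread(chips) == words   # each word lost only one bit

# Contrast: two flipped bits in the same word defeat this code.
packed = encode(0xA)
packed[0] ^= 1
packed[1] ^= 1
assert decode(packed) != 0xA
```

In the spread layout a hit on one chip damages at most one bit of any given
word, so every word stays within what the code can correct. When two bits of
the same word are flipped, as could happen if they shared a chip, the decoder
"corrects" the wrong bit and returns bad data without any warning.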

One other interesting factoid is that the early Cray 1 installed at Los
Alamos had parity memory. After several months of operation, they
discovered that the largest single cause of system failure was single-bit
memory errors. Later Cray 1s had ECC. (And modern systems are bigger and
faster than that Cray 1.) ECC memory seems attractive for EROS machines as
well.
Cheers - Bill