At 16:29 -0400 99/6/20, firstname.lastname@example.org wrote:
>** MOTIVATION FOR TRANSPARENT PERSISTENCE
>There are strong (even religious) arguments for and against transparent
>persistence. In EROS, the design assumption is that mission critical programs
>are broken into separate processes for reasons of protection and fault
>containment. This leads to two design challenges:
> 1. How to avoid recreating them if the system crashes
> 2. How to keep them mutually consistent.
It may reek of religious bigotry but I do in fact advocate debugging the
(I just got my Crumudegon's License!)
Yes I understand that sometimes applications go into service before the last bug is found.
My petulance stems from seeing too many systems that rely on crashing and restarting as an alternative for debugging the application. Keykos ran a debit card application in Melbourne for about a year with no crashes.
If hardware makes un detected errors then a bug free application will not
We did some very crude experiments and discovered that randomly induced "hardware" errors almost always caused crashes before the next checkpoint. This will, however, depend sensitively on the nature of the application. It is clear that this will not be always true. The 370's that we ran on seemed to run for years between undetected errors and perhaps weeks between detected errors. (I don't count disk errors here which we tolerated by twinning (RAID).) We caught the 370/158 red-handed just once in an error that was not detected by the hardware. It caused a crash with circumstantial evidence that would convince even the ACLU.
The compartmentalization of authority which cause the real problems that Jonathan notes is the very source of the ultimate stability of the system. I wrote some notes just recently on debugging practice in Keykos: <http://www.mediacity.com/~norm/CapTheory/Debug.html>. It notes some of the ramifications of maintaining security during debugging. It also notes a few steps to ameliorate these problems. One of the consequences is towards personal responsibility towards data (state) integrity. Just as there are strict limits on what code can muck with particular data, with debugging itself limited to capability discipline, there are strict limits on who can be responsible for corrupted data.
When I observe corrupted data in a Unix environment I have almost never seen anyone feel responsible. The bug that caused it will probably remain for the life of the system. In Keykos you know what code and what people have had their fingers in the pie. Our experience was that the bugs got fixed. I will admit that it helped greatly that the programmers of the implicated code were generally close at hand! Handing code over to others is IBM's way. If responsibility can be handed over as well this may work.
I don't expect that people will believe that "debugging the application" is feasible. If you consider "debugging the application component" it might be more credible. I don't recall believing it until we merely charged ahead seeing no alternative. I do advocate "try it--you'll like it". Norman Hardy <http://www.mediacity.com/~norm>