Backing up one more step...
Bryan Ford
baford@schirf.cs.utah.edu
Mon, 05 Dec 94 20:46:35 MST
>Back when I was looking at distributing KeyKOS, the interest was
>prompted in part by some work on ATM-based distributed shared memory
>(DSM) that was being done by John Shaffer here at Penn. The question
>I was pursuing was really: could we take the single level store model
>of KeyKOS and implement a DSM layer beneath it. The idea was to get
>to a single system image running on a cluster of workstations.
>
>The problem is that the single level store in KeyKOS *is* the system,
>and as such, it's very difficult to know what to do about partial
>failures. If a KeyKOS node is loaded to a workstation that crashes,
>it's not at all obvious how much of the system needs to roll back to
>the previous checkpoint. It's also not obvious how to efficiently
>track the dependencies. The harder I looked at mechanisms for
>dependency tracking, the more I concluded that the only answer I could
>find that had acceptable overhead was to roll back the cluster in the
>event of unplanned shutdown.
One way to approach this problem, which I think would be a very
useful solution, and probably not impossibly difficult, is to allow
any particular node to support multiple "persistence districts" at
once, each of which is a possibly distributed set of processes and
related resources that gets treated as an atomic unit in terms of
persistence and fault tolerance. If a particular node goes down,
all the persistence districts that were dependent on that node have
to be rolled back, but no others do. Rolling back one persistence
district effectively breaks connections between it and other
districts, so those connections have to be reestablished in some
way on restart. (Reestablishment could be mostly automatic, depending
on how things are implemented.) Of course, all connections within
a district would be preserved across restarts.
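
To make the rollback rule concrete, here is a minimal sketch in Python
of what happens when a node goes down. Everything in it (the District
record, the node_failed function, the rollbacks counter) is a
hypothetical name made up for illustration, not part of KeyKOS or any
existing system, and it assumes each district simply knows the set of
nodes it currently depends on.

```python
from dataclasses import dataclass, field


@dataclass
class District:
    """Hypothetical record for one persistence district."""
    name: str
    nodes: set                                # nodes this district depends on
    links: set = field(default_factory=set)   # names of connected districts
    rollbacks: int = 0                        # times rolled back to a checkpoint


def node_failed(districts, node):
    """Roll back every district that depends on `node`.

    Returns the set of names of the districts that were rolled back.
    Districts that do not depend on the failed node are left running;
    they only lose their connections to the rolled-back districts.
    """
    hit = {d.name for d in districts.values() if node in d.nodes}
    for name in hit:
        d = districts[name]
        d.rollbacks += 1          # stand-in for "restore last checkpoint"
        # Connections to other districts are broken in both directions
        # and would have to be reestablished on restart.
        for other in list(d.links):
            districts[other].links.discard(name)
        d.links.clear()
    return hit
```

Connections inside a district are not modeled here at all, since they
are part of the checkpointed state and survive the restart unchanged.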
(BTW, the term 'district' is just something that happened to be on the top
of my head; if this is a concept that already has a well-known term, or if
you've got something better, I'd like to know. :-) )
Basically, here's how I envision such a system might work. Each
node starts up initially with the basic OS and user-level support
code in no persistence district, so that code is not persistent
and will be reinitialized from scratch on restart - basically like
conventional Unix systems. Hence, things like user-level device
drivers, which really wouldn't make sense as persistent entities
anyway, operate "outside" all persistence boundaries. (Does KeyKOS
have any notion of "non-persistent processes" or anything like that,
or does _everything_ outside the kernel always have to be persistent?)
At some point, some non-persistent user-level code can then create
new persistence districts, or restart old ones. Anyone with ownership
of (but not necessarily direct access to) some amount of sufficiently-
trusted backing storage could create a new persistence district with it -
basically like "formatting" a new KeyKOS system. How those districts
are created and controlled, and when they are rolled back, could be
a matter of local policy.
One obvious way to use such a system would be to allow different users each
to operate in a separate persistence district (or even to own several
independent districts). Then each user could be responsible for deciding
which nodes to run his district(s) on, depending on how much he trusts
those nodes not to go down "too often". For example, supposing the system
has some kind of Unix-like login procedure, the initial shell prompt you
get immediately after login, and everything underneath it, might be
non-persistent; but then you type something like 'restart my-district', and
the contents of that district get reloaded and restarted.
The backing storage for these districts could even be implemented as
ordinary files on conventional file systems - if the implementation of
persistence districts is such that the actual bits of the backing storage
must not be seen or touched by untrusted code (e.g. if the backing storage
contains snapshots of trusted kernel-only state, as in KeyKOS), then just
make sure that the act of "formatting" a persistence district file
sets the file ownership to root and the permissions to 600 (or the
equivalent thereof).
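
As a rough sketch of that "formatting" step on a Unix-like host, the
Python below creates the backing file with mode 600 and hands it to
root when it has the privilege to do so. The function name
format_district_file and its size parameter are assumptions for
illustration only; a real implementation would also have to write
whatever initial checkpoint structure the district needs.

```python
import os


def format_district_file(path, size_bytes):
    """Create an empty backing file for a district (hypothetical helper).

    The file is created with mode 0600 so that only its owner can read
    or write the raw bits. O_EXCL makes the call fail rather than
    silently reuse an existing file.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.fchmod(fd, 0o600)          # make the mode explicit, whatever the umask
        os.ftruncate(fd, size_bytes)  # reserve the (sparse) backing space
    finally:
        os.close(fd)
    # Giving the file to root requires privilege; skip it otherwise.
    if os.geteuid() == 0:
        os.chown(path, 0, 0)
```

Run by an unprivileged user, this leaves the file owned by that user,
so it only protects the bits from other users; the root-owned case is
the one that keeps untrusted code out entirely.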
Anyway, there would certainly be some unique challenges in building such a
system, but I can't think of any really serious problems, and I think in
practice it would solve the distributed checkpoint/restart granularity
problem you're talking about in a fairly clean, controllable, deterministic
way. It also has the advantage that it might be possible to integrate such
a feature into a Unix-lookalike system such as Mach, without totally
changing the way all the lower-level parts of the system work.
Bryan