5 Uses of Determinism
Mark S. Miller
markm@caplet.com
Thu, 06 May 1999 03:57:42 -0700
The Turing and Von Neumann models of computation are strictly deterministic
(as is that other Von Neumann model, cellular automata). In all of these
systems, given the initial state of a computation and the specification of
the machine it ran on, one can necessarily deterministically replay the
computation.
Of course, this is true only because these models idealize-away many of the
messy issues that plague real computation. Actors are an alternative
foundational model of computation that idealizes-in many of the issues that
these systems idealize-away.
The first such issue, explored by many besides Actors, is the contrast of
"transformational" vs "reactive" computation. A compiler is
transformational: all its inputs exist before it begins, it runs
deterministically producing output, and terminates. A compiler that doesn't
terminate is buggy.
A calculator is reactive: new inputs come in after it's started, these
inputs effect both its state and its outputs. Its outputs may effect what
further inputs it gets, so these outputs need to be transmitted before
the calculator terminates. We do not require the calculator to terminate in
order to be correct. Indeed, for a calculator, the opposite is true. Even
though you can replay a calculator from a log of its inputs, you cannot
model a calculator as something that computes from pre-existing inputs,
since the inputs depend (via the external human) on previous outputs.
Actors are inherently distributed and asynchronous, and so are pervasively
non-deterministic. Actors successfully idealize all non-determinism as
"arrival order non-determinism": Messages race towards an actor from
multiple asynchronous sources. As Einstein would insist, these messages
had no prior determined intrinsic full order. Nevertheless, they are
delivered to the recipient actor in some *partially arbitrary* fully
serialized order. In order to replay an actor, you not only need the
messages sent to it by the external world, you need to know how these races
were resolved.
An E Vat is a lot like an Actor, and at Vat-to-Vat granularity, E
computation is a lot like Actor computation. Within a Vat we have
approximately the singly threaded synchronous deterministic world of Von
Neumann. It seems this is a good enough approximation that we could, under
constrained but useful circumstances, arrange to segregate all the
non-deterministic inputs to a Vat (or a subgraph of objects within a Vat)
such that the remainder of the Vat's computation is deterministically
replayable from these non-deterministic inputs. Why might we want to
*eventually* do this?
1) Cheaper Commitment.
Databases often use this technique by writing a "transaction log". Once a
transaction request has been successfully logged, the transaction can be
committed even though a crash may destroy its result state, given that the
transaction can be deterministically reproduced from the logged request. I
first saw this technique applied to committing general purpose
computation by Rob Strom: Do a full checkpoint occasionally, and cheaply
log inputs until you take the next checkpoint.
It's not clear whether this wins over KeyKOS/EROS mmu-based checkpointing,
since that's also incremental. But is sure beats the crap out of
non-incremental checkpointing, which is the only reasonable alternative for
those sitting above the jvm.
2) Cheaper Fault Tolerance.
This is really a variation of #1, where the understudy is separated in space
rather than time. I first saw this technique applied by Auragen (which
since disappeared), and since rather brilliantly by Rob Strom.
3) Inward Bit Confinement.
Once we've plugged all means of communication accounted for by causality
in our computational-model (or system-specification), all remaining
ability to communicate must ride on informational inputs that are modelled
as uncaused ("spontaneous"?). These are exactly the non-deterministic
inputs to computation. In the confinement scenario
http://www.erights.org/elib/capability/confinement.html , Alice generally
isn't in a position to constrain Mallet (or verify that he is constrained),
but she is in a position to constrain Bob (or verify that he is constrained).
If she constrains Bob to be deterministic only except for non-deterministic
inputs that she believes Mallet cannot influence, then she knows that Bob
cannot receive covert information from Mallet. Therefore, Bob cannot
receive instructions from Mallet for making use of those powers given to him
by Alice, but which Alice wishes Mallet not to exercise. Even though Bob
wishes to use these powers to do Mallet's bidding, he cannot find out what
that bidding is.
4) Debugging.
When a compiler demonstrates a bug, since you presumably have the program
which elicited the bug, you can reproduce it. Assuming your debugger
doesn't effect computation as experienced by the compiler, you should be
able, after the fact, to reproduce the bug under the debugger; even though
it's later, the bug didn't happen under the debugger, and no extra cost was
spent to instrument the compiler. This effect, normally taken for granted,
shows the power of deterministic replay at its best.
For non-deterministic computation, if we don't log non-deterministic inputs,
we can't replay past computations. Such a non-deterministic program may
drive us crazy with bugs that occasionally manifest themselves, but that we
can never catch under the debugger, because it happens to take other paths
then.
If we are willing to pay the instrumentation costs to log these inputs, then
we can deterministically replay the remainder of the computation under a
debugger, and watch a previous bug occur under the scrutiny of the debugger.
Unfortunately, these instrumentation costs can be large. Fortunately,
tricks #1 (commit) or #2 (fault tolerance) would already have us pay these
costs, because they're still cheaper than the alternative.
For example, one can combine these techniques to build a server that is
extremely resilient to its own bugs, and that really helps in finding these
bugs. (We came up with this plan originally at EC, but never came close to
implementing it.) The server process occasionally checkpoints, and also
logs all non-deterministic inputs needed to replay it from one checkpoint to
the next. It keeps at most three checkpoints on disk simultaneously. In
order from oldest to youngest:
a) the one committed to.
b) the one being checked.
c) the one being written out.
as well as the two log-sequences a..b and b..c. Unlike scenario #1, the log
doesn't let us commit at finer granularity that a full checkpoint. Rather,
while it typically allows us to recover (from a crash) at this fine
granularity, our only real commitment is to checkpoint #a. Once #b is
completely written out, it is read-in/revived in a checking process which
asks all the objects which can to check their internal consistency. If all
these objects say everything's cool, we commit to #b and throw away #a and
a..b. If we're not cool, then we save away #a, a..b, and #b for debugging,
crash the server, revive it from #a, and continue the service forward.
This time, we hope that non-determinism will allow the live service to avoid
hitting the bug again. We can even use our knowledge of where the
non-determinism lies to ensure that other correct choices are made whenever
possible, to make path divergence more likely. The very irreproducibility
of non-deterministic bugs, that's normally such a liability to robustness, is
potentially an asset for staying in operation while we figure out what's wrong.
To do this figuring out, we've got #a, which reported itself to be
consistent, or we wouldn't have committed to it. We've got #b, with a known
inconsistency. And we have the a..b log, which should allow us to
scrutinize the computation as it proceeds from an apparently consistent
state to an inconsistent one. Given the E model that non-deterministic
inputs only enter as discrete events on the event loop, and given enough
spare cycles, one could even automatically isolate, with the
self-consistency checking, the event at which inconsistency becomes evident.
Only the execution of this individual event would then need to be stepped
through manually in a debugger.
5) Contract Verification.
This is a new discovery, and is the most interesting. It will be covered in
later email.
----------------------------------------------------------------------------
Hash Tables and Free Will
Also to be covered in future email:
I don't expect to do any of these immediately, but #3 (confinement) and #5
(contract verification) do become relevant in the near future. It turns out
that #5 has a lot in common with #1 (checkpoint), #2 (fault tolerance), and
#4 (debugging), but #3 (confinement) is curiously different. To do any of
these, we need to do a good job identifying all the hidden sources of
non-determinism. I thought I had, but recently hit a rude surprise: hash
table enumeration order. Do KeyKOS and EROS also have this covert channel?
(I think so.) Was this channel identified? Are it's dangers understood?
To plug this, I came up with a scheme, refined by Ping, to make hash tables
deterministic. It's cheap but it's not free. However, it does force a
non-upwards compatible change to the determination of whether two ConstMaps
are the same. In all other ways, the proposed hash tables are upwards
compatible to the current ones. I look forward to writing this up.
Cheers,
--MarkM