5 Uses of Determinism Mark S. Miller (markm@caplet.com)
Thu, 06 May 1999 03:57:42 -0700

The Turing and Von Neumann models of computation are strictly deterministic
(as is that other Von Neumann model, cellular automata). In all of these
systems, given the initial state of a computation and the specification of the machine it ran on, one can necessarily deterministically replay the computation.

Of course, this is true only because these models idealize-away many of the messy issues that plague real computation. Actors are an alternative foundational model of computation that idealizes-in many of the issues that these systems idealize-away.

The first such issue, explored by many besides Actors, is the contrast of "transformational" vs "reactive" computation. A compiler is transformational: all its inputs exist before it begins, it runs deterministically producing output, and terminates. A compiler that doesn't terminate is buggy.

A calculator is reactive: new inputs come in after it's started, these inputs effect both its state and its outputs. Its outputs may effect what further inputs it gets, so these outputs need to be transmitted before the calculator terminates. We do not require the calculator to terminate in order to be correct. Indeed, for a calculator, the opposite is true. Even though you can replay a calculator from a log of its inputs, you cannot model a calculator as something that computes from pre-existing inputs, since the inputs depend (via the external human) on previous outputs.

Actors are inherently distributed and asynchronous, and so are pervasively non-deterministic. Actors successfully idealize all non-determinism as "arrival order non-determinism": Messages race towards an actor from multiple asynchronous sources. As Einstein would insist, these messages had no prior determined intrinsic full order. Nevertheless, they are delivered to the recipient actor in some *partially arbitrary* fully serialized order. In order to replay an actor, you not only need the messages sent to it by the external world, you need to know how these races were resolved.

An E Vat is a lot like an Actor, and at Vat-to-Vat granularity, E computation is a lot like Actor computation. Within a Vat we have approximately the singly threaded synchronous deterministic world of Von Neumann. It seems this is a good enough approximation that we could, under constrained but useful circumstances, arrange to segregate all the non-deterministic inputs to a Vat (or a subgraph of objects within a Vat) such that the remainder of the Vat's computation is deterministically replayable from these non-deterministic inputs. Why might we want to *eventually* do this?

  1. Cheaper Commitment.

Databases often use this technique by writing a "transaction log". Once a transaction request has been successfully logged, the transaction can be committed even though a crash may destroy its result state, given that the transaction can be deterministically reproduced from the logged request. I first saw this technique applied to committing general purpose computation by Rob Strom: Do a full checkpoint occasionally, and cheaply log inputs until you take the next checkpoint.

It's not clear whether this wins over KeyKOS/EROS mmu-based checkpointing, since that's also incremental. But is sure beats the crap out of non-incremental checkpointing, which is the only reasonable alternative for those sitting above the jvm.

2) Cheaper Fault Tolerance.

This is really a variation of #1, where the understudy is separated in space rather than time. I first saw this technique applied by Auragen (which since disappeared), and since rather brilliantly by Rob Strom.

3) Inward Bit Confinement.

Once we've plugged all means of communication accounted for by causality in our computational-model (or system-specification), all remaining ability to communicate must ride on informational inputs that are modelled as uncaused ("spontaneous"?). These are exactly the non-deterministic inputs to computation. In the confinement scenario http://www.erights.org/elib/capability/confinement.html , Alice generally isn't in a position to constrain Mallet (or verify that he is constrained), but she is in a position to constrain Bob (or verify that he is constrained).

If she constrains Bob to be deterministic only except for non-deterministic inputs that she believes Mallet cannot influence, then she knows that Bob cannot receive covert information from Mallet. Therefore, Bob cannot receive instructions from Mallet for making use of those powers given to him by Alice, but which Alice wishes Mallet not to exercise. Even though Bob wishes to use these powers to do Mallet's bidding, he cannot find out what that bidding is.

4) Debugging.

When a compiler demonstrates a bug, since you presumably have the program which elicited the bug, you can reproduce it. Assuming your debugger doesn't effect computation as experienced by the compiler, you should be able, after the fact, to reproduce the bug under the debugger; even though it's later, the bug didn't happen under the debugger, and no extra cost was spent to instrument the compiler. This effect, normally taken for granted, shows the power of deterministic replay at its best.

For non-deterministic computation, if we don't log non-deterministic inputs, we can't replay past computations. Such a non-deterministic program may drive us crazy with bugs that occasionally manifest themselves, but that we can never catch under the debugger, because it happens to take other paths then.

If we are willing to pay the instrumentation costs to log these inputs, then we can deterministically replay the remainder of the computation under a debugger, and watch a previous bug occur under the scrutiny of the debugger. Unfortunately, these instrumentation costs can be large. Fortunately, tricks #1 (commit) or #2 (fault tolerance) would already have us pay these costs, because they're still cheaper than the alternative.

For example, one can combine these techniques to build a server that is extremely resilient to its own bugs, and that really helps in finding these bugs. (We came up with this plan originally at EC, but never came close to implementing it.) The server process occasionally checkpoints, and also logs all non-deterministic inputs needed to replay it from one checkpoint to the next. It keeps at most three checkpoints on disk simultaneously. In order from oldest to youngest:

  1. the one committed to.
  2. the one being checked.
  3. the one being written out.

as well as the two log-sequences a..b and b..c. Unlike scenario #1, the log doesn't let us commit at finer granularity that a full checkpoint. Rather, while it typically allows us to recover (from a crash) at this fine granularity, our only real commitment is to checkpoint #a. Once #b is completely written out, it is read-in/revived in a checking process which asks all the objects which can to check their internal consistency. If all these objects say everything's cool, we commit to #b and throw away #a and a..b. If we're not cool, then we save away #a, a..b, and #b for debugging, crash the server, revive it from #a, and continue the service forward.

This time, we hope that non-determinism will allow the live service to avoid hitting the bug again. We can even use our knowledge of where the non-determinism lies to ensure that other correct choices are made whenever possible, to make path divergence more likely. The very irreproducibility of non-deterministic bugs, that's normally such a liability to robustness, is potentially an asset for staying in operation while we figure out what's wrong.

To do this figuring out, we've got #a, which reported itself to be consistent, or we wouldn't have committed to it. We've got #b, with a known inconsistency. And we have the a..b log, which should allow us to scrutinize the computation as it proceeds from an apparently consistent state to an inconsistent one. Given the E model that non-deterministic inputs only enter as discrete events on the event loop, and given enough spare cycles, one could even automatically isolate, with the self-consistency checking, the event at which inconsistency becomes evident. Only the execution of this individual event would then need to be stepped through manually in a debugger.

5) Contract Verification.

This is a new discovery, and is the most interesting. It will be covered in later email.


Hash Tables and Free Will

Also to be covered in future email:

I don't expect to do any of these immediately, but #3 (confinement) and #5
(contract verification) do become relevant in the near future. It turns out
that #5 has a lot in common with #1 (checkpoint), #2 (fault tolerance), and #4 (debugging), but #3 (confinement) is curiously different. To do any of these, we need to do a good job identifying all the hidden sources of non-determinism. I thought I had, but recently hit a rude surprise: hash table enumeration order. Do KeyKOS and EROS also have this covert channel?
(I think so.) Was this channel identified? Are it's dangers understood?

To plug this, I came up with a scheme, refined by Ping, to make hash tables deterministic. It's cheap but it's not free. However, it does force a non-upwards compatible change to the determination of whether two ConstMaps are the same. In all other ways, the proposed hash tables are upwards compatible to the current ones. I look forward to writing this up.

	Cheers,
	--MarkM