[cap-talk] Session failures

Jed Donnelley jed at nersc.gov
Fri Mar 14 14:02:58 EDT 2008


On 3/14/2008 9:28 AM, Jonathan S. Shapiro wrote:
> On Fri, 2008-03-14 at 09:14 -0700, Jed Donnelley wrote:
>> <right here lies the Two Generals problem:
>> http://en.wikipedia.org/wiki/Two_Generals'_Problem
>>
>> which I think helps to clarify the difficulty, indeed
>> the General insolubility.>
> 
> Recovering consistency from independently taken snapshots turns out to
> be solvable. In the case of checkpoints, Lamport's work on causal
> ordering provides the formal basis for the solution, and Jefferson's
> work on Time Warp inspired the concrete design.
> 
> There is a global causal ordering of epochs across the distributed
> system that is induced by the transmission and arrival of messages
> between participating nodes. If this graph is explicitly maintained,
> then rollback and replay can be used to cope with localized failure.
> 
> One can imagine such a system as having a leading temporal edge defined
> by new computation and a trailing temporal edge behind which no epoch is
> hazarded by rollback because all dependencies have been committed.
> 
> Aside from having a system in which this fact could be usefully
> exploited in practice, the only interesting point in my approach was to
> note that in the case of distributed checkpoint, it was possible to
> guarantee that the rate of advance of the two edges can be made to be
> correlated, and that unbounded rollback can therefore be prevented.
> There were also a few tricks for collapsing epochs to reduce the amount
> of checkpointing state required to maintain the scheme.
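
A minimal sketch of the epoch dependency graph described in the quoted
text, to make the rollback and "trailing edge" ideas concrete. Everything
named here (EpochGraph, new_epoch, record_message, rollback_set, is_safe,
and the (node, index) representation of an epoch) is a hypothetical
illustration, not the design from the quoted message. It assumes that an
epoch depends on its predecessor on the same node and on any epoch it
received a message from, and it leaves out epoch collapsing and the
bounded-rollback guarantee entirely.

```python
from collections import defaultdict


class EpochGraph:
    """Hypothetical causal-dependency graph over (node, index) epochs."""

    def __init__(self):
        self.latest = {}                    # node -> index of its current epoch
        self.depends_on = defaultdict(set)  # epoch -> epochs it depends on
        self.dependents = defaultdict(set)  # epoch -> epochs depending on it
        self.committed = set()              # epochs with a durable checkpoint

    def _add_edge(self, earlier, later):
        self.depends_on[later].add(earlier)
        self.dependents[earlier].add(later)

    def new_epoch(self, node):
        """Start the next epoch on a node; it depends on the previous one."""
        n = self.latest.get(node, 0) + 1
        self.latest[node] = n
        if n > 1:
            self._add_edge((node, n - 1), (node, n))
        return (node, n)

    def record_message(self, sender_epoch, receiver_epoch):
        """A message arrival makes the receiving epoch depend on the sender's."""
        self._add_edge(sender_epoch, receiver_epoch)

    def _reach(self, start, edges):
        # Depth-first walk collecting every epoch reachable from start.
        seen, stack = set(), [start]
        while stack:
            e = stack.pop()
            if e not in seen:
                seen.add(e)
                stack.extend(edges[e])
        return seen

    def rollback_set(self, failed_epoch):
        """Epochs to undo if failed_epoch is lost: it and all its dependents."""
        return self._reach(failed_epoch, self.dependents)

    def is_safe(self, epoch):
        """True if the epoch and everything it depends on are committed,
        i.e. it lies behind the trailing edge and cannot be rolled back."""
        return self._reach(epoch, self.depends_on) <= self.committed


g = EpochGraph()
a1 = g.new_epoch("A")
b1 = g.new_epoch("B")
g.record_message(a1, b1)   # B's first epoch consumed a message from A's
b2 = g.new_epoch("B")
print(sorted(g.rollback_set(a1)))
```

Losing A's first epoch here forces B's first and second epochs to be
undone as well, while losing only B's second epoch affects nothing else.
An epoch stops being hazarded once it and all of its dependencies are in
the committed set.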

Out of curiosity, how is it that, despite checkpoint/recovery
mechanisms of the sort described above, our two Generals still
can't agree on whether or not to attack?  (Perhaps this is better
taken off the list?  Or maybe there's a reference you can point
me to?)

--Jed  http://www.webstart.com/jed/


