[cap-talk] Session failures
Jed Donnelley
jed at nersc.gov
Fri Mar 14 18:47:24 EDT 2008
On 3/14/2008 11:16 AM, Jonathan S. Shapiro wrote:
> On Fri, 2008-03-14 at 11:02 -0700, Jed Donnelley wrote:
>> For my curiosity, how is it that despite the above sorts
>> of checkpoint/recovery mechanisms our two Generals still
>> can't agree whether or not to attack? (perhaps better
>> off the list? Maybe there's a reference you can point
>> me to?)
>
> Short answer: it's because the problem constraints aren't quite the
> same. The checkpoint/restart mechanism avoids the problem by eliminating
> any requirement for synchronized agreement.

Thanks. I guess I need to understand more about the problem
that the checkpoint/restart mechanism does solve. Perhaps
a better step to resolving my confusion in this area would be
to look at distributed commit. I haven't had time/opportunity
to really come to grips with how distributed commit can work
when synchronized agreement can't.

It's always difficult to know how much time to spend looking
into such nagging/dangling topics when other commitments
press. If I could be confident of coming to an understanding
of this topic in, say, half a day's reading, I'd be happy to
make that investment. Think that's possible? Anybody have
any suggested reading? I think I'd be content understanding
how the simplest sort of two part 'agreement' protocol can
work in the face of the Two Generals Problem - which I am
only too familiar with (having been a "General" once...).

There are some pretty simple things that it seems can't
be achieved, such as just knowing that a message got delivered.
You may get an ack and you may not. Retransmission may
help and it may not. You just have to live with and deal
with that possible ambiguity. Dealing with that fundamental
ambiguity was a common theme in our NLTSS system coding
(this much got sent and acked. This much got sent and not
acked. What would you like to do General sir?).
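
To make the sent/acked bookkeeping concrete, here is a minimal sketch
in Python. It is not NLTSS code; the names LossyLink and send_all, the
scripted loss pattern, and the retry limit are all assumptions made up
for illustration. It shows a sender retransmitting over a link that can
lose either the message or the ack, then reporting what it knows.

```python
from dataclasses import dataclass, field


@dataclass
class LossyLink:
    """Hypothetical channel that loses traffic according to a fixed script."""
    drop_message: set = field(default_factory=set)  # attempt numbers whose message is lost
    drop_ack: set = field(default_factory=set)      # attempt numbers whose ack is lost
    attempts: int = 0
    delivered: list = field(default_factory=list)   # what the receiver actually got

    def send(self, chunk):
        """Return True only if an ack makes it back to the sender."""
        self.attempts += 1
        n = self.attempts
        if n in self.drop_message:
            return False                 # message never arrived
        self.delivered.append(chunk)     # message arrived...
        return n not in self.drop_ack    # ...but its ack may still be lost


def send_all(link, chunks, max_tries=2):
    """Send each chunk with retransmission; report acked and unacked chunks."""
    acked, unacked = [], []
    for chunk in chunks:
        # any() stops retrying as soon as one ack comes back.
        if any(link.send(chunk) for _ in range(max_tries)):
            acked.append(chunk)
        else:
            unacked.append(chunk)
    return acked, unacked


# "a" is acked at once; "b" arrives twice but both acks are lost;
# "c" never arrives at all.
link = LossyLink(drop_message={4, 5}, drop_ack={2, 3})
acked, unacked = send_all(link, ["a", "b", "c"])
print("sent and acked:    ", acked)
print("sent and not acked:", unacked)
print("receiver saw:      ", link.delivered)
```

From the sender's side "b" and "c" look identical: both end up in the
not-acked list, although "b" was delivered (twice) and "c" was not.
Retransmission turned neither case into certainty, and in the "b" case
it also left the receiver with a duplicate to deal with.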
--Jed http://www.webstart.com/jed/