[E-Lang] remote comms: Timeouts and Connection Failure

zooko@zooko.com
Fri, 20 Apr 2001 08:59:35 -0700

 Alan Karp wrote:
> Zooko wrote:
> >  (Anyway, it is a rare situation in which your counterparty
> >  fails to send a message that you are expecting, but keeps sending fresh
> >  keepalives.)
> This situation is rare if you are thinking of "connection" failures, but it
> is a common symptom of logic errors in the sending program.  Normally
> keepalives are sent by the low-level messaging code, while the message
> you're expecting is coming from the application.  If the application gets
> stuck in a loop, or misinterprets the request as something that needs no
> reply, you'll get your keepalives but not your message.  Maybe I write
> particularly bad code, but I see this behavior much more frequently than an
> undetected lost connection.

Good point, Alan.  There *is* an insidious failure mode here, where the
programmer's actual top-down requirement is "get the answer within 5 minutes or
else initiate this replacement operation", and the programmer implements this by
depending on the keepalive impatience policy, and every time he or anyone else
tests it, it works -- either the counterparty sends the answer right away, or
else comms failure or counterparty failure means that the keepalives stop.

Then the bug can bite when the counterparty fails to send the answer but the
keepalives keep coming.
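
To make that failure mode concrete, here is a small simulation sketch in Python.
It is not E code and not anyone's actual implementation; the function names, the
30-second silence threshold, and the timestamps are all assumptions made up for
illustration.  It compares a policy that gives up only when keepalives go quiet
with a policy that enforces the 5-minute requirement directly.

```python
from typing import Optional, Sequence


def keepalive_policy_gives_up_at(keepalive_times: Sequence[float],
                                 max_silence: float,
                                 horizon: float) -> Optional[float]:
    """Time at which a keepalive-only policy gives up, or None if it never
    does within `horizon`.  (Hypothetical helper, for illustration only.)"""
    last_heard = 0.0
    for t in sorted(keepalive_times):
        if t > horizon:
            break
        if t - last_heard > max_silence:
            # The silence threshold expired before this keepalive arrived.
            return last_heard + max_silence
        last_heard = t
    if horizon - last_heard > max_silence:
        return last_heard + max_silence
    return None


def deadline_policy_gives_up_at(answer_time: Optional[float],
                                deadline: float) -> Optional[float]:
    """Time at which a fixed-deadline policy gives up, or None if the answer
    arrived in time.  (Hypothetical helper, for illustration only.)"""
    if answer_time is None or answer_time > deadline:
        return deadline
    return None
```

With keepalives every 10 seconds for 10 minutes and no answer at all,
`keepalive_policy_gives_up_at(range(10, 601, 10), 30, 600)` returns `None`,
while `deadline_policy_gives_up_at(None, 300)` returns `300`.  If the keepalives
stop after 100 seconds, the keepalive policy does give up (at 130), which is why
ordinary testing tends to make it look correct.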

This was the kind of failure that I was alleging would show up if MarcS imagined
what would happen when the counterparty stopped sending messages.  It is the
"keepalive heuristic mispredicts, guessing that the answer is imminent but the
answer never comes" problem.  MarcS and I were originally thinking of it only
in the context of a malicious counterparty, but Alan's point is convincing: this
kind of misbehaviour seems like a common enough failure mode.


It boils down to:

  If you use the `stream of unforgeable keepalives' impatience policy, you are
  depending on the behaviour of your counterparty to implement your impatience
  policy.  This may or may not be correct behaviour from the point of view of
  your actual top-down requirements.
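
One way to stop depending on the counterparty for this is to put the deadline in
your own code.  The following is only a sketch using Python's asyncio rather than
E, and `ask_with_deadline`, `stuck_counterparty`, and `start_replacement` are
hypothetical names invented for the example.  The stuck counterparty stands in
for Alan's case: its low-level loop stays alive, but the application-level
answer never arrives.

```python
import asyncio


async def ask_with_deadline(request, deadline_s, fallback):
    """Await `request`; if no answer arrives within `deadline_s` seconds,
    cancel it and run `fallback` instead, whatever the keepalives say."""
    try:
        return await asyncio.wait_for(request, timeout=deadline_s)
    except asyncio.TimeoutError:
        return await fallback()


async def stuck_counterparty():
    """Hypothetical counterparty that stays 'alive' but never answers."""
    while True:
        await asyncio.sleep(0.01)  # stands in for a keepalive going out


async def start_replacement():
    """Hypothetical replacement operation."""
    return "replacement started"
```

Calling `asyncio.run(ask_with_deadline(stuck_counterparty(), 0.05, start_replacement))`
runs the replacement operation once the local deadline passes.  A keepalive
check could still be layered underneath to detect dead connections sooner; the
point is only that the top-down requirement is enforced locally.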