[E-Lang] remote comms: Timeouts and Connection Failure

zooko@zooko.com
Sat, 21 Apr 2001 11:38:10 -0700


 Alan Karp wrote:
>
> We are in agreement.  Keepalives are good for turning potential silent
> failures from NAPs into BURPs, and I am not aware of any harm, other than
> resource consumption, from using keepalives.


Actually I consider keepalives to be a tradeoff -- they yield earlier and more
aggressive prediction of failure at the cost of more false alarms and more
network usage.

But the thing that I *really* don't like about them isn't the false alarms or
the network usage, which we all understand pretty well already, but how they
make it *less* likely that an app handles comm failures safely.

The thing is that keepalives can *not* eliminate all NAPs.  If they could, you
could have simpler code with more tractable failure modes.  Instead they
eliminate only *some* NAPs, which means you end up with *more* failure modes
(the BURP handler and the NAP handler), and the only really reliable one (NAP
handler == locally governed top-down timebombs) is the one more likely to be
neglected because it rarely gets exercised in normal operation.  In fact, in
many cases programmers who are using a "connection" abstraction will neglect the
need to handle NAPs entirely.  My intuition was that this would be the case in
MarcS's E apps, but apparently MarcS was smart enough to recognize the two
situations in which NAPs could do harm and implement a local impatience policy
to handle it.
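
To make the two failure modes concrete, here is a minimal sketch in Python
(not E).  The names Nap, Burp, call_with_timebomb, silent_peer and broken_peer
are all hypothetical, invented for illustration.  It assumes the comm layer
surfaces a BURP as an exception and that a NAP is simply a reply that never
arrives.

```python
import asyncio


class Nap(Exception):
    """Raised locally when our own deadline expires with no answer at all."""


class Burp(Exception):
    """Raised when the comm layer itself reports that the connection broke."""


async def call_with_timebomb(send_request, deadline_seconds):
    """Await a reply, but never longer than a locally chosen deadline.

    send_request is a hypothetical zero-argument coroutine function that
    sends a message and awaits the reply.  It may raise Burp, or it may
    simply never return (a NAP).
    """
    try:
        return await asyncio.wait_for(send_request(), timeout=deadline_seconds)
    except asyncio.TimeoutError:
        # The local timebomb went off: the far end neither answered nor failed.
        raise Nap(f"no reply within {deadline_seconds}s") from None


async def silent_peer():
    # Simulates a NAP: the reply never arrives and no error is reported.
    await asyncio.Event().wait()


async def broken_peer():
    # Simulates a BURP: the comm layer notices the failure and says so.
    raise Burp("connection lost")


def outcome(peer, deadline_seconds=0.05):
    """Run one call against a simulated peer and name how it ended."""
    try:
        asyncio.run(call_with_timebomb(peer, deadline_seconds))
        return "reply"
    except Burp:
        return "burp"
    except Nap:
        return "nap"
```

Note that the caller needs both except branches.  The keepalive only changes
how often the first one fires; it does not remove the need for the second.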


So, cutnpasting from an earlier letter of mine, here are the conditions in which
a keepalive impatience policy is the right choice:

#1. I want to predict non-response as soon as possible, even at the cost of
    false alarms.

#2. False negatives (predicting a message which never comes) are okay because:
    A. I trust the counterparty never to do this to me.
     or
    B. There is an additional, local, impatience policy at work which will
     handle this case.

#3. I'm willing to spend the network and computational resources to do
    unforgeable proven-fresh keepalives.
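
This letter does not say how an "unforgeable proven-fresh" keepalive would be
built.  One common construction, assumed here purely for illustration, is a
challenge-response over a key shared by the two parties: the checker sends a
fresh random nonce and accepts only a MAC over that exact nonce.  The function
names below are hypothetical; the sketch is in Python, not E.

```python
import hashlib
import hmac
import secrets


def new_challenge():
    """Pick a fresh random nonce; a reply to an older nonce proves nothing."""
    return secrets.token_bytes(16)


def answer_challenge(shared_key, nonce):
    """Counterparty side: prove knowledge of the key for this nonce."""
    return hmac.new(shared_key, nonce, hashlib.sha256).digest()


def is_fresh_and_genuine(shared_key, nonce, response):
    """Checker side: accept only the correct tag for the nonce just sent."""
    expected = hmac.new(shared_key, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A replayed answer to an earlier nonce fails the check, which is the "fresh"
part; an answer computed without the key fails it too, which is the
"unforgeable" part.  Both cost a round trip and a MAC computation per
keepalive, which is the resource cost named in #3.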


I think we are all aware of the issues of false alarms and of resource
consumption, so I would like to especially draw attention to issue #2 -- false
negatives.  It seems to me that if keepalive is the default policy, and a local
top-down impatience policy has to be implemented "manually", then a programmer
may ignorantly violate requirement #2.

This was why I suggested the thought experiment of "what if I started getting
NAPs instead of BURPs, would that hang my app?".  MarcS performed this
experiment for his apps and reported back that in most cases it wouldn't hurt at
all and in the two cases where it might hurt there was also a
locally-implemented impatience policy to handle the NAP.
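
The thought experiment can also be run mechanically.  Here is a rough sketch
in Python (not E): swap the real connection for one that never answers and
never reports failure, then see whether the app is still stuck after a while.
SilentChannel, naive_app, careful_app and hangs_under_nap are hypothetical
names, and the receive(timeout) interface is an assumption of the sketch.

```python
import threading


class SilentChannel:
    """Hypothetical stand-in for a connection that NAPs: it never delivers
    a reply and never reports a failure."""

    def __init__(self):
        self._never_set = threading.Event()

    def receive(self, timeout=None):
        # Blocks until the timeout expires (or forever if timeout is None).
        # Returns None to mean "nothing arrived".
        self._never_set.wait(timeout)
        return None


def naive_app(channel):
    # Relies on the comm layer to report failure; has no local deadline.
    return channel.receive()


def careful_app(channel):
    # Has its own impatience policy, independent of any keepalive.
    reply = channel.receive(timeout=0.05)
    return "gave up" if reply is None else reply


def hangs_under_nap(app, patience_seconds=0.5):
    """Run app against a silent channel; True if it is still stuck afterwards."""
    worker = threading.Thread(target=app, args=(SilentChannel(),), daemon=True)
    worker.start()
    worker.join(patience_seconds)
    return worker.is_alive()
```

Here hangs_under_nap(naive_app) reports a hang and hangs_under_nap(careful_app)
does not.  The worker is a daemon thread so that a hung app does not keep the
test process alive.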

I was surprised at how little unhandled NAPs were able to hurt his apps, but
perhaps this is simply because he is a clueful enough programmer that he
designed the apps to handle NAPs.  I also questioned whether he needed
keepalives at all -- if a NAP didn't hurt, then what was the point of the
keepalives?  My intuition is that you need keepalives if and only if you also
need to handle NAPs.  If that were true, then all comms that MarcS's apps do
which are immune to NAPs would work just as well without keepalives.

MarcS also reported that in the Edesk app requirement #1 was very strong --
you want to predict failure as early as possible so that the user sees that the
desk is unavailable before he tries to use it.  I strongly agree that this
user-driven requirement exists for this kind of app and that we should support
it as much as possible.

(Although you have to be careful about what the user perceives when he has sent
a command message *just* before the keepalive policy announces failure.  I
queried MarcS about this on the phone and it sounded like he used the same
technique as with the stock market -- the user initiates "reconnect", does a
state-check, and then decides whether to re-attempt whatever it was that they
were attempting.)
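
As I understand the technique, it goes roughly like the following Python
sketch (not MarcS's actual code, which I have not seen).  FakeDesk, recover,
the command identifiers and the has_applied query are all hypothetical; the
sketch assumes the remote object can be asked whether a given command took
effect.

```python
class FakeDesk:
    """Hypothetical remote desk; records every command that took effect."""

    def __init__(self):
        self.log = []

    def apply(self, command_id):
        self.log.append(command_id)

    def has_applied(self, command_id):
        return command_id in self.log


def recover(reconnect, command_id, user_wants_retry):
    """Reconnect, check state, then let the user decide about a re-attempt.

    reconnect is a hypothetical callable returning a fresh reference to the
    desk; user_wants_retry is a callable asked only if the command is
    found not to have taken effect.
    """
    desk = reconnect()
    if desk.has_applied(command_id):
        return "already done"      # re-sending would apply it twice
    if user_wants_retry():
        desk.apply(command_id)
        return "retried"
    return "abandoned"
```

The point of the state-check is that a command sent just before the failure
announcement may or may not have arrived, and only the far end's state can
say which.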


Now, I completely agree that for some applications this tradeoff is a win.  As
long as the value of earlier and more aggressive prediction outweighs the costs
(most subtly, the cost of the programmer having to solve issue #2), then it
would clearly be better to have keepalives than not.


However, it may be that there is an abstraction that we can offer the programmer
which could minimize or eliminate the problems raised in issue #2.


I am glad that we are discussing this subject.  I think that it is extremely
important and that there are non-obvious but important consequences of the
interactions between comms, security, persistence, and capability UI.  I think
there is a lot of meat left to this subject and that we are *not* anywhere near
the point of "conversational diminishing returns" where there is nothing left to
say except to restate our respective preferences.


I have been working for many days on an essay contrasting security properties of
various "retrying" techniques.  It seems like I can't catch up to the
conversation!  Notably, MarcS's and MarkM's discovery of the "put a revokable
proxy in front" technique deserves to be included in the list of
techniques.
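
For readers who have not seen the pattern, here is a minimal sketch of a
revokable forwarder in general, written in Python rather than E.  It is an
assumption on my part that this matches what MarcS and MarkM have in mind,
and the names make_revocable and Revoked are hypothetical.  Python is not
capability-secure, so this shows the shape of the idea only.

```python
class Revoked(Exception):
    """Raised when a message is sent through a proxy that has been revoked."""


def make_revocable(target):
    """Return (proxy, revoke).  The proxy forwards attribute lookups to
    target until revoke() is called; after that every lookup fails."""
    state = {"target": target}

    class Proxy:
        def __getattr__(self, name):
            current = state["target"]
            if current is None:
                raise Revoked("this proxy has been revoked")
            return getattr(current, name)

    def revoke():
        # Drop the only path from the proxy to the real object.
        state["target"] = None

    return Proxy(), revoke
```

One way this might bear on retrying, again as an assumption: hand out a fresh
proxy per attempt and revoke it when you give up, so that a message from the
abandoned attempt that arrives late cannot take effect.  A caveat of this
Python version is that a bound method fetched before revoke() keeps working
afterwards.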


Regards,

Zooko