[E-Lang] remote comms: Timeouts and Connection Failure

zooko@zooko.com zooko@zooko.com
Fri, 20 Apr 2001 08:23:58 -0700


[I was in the shower and I suddenly started wondering about this old message in
which MarcS had asserted that none of his E apps could be hung by a counterparty
falling silent.  Here, then, is a follow-up.  This is *not* my big essay on comms
and security.  This is just a few ideas I had just now.  --Z]


   I, Zooko, wrote the things quoted with "> > ":

 MarcS wrote:
>
> > I think that if a programmer is given a connection abstraction, then the common
> > case is definitely to give up as soon as the comms implementation says that the
> > counterparty is non-responsive (by "breaking" the "connection").  *But*,
> > I think that this is usually a mistake, because usually a higher-level
> > consideration (most often, the user) should determine the impatience policy
> > top-down.  As a thought experiment, consider how many apps could be effectively
> > hung or DoS'ed by a malicious counterparty sending all the right keepalives but
> > otherwise not doing anything.
> 
> Well, I do the thought experiment on the following E apps I have written.
> When I say, "nope, can't hang it", it means no part of the system is
> affected except the specific transactions that would be "hung" ("broken")
> if the hostile service just shut down and didn't play keepalive games:
> 
> --marketplace: nope, can't hang it
> --Edesk: nope, can't hang it
> --E web server: nope, can't hang it
> --Echat: nope, can't hang it
> --Distributed eBrowser, where outline synchronization is on a remote hostile
> computer: a hostile server can hang the outline synchronization process,
> i.e., prevent infrastructural early-detection that outlines will never be
> synchronized. You still cannot hang the system, however. In this case, the
> user implements his own impatience policy by switching to another outline
> synchronization service (this is really a crazy example, outline
> synchronization should generally be done on a machine you trust as much as
> the source-editing machine, typically the same machine :-)
> --Satan at the racetrack: nope, can't hang it (though Satan at the racetrack
> does implement a top-down impatience policy in addition to the
> infrastructure impatience policy...but as demonstrated earlier, this is easy
> in E, liverefs or no).
> --The 5-party salesman smart contract: if the trusted contract server is
> hostile, it could use this strategy to hang the world. But a hostile trusted
> contract server can destroy the world in many many different ways, some far
> more malicious and undetectable than this. For the other parties to the
> transaction, this kind of attack would have no effect that they could not
> achieve above-board simply by rejecting their part of the contracting
> relationship. The most interesting desired attack would be for the vendor to
> DoS the salesman's attempt to receive his commission. Nope, can't stop it,
> with this or any other technique.
> 
> Generally, the worst a hostile process can do using this strategy is force
> you to keep context for the hostile piece of the distributed app hanging
> around.


I was just reflecting on this letter and how surprised I was at the answers you
gave.

If this is really true, it implies to me that for *each* counterparty in *every*
app, the requirement for responsiveness is either "nothing" or "a steady stream
of unforgeable keepalives", or "a steady stream of unforgeable keepalives plus
X", where X is another impatience policy that is also implemented (in `outline
synchronization' and `Satan at the racetrack').

From your descriptions, quoted above, it seems like in most cases there *is* no
top-down requirement for a response -- either no response is expected at all or
else the top-down requirement is to wait forever until it arrives.  

From these descriptions, it sounds like outline synchronization has a top-down
impatience policy initiated (and implemented) by the user and that "Satan at the
racetrack" has a top-down impatience policy implemented by the programmer.
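A minimal sketch of what a top-down impatience policy could look like, in
Python rather than E.  The helper `wait_with_patience` and its
`patience_seconds` parameter are hypothetical names invented for this
illustration; the point is only that the caller, not the comms layer, decides
how long to wait, and that giving up does not destroy the pending response.

```python
import concurrent.futures


def wait_with_patience(response, patience_seconds):
    """Wait only as long as the caller (not the comms layer) chooses to."""
    try:
        return ("answered", response.result(timeout=patience_seconds))
    except concurrent.futures.TimeoutError:
        # The caller's patience ran out; the pending response is left intact.
        return ("gave up", None)


# A Future stands in for a reply that has not arrived yet (an assumption
# made for this sketch, not how E promises are implemented).
response = concurrent.futures.Future()
first = wait_with_patience(response, patience_seconds=0.01)

# The reply shows up after all, and can still be picked up.
response.set_result("late answer")
second = wait_with_patience(response, patience_seconds=0.01)
```

Here the first wait gives up and the second one still receives the answer,
because nothing was thrown away when the caller became impatient.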


Hm.


So I guess my next question is: if you don't need a response or you are willing
to wait forever for the response, then why do you need the steady stream of
unforgeable keepalives?

I suppose it is a heuristic -- if you *are* waiting for a response, you are
continually predicting whether you think the response will arrive or not, and
once you predict that it will not arrive (due to insufficient unforgeable
keepalives), then you give up and implement some kind of recovery procedure.
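A minimal sketch of that heuristic, in Python rather than E.  The class
`KeepaliveWatcher` and the threshold `max_silence` are hypothetical names
invented here, and times are plain numbers passed in by the caller so the
example is deterministic.  It is an illustration of prediction by keepalive
freshness, not a description of E's actual comms layer.

```python
class KeepaliveWatcher:
    """Predicts, from keepalive freshness, whether a response will arrive."""

    def __init__(self, max_silence, now):
        self.max_silence = max_silence  # seconds of silence tolerated
        self.last_keepalive = now       # creation counts as the first sign of life
        self.gave_up = False

    def on_keepalive(self, now):
        # A fresh keepalive renews the prediction "the response will arrive".
        if not self.gave_up:
            self.last_keepalive = now

    def check(self, now, on_give_up):
        # Once silence exceeds the threshold, predict "no response" and run
        # the recovery procedure exactly once.
        if not self.gave_up and now - self.last_keepalive > self.max_silence:
            self.gave_up = True
            on_give_up()
        return not self.gave_up


recovered = []
watcher = KeepaliveWatcher(max_silence=30, now=0)
watcher.on_keepalive(now=20)
still_waiting_at_45 = watcher.check(45, lambda: recovered.append("recover"))
still_waiting_at_60 = watcher.check(60, lambda: recovered.append("recover"))
```

With a keepalive at t=20, the watcher is still waiting at t=45 (25 seconds of
silence) and has given up by t=60 (40 seconds of silence).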


But I am still suspicious that this behaviour -- using keepalives to try to
predict whether the response will ever arrive, and then abandoning your `handle
response' procedure and initiating your `handle non-response' procedure if you
predict that it won't -- is not really a requirement of the user or a
requirement of the design, but is instead just being thrown in because that is
what the connection abstraction offers and that is what programmers are used to.


The most obvious objection to this impatience behaviour is that you might predict
wrongly -- if there is a transient communication failure, or if the remote Vat is
temporarily turned off, then you will subsequently refuse to accept the response
that you had previously been waiting for, even when it arrives.  (This was a
major source of performance problems in Mojo Nation for a while: the
"accidentally assumed connectiony semantics" bug to which I earlier alluded, in
which we would throw messages out if they arrived too late, even though we
actually desperately wanted the data contained within them.)
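The failure mode can be sketched as follows, in Python rather than E.  The
class `PendingRequests`, the flag `forget_on_break`, and the request id
"block-7" are all hypothetical, made up for this illustration; this is not
Mojo Nation's code.  With connection-style semantics, a "broken connection"
wipes the pending state, so a response that arrives afterwards has nowhere to
go.

```python
class PendingRequests:
    """Tracks outstanding requests under one of two impatience policies."""

    def __init__(self, forget_on_break):
        self.forget_on_break = forget_on_break
        self.pending = set()   # ids of requests still awaiting a response
        self.results = {}      # id -> data for responses that were accepted

    def send(self, req_id):
        self.pending.add(req_id)

    def connection_broken(self):
        # Connection-style policy: predict that no response will ever come
        # and abandon all state.  The patient policy keeps waiting instead.
        if self.forget_on_break:
            self.pending.clear()

    def on_response(self, req_id, data):
        if req_id not in self.pending:
            return False       # thrown away, even though we wanted the data
        self.pending.discard(req_id)
        self.results[req_id] = data
        return True


connectiony = PendingRequests(forget_on_break=True)
connectiony.send("block-7")
connectiony.connection_broken()            # transient outage
accepted_connectiony = connectiony.on_response("block-7", b"data")

patient = PendingRequests(forget_on_break=False)
patient.send("block-7")
patient.connection_broken()                # same outage
accepted_patient = patient.on_response("block-7", b"data")
```

After the same transient outage, the connection-style table rejects the late
response while the patient one accepts it.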

You can also predict wrongly in the *other* direction, thinking that the steady
stream of unforgeable keepalives signals the imminent arrival of your answer, when
in fact the answer never comes.  But since they are unforgeable keepalives and
there is a cryptographic ordering guarantee implemented which covers both the
keepalives and the messages, this can only happen if your counterparty never
*sends* the message.

This latter misprediction is the thought experiment that MarcS did, above, but 
I guess mispredicting in *this* direction doesn't have any negative consequences
in MarcS's apps.  (Anyway, it is a rare situation in which your counterparty
fails to send a message that you are expecting, but keeps sending fresh
keepalives.)


Mispredicting the other way -- spuriously abandoning your state and losing the
ability to process the response message, simply because of a few seconds of comms
interruption or a temporarily suspended Vat -- *does* have negative consequences
in MarcS's apps, I am sure, but we have all grown so used to those little
recurring annoyances known as "broken connections" that we don't even think of
them anymore.


Okay, this has rambled a bit, but my point is that the "steady stream of
unforgeable keepalives" impatience policy gets used where it probably wasn't
required and causes occasional spurious failures when mispredicting one way, but
I guess MarcS's thought experiment has convinced me that it rarely causes
problems when mispredicting the other way.


Regards,

Zooko