requirements for comms (was: Re: [E-Lang] Security Breach: Nominee for the Stock Exchange Prize)
Thu, 26 Apr 2001 12:12:38 -0700

 Bill Frantz wrote:
> I am particularly interested in requirements for impatience policies which
> are longer than the impatience policy imbedded in TCP and VatTP.  If there
> exists an important class of applications with these long timeouts, then it
> may be worth while build mechanisms to allow these relations to survive the
> failure of TCP connections.  Even in the current E, these policies can be
> built at the application level with sturdy refs (and a bunch of code for
> ordering, retransmission, and all the other stuff the comm system is
> supposed to be doing).

Yes, this is my concern also.  My intuition tells me that in a language in which

 * all objects are persistent, and
 * the programmer's abstraction of remote messages looks the same as object
   method invocation, and 
 * the delivery or non-delivery of a set of capabilities is the _only_ mechanism
   for implementing security policies,

many applications will need to force retrying, ordering, etc., of a set of
messages despite the objection of the keepalives impatience policy, which
guesses that messages probably aren't being delivered right now.

My intuition tells me that if E doesn't provide this as part of its method
invocation/comms abstraction, then I will have to re-invent it on top of E, as
you suggest above.
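To make the shape of such a re-invention concrete, here is a minimal sketch of
an application-level retry-and-ordering layer.  All the names and structure
here are my own invention, not E's API: a sender that numbers messages and
retransmits unacknowledged ones, and a receiver that delivers strictly in
sequence order.

```python
import time

class ReliableSender:
    """Hypothetical app-level retry layer: numbers outgoing messages
    and retransmits any that remain unacknowledged."""

    def __init__(self, transport, retry_interval=30.0):
        self.transport = transport        # anything with .send(seq, payload)
        self.retry_interval = retry_interval
        self.next_seq = 0
        self.unacked = {}                 # seq -> (payload, last_sent_time)

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, time.monotonic())
        self.transport.send(seq, payload)
        return seq

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def tick(self):
        """Call periodically: retransmit anything still unacknowledged."""
        now = time.monotonic()
        for seq, (payload, last) in list(self.unacked.items()):
            if now - last >= self.retry_interval:
                self.transport.send(seq, payload)
                self.unacked[seq] = (payload, now)

class OrderedReceiver:
    """Delivers messages to the application strictly in sequence order,
    buffering any that arrive early."""

    def __init__(self, deliver):
        self.deliver = deliver
        self.expected = 0
        self.pending = {}                 # seq -> payload

    def on_message(self, seq, payload):
        self.pending[seq] = payload
        while self.expected in self.pending:
            self.deliver(self.pending.pop(self.expected))
            self.expected += 1
```

This is exactly the "bunch of code for ordering, retransmission, and all the
other stuff the comm system is supposed to be doing" that Bill mentions, which
is why building it once, under the hood, is tempting.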

To my mind, MarcS's stock market application is an example of an application
that "naturally" needs persistent retrying and ordering, and MarcS's solution
was an example of reinventing persistent retrying and ordering on top of E.
(But note that his solution has a feature that an under-the-hood solution
probably would not: that you can keep only a single capability as your state,
move to a new computer, and reconnect to the stock market broker.)

> >This situation is rare if you are thinking of "connection" failures, but
> >it is a common symptom of logic errors in the sending program. Normally
> >keepalives are sent by the low-level messaging code, while the message
> >you're expecting is coming from the application. If the application gets
> >stuck in a loop, or misinterprets the request as something that needs no
> >reply, you'll get your keepalives but not your message. Maybe I write
> >particularly bad code, but I see this behavior much more frequently than
> >an undetected lost connection.
> It seems to me that in the case where the application classifies the
> request as "no reply needed", that the application would continue to
> respond to application level keep alives.  There is probably no solution
> short of application level timeouts, and getting those right is hard, as
> Zooko notes.  (The development history of TCP shows quite a bit to timeout
> tuning too.)

Bill: I don't understand this paragraph.  Alan had suggested that a common
failure mode is that the E runtime continues to operate correctly and to
deliver unforgeable proven-fresh keepalives, but that the E code running on top
of it has either gotten hung or has forgotten that it was supposed to respond.
In that case, the "keepalives impatience policy" will not raise an alarm, but
the desired response will never come.  This is how the keepalives policy can
falsely predict a response which never comes, and normal "connection breaking"
exceptions are how the keepalives policy can falsely predict failure.

> How much of the problems in Mojo Nation were due to "too short" timeouts?

It is an interesting question.  Almost everything that Mojo Nation does can be
done with some subset of the available counterparties.  So if you need
responses from K different counterparties, and you have sent queries to K of
them, and then you get K-1 responses back, and then time passes, is it better
to keep waiting in case the Kth response arrives, or to send a query to a new,
(K+1)th counterparty?  If you do the former, you are waiting.  If you do the
latter, you may end up congesting your own connection with unnecessary extra
queries and responses.

One important step forward was to separate the question of "when do we initiate
a replacement operation?" from "when do we decide not to process the
response?".  In other words, there is no need to ignore response K, should it
arrive, just because we have already sent out query K+1.
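A minimal sketch of that separation (the names and structure are mine, not
Mojo Nation's): a timeout only triggers querying one more counterparty, and a
late response is still accepted whenever it arrives.

```python
import time

class QueryFanout:
    """Hypothetical sketch: collect K responses from whichever
    counterparties answer.  A timeout launches a replacement query;
    it never marks an outstanding query as dead."""

    def __init__(self, counterparties, needed, send_query,
                 replacement_timeout=10.0):
        self.pool = list(counterparties)   # not-yet-queried counterparties
        self.needed = needed               # K responses required
        self.send_query = send_query
        self.replacement_timeout = replacement_timeout
        self.responses = {}                # counterparty -> response
        self.last_progress = time.monotonic()

    def start(self):
        for _ in range(self.needed):       # initial fanout: K queries
            self._query_next()

    def _query_next(self):
        if self.pool:
            self.send_query(self.pool.pop(0))

    def on_response(self, counterparty, response):
        # A late response is still useful -- no timer ever drops it.
        self.responses[counterparty] = response
        self.last_progress = time.monotonic()

    def tick(self):
        """Launch a replacement query if we've waited too long,
        without giving up on any query already in flight."""
        if (len(self.responses) < self.needed and
                time.monotonic() - self.last_progress >
                self.replacement_timeout):
            self._query_next()
            self.last_progress = time.monotonic()

    def done(self):
        return len(self.responses) >= self.needed
```

The point of the sketch is that there is no code path that refuses a response:
"give up on this response" would be a separate, resource-reclamation decision.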

Now originally we unified launching a replacement operation with "from now on we
will ignore this response, should it ever arrive".  That is because we were
instinctively thinking that by "closing" the Kth "connection", we could prevent
the Kth response from arriving and thus conserve resources.  But as a matter of
fact, we can't prevent the Kth response from arriving, we can only choose to
drop it on the floor after we have read the whole thing in over the wire, and
decrypted and verified it.  This is because our messages can travel over
"relay servers" as well as over TCP (and perhaps in the future over HTTP, UDP,
etc.), and the relay servers do not offer any "connection management"
features; they just offer simple store-and-forward of messages.

Anyway, you asked how much of the problems in Mojo Nation were due to "too
short" timeouts, and my answer is that the problem was that timeouts triggered
two separate things, so they were always either too short or too long.  Now that
we launch replacement operations separately from "refuse to ever process this
response, should it arrive", I want the former to happen at a time dynamically
chosen to be 97% sure that it would have arrived before now if it were coming
and I want the latter to be "Whenever we need to reclaim some memory
resources.".  So the current hard-coded "60 seconds" timeout is currently too
long for the former and too short for the latter.
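That "dynamically chosen" replacement time could, for instance, come from the
observed distribution of past response latencies.  A sketch (the 97% figure
and the 60-second fallback are from the text above; everything else is my own
assumption):

```python
import bisect

class LatencyEstimator:
    """Tracks observed response latencies and answers: by what time
    would 97% of responses have arrived, if one were coming at all?"""

    def __init__(self, quantile=0.97, fallback=60.0, max_samples=1000):
        self.quantile = quantile
        self.fallback = fallback       # used until we have any data
        self.max_samples = max_samples
        self.samples = []              # kept sorted ascending

    def record(self, latency):
        bisect.insort(self.samples, latency)
        if len(self.samples) > self.max_samples:
            self.samples.pop()         # drop the largest outlier

    def replacement_timeout(self):
        """Latency below which `quantile` of observed responses fell."""
        if not self.samples:
            return self.fallback
        i = min(int(self.quantile * len(self.samples)),
                len(self.samples) - 1)
        return self.samples[i]
```

The memory-reclamation decision, by contrast, needs no estimator at all: it
can simply evict pending-response state whenever resources run low.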