requirements for comms (was: Re: [E-Lang] Security Breach: Nominee for the Stock Exchange Prize)

Bill Frantz frantz@pwpconsult.com
Fri, 27 Apr 2001 17:55:22 -0700


At 12:12 PM -0700 4/26/01, zooko@zooko.com wrote:
> Bill Frantz wrote:
>> >This situation is rare if you are thinking of "connection" failures, but
>> >it is a common symptom of logic errors in the sending program. Normally
>> >keepalives are sent by the low-level messaging code, while the message
>> >you're expecting is coming from the application. If the application gets
>> >stuck in a loop, or misinterprets the request as something that needs no
>> >reply, you'll get your keepalives but not your message. Maybe I write
>> >particularly bad code, but I see this behavior much more frequently than
>> >an undetected lost connection.
>>
>> It seems to me that in the case where the application classifies the
>> request as "no reply needed", that the application would continue to
>> respond to application level keep alives.  There is probably no solution
>> short of application level timeouts, and getting those right is hard, as
>> Zooko notes.  (The development history of TCP shows quite a bit of timeout
>> tuning too.)
>
>Bill: I don't understand this paragraph.  Alan had suggested that a common
>failure mode is that the E runtime continues to operate correctly and to
>deliver unforgeable proven-fresh keepalives, but that the E code running
>on top of it has either gotten hung or has forgotten that it was supposed to
>respond.
>In that case, the "keepalives impatience policy" will not raise an alarm, but
>the desired response will never come.  This is how the keepalives policy can
>falsely predict a response which never comes, and normal "connection breaking"
>exceptions are how the keepalives policy can falsely predict failure.

At one point, application level keep-alives were suggested as an
alternative to system level ones.  I was showing that application level
keep-alives probably would have the same problem as system level ones in
the case where the requestor expected a response, and the requestee decided
that no response was necessary.


>> How much of the problems in Mojo Nation were due to "too short" timeouts?
>
>It is an interesting question.  Almost everything that Mojo Nation does can be
>done with some subset of the available counterparties, so if you need a
>response from K different counterparties, and you have sent queries to K of
>them, and then you get K-1 responses back, and then time passes, is it better
>to keep waiting in case the Kth response arrives, or to send a query to a new,
>K+1th counterparty?  If you do the former, you are waiting.  If you do the
>latter, you may end up congesting your own connection with unnecessary extra
>queries and responses.
>
>One important step forward was to separate the question of "when do we
>initiate a replacement operation?" from "when do we decide not to process the
>response?".  In other words, there is no need to ignore response K, should it
>arrive, just because we have already sent out query K+1.

That makes sense.  I don't know that E will be supporting the Mojo Nation
abstraction of, "Many sources, any of which can provide a useful response"
as a basic primitive.  That is the environment where the Mojo Nation
timeout dilemma becomes most acute.
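
The separation Zooko describes above, between launching a replacement query
and refusing a late response, could be sketched roughly as follows.  This is
only an illustration in Python, not Mojo Nation or E code; the class name
QueryTracker, its methods, and the max_pending memory budget are all
hypothetical.

```python
class QueryTracker:
    """Tracks outstanding queries. All names here are hypothetical."""

    def __init__(self, replace_after, max_pending):
        self.replace_after = replace_after  # seconds before a replacement is launched
        self.max_pending = max_pending      # assumed memory budget, in remembered queries
        self.pending = {}                   # query_id -> send time (insertion ordered)
        self.replaced = set()               # queries that already triggered a replacement

    def sent(self, query_id, now):
        self.pending[query_id] = now
        # Forget the oldest queries only when memory has to be reclaimed.
        while len(self.pending) > self.max_pending:
            oldest = next(iter(self.pending))
            del self.pending[oldest]
            self.replaced.discard(oldest)

    def due_for_replacement(self, now):
        """Return queries that have waited long enough to justify a replacement.

        Each query is reported once, and it stays in `pending` afterwards.
        """
        due = [q for q, t in self.pending.items()
               if q not in self.replaced and now - t >= self.replace_after]
        self.replaced.update(due)
        return due

    def response(self, query_id):
        """Accept a response, however late, if the query is still remembered."""
        if query_id not in self.pending:
            return False
        del self.pending[query_id]
        self.replaced.discard(query_id)
        return True
```

Here the replacement decision is driven by elapsed time, while a response is
refused only after its query has been evicted to stay within the memory
budget, so response K is still processed even if query K+1 has gone out.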

>
>...
>
>Anyway, you asked how much of the problems in Mojo Nation were due to "too
>short" timeouts, and my answer is that the problem was that timeouts triggered
>two separate things, so they were always either too short or too long.  Now
>that we launch replacement operations separately from "refuse to ever process
>this response, should it arrive", I want the former to happen at a time
>dynamically chosen to be 97% sure that it would have arrived before now if it
>were coming and I want the latter to be "Whenever we need to reclaim some
>memory resources.".  So the current hard-coded "60 seconds" timeout is
>currently too long for the former and too short for the latter.

In the long run, someone will have to choose timeout values.  I hope we can
give them some useful guidance.
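
As one possible form of guidance, the "97% sure" delay Zooko asks for could
be approximated by an empirical quantile of the response latencies seen so
far.  The sketch below is an assumption on my part, not anything Mojo Nation
is known to do: the function name, the nearest-rank method, and the 60 second
fallback are chosen for illustration.  It also only counts responses that did
arrive, so it says nothing about the fraction that never come.

```python
import math


def replacement_delay(latencies, confidence=0.97, default=60.0):
    """Nearest-rank empirical quantile of observed response latencies, in seconds.

    `latencies` holds round-trip times of responses that actually arrived.
    With no history, fall back to `default` (the hard-coded value from the thread).
    """
    if not latencies:
        return default
    ordered = sorted(latencies)
    # 1-based nearest rank, clamped so a tiny confidence still picks a sample.
    rank = max(1, math.ceil(confidence * len(ordered)))
    return ordered[rank - 1]
```

With ten samples and the default confidence this picks the slowest observed
response, so a small history gives a conservative delay.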

One thing we could provide is the recent minimum observed round-trip time
on a path.  Then people could avoid a 60 second timeout on a Mixmaster
email path.  :-)
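
A minimal sketch of what that might look like, in Python rather than E, with
every name (RecentMinRtt, timeout_is_plausible, the window size) invented for
illustration:

```python
from collections import deque


class RecentMinRtt:
    """Keeps the last `window` round-trip samples for one path (hypothetical API)."""

    def __init__(self, window=16):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def observe(self, rtt_seconds):
        self.samples.append(rtt_seconds)

    def minimum(self):
        """Smallest recent round-trip time, or None if nothing was observed yet."""
        return min(self.samples) if self.samples else None


def timeout_is_plausible(timeout_seconds, min_rtt):
    """A timeout at or below the fastest recent round trip can never succeed."""
    return min_rtt is None or timeout_seconds > min_rtt
```

A caller choosing a timeout could check it against minimum() first; a 60
second timeout on a path whose fastest recent round trip was an hour would be
rejected.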

Cheers - Bill


-------------------------------------------------------------------------
Bill Frantz       | Microsoft Outlook, the     | Periwinkle -- Consulting
(408)356-8506     | hacker's path to your      | 16345 Englewood Ave.
frantz@netcom.com | hard disk.                 | Los Gatos, CA 95032, USA