requirements for comms (was: Re: [E-Lang] Security Breach:
Nominee for the Stock Exchange Prize)
Bill Frantz
frantz@pwpconsult.com
Fri, 27 Apr 2001 17:55:22 -0700
At 12:12 PM -0700 4/26/01, zooko@zooko.com wrote:
> Bill Frantz wrote:
>> >This situation is rare if you are thinking of "connection" failures, but
>> >it is a common symptom of logic errors in the sending program. Normally
>> >keepalives are sent by the low-level messaging code, while the message
>> >you're expecting is coming from the application. If the application gets
>> >stuck in a loop, or misinterprets the request as something that needs no
>> >reply, you'll get your keepalives but not your message. Maybe I write
>> >particularly bad code, but I see this behavior much more frequently than
>> >an undetected lost connection.
>>
>> It seems to me that in the case where the application classifies the
>> request as "no reply needed", the application would continue to
>> respond to application level keep alives. There is probably no solution
>> short of application level timeouts, and getting those right is hard, as
>> Zooko notes. (The development history of TCP shows quite a bit of timeout
>> tuning too.)
>
>Bill: I don't understand this paragraph. Alan had suggested that a common
>failure mode is that the E runtime continues to operate correctly and to
>deliver unforgeable proven-fresh keepalives, but that the E code running on top
>of it has either gotten hung or has forgotten that it was supposed to respond.
>In that case, the "keepalives impatience policy" will not raise an alarm, but
>the desired response will never come. This is how the keepalives policy can
>falsely predict a response which never comes, and normal "connection breaking"
>exceptions are how the keepalives policy can falsely predict failure.
At one point, application level keep-alives were suggested as an
alternative to system level ones. I was showing that application level
keep-alives probably would have the same problem as system level ones in
the case where the requestor expected a response, and the requestee decided
that no response was necessary.
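
To make the distinction concrete, here is a minimal sketch (not E code; the
class and field names are hypothetical) of a requestor that tracks keep-alive
freshness and the reply deadline separately. It shows that fresh keep-alives,
at whichever level they are generated, say nothing about whether the reply
itself is still coming.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    """Hypothetical bookkeeping for one outstanding request."""
    sent_at: float            # when the request was sent
    last_keepalive_at: float  # when the most recent keep-alive arrived
    keepalive_limit: float    # longest tolerated silence between keep-alives
    response_limit: float     # longest tolerated wait for the actual reply

    def on_keepalive(self, now: float) -> None:
        # Keep-alives only prove that the peer can still send them.
        self.last_keepalive_at = now

    def status(self, now: float) -> str:
        if now - self.last_keepalive_at > self.keepalive_limit:
            return "peer-silent"    # what a keep-alive policy can detect
        if now - self.sent_at > self.response_limit:
            return "reply-overdue"  # only a reply-level timeout catches this
        return "waiting"
```

A requestee that keeps answering keep-alives but has decided that no reply is
needed never reaches "peer-silent". Only the second check fires, and choosing
response_limit is exactly the hard tuning problem.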
>> How much of the problems in Mojo Nation were due to "too short" timeouts?
>
>It is an interesting question. Almost everything that Mojo Nation does can be
>done with some subset of the available counterparties, so if you need a response
>from K different counterparties, and you have sent queries to K of them, and
>then you get K-1 responses back, and then time passes, is it better to keep
>waiting in case the Kth response arrives, or to send a query to a new, K+1th
>counterparty? If you do the former, you are waiting. If you do the latter, you
>may end up congesting your own connection with unnecessary extra queries and
>responses.
>
>One important step forward was to separate the question of "when do we initiate
>a replacement operation?" from "when do we decide not to process the
>response?". In other words, there is no need to ignore response K, should it
>arrive, just because we have already sent out query K+1.
That makes sense. I don't know that E will be supporting the Mojo Nation
abstraction of, "Many sources, any of which can provide a useful response"
as a basic primitive. That is the environment where the Mojo Nation
timeout dilemma becomes most acute.
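
For reference, the separation Zooko describes might look roughly like the
sketch below. This is not Mojo Nation code. The class, its methods, and the
fixed replace_after delay are all assumptions made for illustration. The point
is only that "launch a replacement" and "stop accepting the reply" are
triggered by different events.

```python
class QueryRound:
    """Hypothetical tracker for one 'need K responses' operation."""

    def __init__(self, needed, replace_after):
        self.needed = needed                # K: responses required
        self.replace_after = replace_after  # silence before querying someone else
        self.sent = {}                      # peer -> time the query was sent
        self.replaced = set()               # peers already covered by a replacement
        self.responses = {}                 # peer -> response received

    def send(self, peer, now):
        self.sent[peer] = now

    def overdue(self, now):
        """Peers whose silence justifies launching a replacement query."""
        return [p for p, t in self.sent.items()
                if p not in self.responses
                and p not in self.replaced
                and now - t >= self.replace_after]

    def mark_replaced(self, peer):
        # Launching a replacement does NOT stop us from accepting a late reply.
        self.replaced.add(peer)

    def on_response(self, peer, response):
        """Record a reply from any peer we still remember; True once K are in."""
        if peer in self.sent:
            self.responses[peer] = response
        return len(self.responses) >= self.needed

    def forget(self, peer):
        """Drop a peer's state; meant to be called only to reclaim memory."""
        self.sent.pop(peer, None)
        self.replaced.discard(peer)
```

Here overdue() drives the replacement decision, while forget() is the only
thing that makes a late reply unusable, and nothing ties it to a clock.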
>
>...
>
>Anyway, you asked how much of the problems in Mojo Nation were due to "too
>short" timeouts, and my answer is that the problem was that timeouts triggered
>two separate things, so they were always either too short or too long. Now that
>we launch replacement operations separately from "refuse to ever process this
>response, should it arrive", I want the former to happen at a time dynamically
>chosen to be 97% sure that it would have arrived before now if it were coming,
>and I want the latter to be "whenever we need to reclaim some memory
>resources". So the current hard-coded "60 seconds" timeout is too
>long for the former and too short for the latter.
In the long run, someone will have to choose timeout values. I hope we can
give them some useful guidance.
One thing we could provide is the recent minimum observed round-trip time
on a path. Then people could avoid a 60 second timeout on a Mixmaster
email path. :-)
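
As a rough sketch of what such guidance could look like (hypothetical names
throughout; nothing here is an existing E or Mojo Nation interface), a
per-path tracker could keep a window of recent round-trip times and report
both the recent minimum and a high percentile. The 97% figure is Zooko's. The
window size and the nearest-rank percentile method are assumptions.

```python
from collections import deque


class RttTracker:
    """Hypothetical per-path record of recently observed round-trip times."""

    def __init__(self, window=100):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, rtt):
        self.samples.append(rtt)

    def recent_minimum(self):
        """Smallest recent round-trip time, or None if nothing was observed."""
        return min(self.samples) if self.samples else None

    def replacement_delay(self, percent=97, default=60.0):
        """Delay by which `percent`% of recent replies had already arrived."""
        if not self.samples:
            return default  # no data yet: fall back to a fixed guess
        ordered = sorted(self.samples)
        # Nearest-rank percentile, computed with integer arithmetic.
        rank = max(1, (percent * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    def is_plausible(self, timeout):
        """A timeout at or below the recent minimum RTT can never be met."""
        minimum = self.recent_minimum()
        return minimum is None or timeout > minimum
```

On a path whose fastest recent reply took hours, is_plausible(60.0) returns
False, which is the warning a person choosing timeout values would want.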
Cheers - Bill
-------------------------------------------------------------------------
Bill Frantz | Microsoft Outlook, the | Periwinkle -- Consulting
(408)356-8506 | hacker's path to your | 16345 Englewood Ave.
frantz@netcom.com | hard disk. | Los Gatos, CA 95032, USA