[E-Lang] remote comms: Timeouts and Connection Failure

Dean Tribble tribble@e-dean.com
Mon, 02 Apr 2001 21:37:03 -0700


This took much longer to write than I expected.  Please comment!  (and I 
hope it stayed coherent :-)
------------
There are a few threads floating around about impatience, timeouts on 
promises, connection failure, and so forth.  This message attempts to 
respond to the whole thread by de/reconstructing the issue.

Timeouts are almost always motivated by either bottom-up concerns or 
top-down concerns.  The primary bottom-up concern on this list is 
synthesizing a virtual connection from an unreliable underlying delivery 
mechanism, but similar concerns appear wrt other computational resources; 
e.g., is my computation going to finish or starve, am I getting enough 
bandwidth, etc.  For concreteness, I will focus discussion below on 
comm-based bottom-up motivations.  Top-down concerns are generally driven by 
requirements; e.g., if my auction doesn't respond within 5 seconds, 
customers will go away (or sometimes worse, call support).

I will now make some strong claims.  There are a few exceptions to them, 
which I challenge the reader to provide :-)

g) *All* timeouts are either for bottom-up or top-down reasons.  Everything 
in the middle is confused.

h) Bottom-up timeouts are meta-information about the deployment 
environment, and should *never* appear in application code.

i) Top-down timeouts are *unrelated* to comm.

j) The "every ref is a sturdy ref" issue is unrelated to time-outs.

Hopefully the above points are the second kind of obvious:  after they are 
explained they will be obvious :-)  On to the explanation.

g) *All* timeouts are either for bottom-up or top-down reasons.  Everything 
in the middle is confused.

This is actually the hardest/longest to explain, but will have nice 
examples :->  There are four sources of time issues for any software 
system:  system requirements, hardware requirements, common knowledge 
requirements, and the design.

- System requirements are all top-level (e.g., fire the missile in 5 
seconds or kiss...).

- Hardware time requirements are all bottom-up requirements for continued 
operation, e.g., refresh memory every N cycles or lose memory.  These are 
all part of the infrastructure, and given any kind of abstraction support, 
should be invisible to the application programmer, with the possible 
exception of design rules (e.g., no unbounded loops).

- Common knowledge requirements result from the *fundamental* impossibility 
in modern hardware of having a component X connected by a wire to component 
Y know a fact F (e.g., the transaction to transfer money from X to Y is 
committed, or the connection is still up), and know that:  Y knows F, and Y 
knows that X knows F, and Y knows that X knows that Y knows F, etc.  This is 
the coordinated attack (Two Generals) problem, a close relative of the 
Byzantine Generals problem, if you want more information.  Time is often 
used to finesse this problem; for example, 2-phase commit uses time-outs 
for distributed transactions.  TCP 
and Pluribus use connection keep-alives to determine that the connection is 
still up.  Because the use of time windows to establish common knowledge is 
only to make up for the inability to do so directly, it is always a 
bottom-up requirement.

- Design

To address one of the driving sources of time-requirements above, system 
designs will often explicitly address time issues in multiple locations in 
the design.  For example, if A calls B, and A must respond within T 
seconds, B could abort if it hasn't finished within T seconds because it 
knows that A no longer cares.  Other than for optimization, such design 
choices are always motivated by one of the other three sources of time 
issues, and therefore reduce to the previous categories:  they are driven 
either by bottom-up needs or by top-down needs.  BTW: the design results 
that appear separately 
motivated are usually attempts to establish common knowledge within the 
application behavior.
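
To make the A-calls-B example concrete, here is a minimal sketch in 
Python.  Everything in it (Deadline, b_do_work, the injectable clock) is a 
made-up name for illustration, not anything from E or Joule.  The point is 
only that the T B checks is A's top-down requirement handed down, not a 
number B invents on its own.

```python
import time


class Deadline:
    """A's top-down time window, handed to B (hypothetical helper)."""

    def __init__(self, seconds_from_now, clock=time.monotonic):
        # The clock is injectable so the sketch can be tested without sleeping.
        self._clock = clock
        self._expires_at = clock() + seconds_from_now

    def expired(self):
        return self._clock() >= self._expires_at


def b_do_work(steps, deadline):
    """B runs its steps, but aborts once A no longer cares about the answer."""
    results = []
    for step in steps:
        if deadline.expired():
            return None  # A has already given up; stop wasting effort
        results.append(step())
    return results
```

Note that B has no time constant of its own here; asking "Why?" of B's 
abort leads straight back to A's requirement.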

h) Bottom-up timeouts are meta-information about the deployment 
environment, and should *never* appear in application code.

Most of the impatience proposals have been about using time explicitly 
within application code to address a common knowledge issue.  This is 
*always* the wrong thing.  The *knowledge* is application/domain 
specific.  The *establishing* that it is common is always about the 
meta-information.  That is a separate concern, and should always be 
addressed separately.  Note that I do not mean that it should not be 
addressed; but it is very separate, and code which combines the two is much 
more complicated, and substantially less robust.  I will illustrate by 
continuing the common knowledge discussion above:

The knowledge of what time period to use for finessing common knowledge is 
*necessarily* dependent on the deployment environment.  For example, 10 
seconds could be a perfectly reasonable window for connection keep-alives on 
an Ethernet.  On the Internet, 10 seconds might be just long enough to pass 
cursory testing and really hurt you when you deploy to customers (because 
Internet latency is often higher than that).  To a satellite it is of 
course a ludicrously short time-out.  But on that giant cluster-server with 
shared-memory streams, it is too *long* by orders-of-magnitude, and could 
result in substantially lower system performance and reliability.  And yet 
we want to use the same code across all of those environments (and indeed 
most big Web applications are in at least 3 of the 4).  Thus, the 
time-window information is meta-information about the comm environment, and 
should be used by the virtual connection abstraction (e.g., TCP) to 
synthesize connections on the data-gram layer.  The application code is in 
terms of this abstraction (or rather, the object-layer abstraction above 
it).  The deployment of the application is where the application rubber 
meets the time-dependent road (so BTW the Pluribus time-outs need to be 
configurable).
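
A rough sketch of that separation, in Python.  The names (CommProfile, 
VirtualConnection, PROFILES) and every number in the table are assumptions 
made up for illustration; they are not Pluribus or TCP settings.  What it 
tries to show is that the time windows live in deployment configuration and 
in the connection abstraction, while the application code contains no time 
constants at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommProfile:
    """Deployment meta-information (hypothetical, illustrative only)."""
    keepalive_interval_s: float
    missed_before_dead: int

    def dead_after_s(self):
        return self.keepalive_interval_s * self.missed_before_dead


# One entry per deployment environment; the values are placeholders.
PROFILES = {
    "cluster": CommProfile(0.01, 3),
    "ethernet": CommProfile(2.0, 5),
    "internet": CommProfile(30.0, 4),
    "satellite": CommProfile(120.0, 5),
}


class VirtualConnection:
    """Synthesizes 'connected or not' from keep-alives and the profile."""

    def __init__(self, profile):
        self._profile = profile
        self._last_heard = 0.0

    def heard_from_peer(self, now):
        self._last_heard = now

    def is_up(self, now):
        return now - self._last_heard < self._profile.dead_after_s()


def application_logic(conn, now):
    # Application code is written against the abstraction: no timeouts here.
    return "send" if conn.is_up(now) else "handle-broken-connection"
```

The same application_logic runs unchanged in all four environments; only 
the profile chosen at deployment differs.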

i) Top-down timeouts are *unrelated* to comm.

User/system requirements don't care about comm.  If you have a trading 
application in which the service must react within 20 seconds, it doesn't 
matter whether the backend network is slow, the database is slow, the JVM 
is garbage collecting, the connection is down, the sun is shining, 
etc.  Once the server has that request, the user requirements timer is 
ticking, and if it gets ticked off, the server should respond with the 
appropriate apologies page.  Every time a designer wants to put in a 
time-sensitive element, the question is "Why?", and the answer is always to 
move the time-sensitivity to the place indicated by the answer, and ask the 
question again.  System time windows should always be implemented as a race 
between the request action and the time-delayed recovery action; the winner 
gets to happen, and the loser is ignored (or cleaned up).
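
The race can be sketched in a few lines of Python with asyncio.  The 
function name race and its parameters are hypothetical, chosen for this 
example; E itself would express this with promises rather than coroutines.  
The sketch assumes both actions are coroutine functions taking no 
arguments.

```python
import asyncio


async def race(request_action, recovery_action, window_s):
    """Run the request against a time-delayed recovery; the winner happens."""

    async def delayed_recovery():
        await asyncio.sleep(window_s)
        return await recovery_action()

    request = asyncio.ensure_future(request_action())
    recovery = asyncio.ensure_future(delayed_recovery())

    done, pending = await asyncio.wait(
        {request, recovery}, return_when=asyncio.FIRST_COMPLETED
    )

    # The loser is cleaned up: cancel it and wait for the cancellation to land.
    for loser in pending:
        loser.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    # If both finished in the same instant, the request wins.
    winner = request if request in done else recovery
    return winner.result()


async def fast_backend():
    return "quote"


async def slow_backend():
    await asyncio.sleep(0.2)
    return "quote"


async def apology():
    return "apologies page"
```

For example, asyncio.run(race(slow_backend, apology, 0.05)) yields the 
apologies page, and it does so regardless of *why* the backend was slow.  
The timer is attached to the user requirement, not to any particular 
connection or promise.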

Load shedding is similarly meta and therefore separable:  If X calls Y, and 
Y gets too busy, the right thing (in at least Web environments, phone 
switches, network routing, plant scheduling...) is for a strategy on X to 
notice that, and then shed load *coming in* to X, not start breaking 
promises at the boundary between X and Y.
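
A deliberately tiny Python sketch of shedding at the front door of X.  The 
class name, the max_outstanding limit, and the use of "requests still 
outstanding at Y" as the busyness signal are all assumptions for the 
example; a real strategy would be more elaborate.

```python
class Service:
    """X: refuses new incoming work when its calls to Y are backing up."""

    def __init__(self, max_outstanding):
        self._max = max_outstanding
        self._outstanding = 0  # requests handed to Y and not yet answered

    def accept(self, request):
        if self._outstanding >= self._max:
            # Shed load coming *in* to X; nothing already promised is broken.
            return "shed: try again later"
        self._outstanding += 1
        return "accepted"

    def downstream_done(self):
        # Called when Y answers one of the outstanding requests.
        self._outstanding -= 1
```

Requests that were already accepted still get their answers from Y; only 
new arrivals are turned away.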

Note that time-outs associated with specific promises are very much 
in-band, and not meta.  They only coincidentally have anything to do with 
top-down timing requirements.

j) The "every ref is a sturdy ref" issue is unrelated to time-outs.

There are two issues tangled together in the ERiaSR discussions: -2) the 
granularity of connectedness and -3) the relation between reference 
integrity and the low-level connection.

j -2) Connectedness

This issue is about whether there is value in the notion of a connection 
between an object's client and that object (in which said connection can 
break).

Zooko said:
"My (perhaps controversial) assertion is that there is no real useful 
notion of a "connection" on an untrusted network other than whether or not 
you have given up on waiting."

This is an easy bug to slide into.  The question is whether you had a 
reason to wait in the first place.  Or, stated even more confusingly, a 
connection is not the uncertainty of being connected, but rather the 
absence of certainty that you are not.  Even though I may not be sure of 
the next message, I can be very sure that I am *not* in the middle of an 
SSL session if I never went through the steps to set one up.

More clearly, establishing a connection is about building common knowledge 
that you are connected.  As above, hardware only lets you establish lesser 
conditions that only approximate common knowledge, but like all 
cooperation, they are sufficiently robust illusions that billions of people 
can succeed with them.

The illusion of connection is so valuable, and the failure of that illusion 
so significant, that any system that does not let the programmer 
distinguish between the two is fundamentally impairing their ability to 
program in a distributed system.  All that disbelieving the illusion of 
connection does is impose all the uncertainty of communication on all 
requests, to no advantage.  It is much like requiring that all circuits be 
designed as analog, and discarding the digital logic toolkit.

j -3) Disconnect and reference integrity

The remaining question is what happens when the connection goes away.  This 
is a completely separable issue, and indeed E takes a different stance than 
Joule.  In Joule, when a connection is broken, the individual remote 
references do not find out about it.  Instead, the keeper for that 
connection finds out about it.  A keeper is a KeyKOS concept for an 
out-of-band exception handler that might be able to address meta-issues 
about the computation and let it proceed; e.g., a page manager for pages 
not in main memory.  The keeper then implements a recovery strategy if 
there is a coherent one, like find where the object moved to, go to a 
replica, terminate the entire program, break all the promises, etc.

As I recall, the motivating case for breaking all the promises is 
amnesia.  Amnesia is when a service fails, restarts from an earlier 
checkpoint, and believes it is up-to-date even though it has forgotten many 
events.  You would think it would be easy for it to discover, with sequence 
numbers or some such, that it was confused, but what if two communicating 
machines *both* had amnesia?  The other reason is that many objects are 
ephemeral, and it would be difficult to figure out how to get them 
reconnected.  So the simplest thing, which is easy to explain, easy to 
implement, and easy to assure correctness of, is breaking all the promises.

The intermediate solution that I prefer is one in which there is a 
connection keeper, and the default keeper breaks all the promises.  That 
would allow experimentation with alternate strategies.  This is 
conveniently upwards compatible with E's current semantics....
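
Roughly what I have in mind, as a Python sketch.  None of these names 
(Promise, Connection, BreakAllKeeper, ReplicaKeeper, connection_lost) come 
from E or Joule; they are invented here, and the replica strategy is just 
one example of an alternate keeper.

```python
class Promise:
    """Stand-in for a remote promise: pending until resolved or broken."""

    def __init__(self):
        self.state = "pending"
        self.problem = None

    def smash(self, problem):
        if self.state == "pending":
            self.state = "broken"
            self.problem = problem


class BreakAllKeeper:
    """Default keeper: break every promise on the lost connection."""

    def connection_lost(self, connection, problem):
        for promise in connection.promises:
            promise.smash(problem)


class ReplicaKeeper:
    """Alternate strategy: retarget to a replica if there is one."""

    def __init__(self, replica=None):
        self.replica = replica

    def connection_lost(self, connection, problem):
        if self.replica is None:
            BreakAllKeeper().connection_lost(connection, problem)
        else:
            connection.target = self.replica  # promises stay pending


class Connection:
    def __init__(self, target, keeper=None):
        self.target = target
        # No keeper supplied means the default: break all the promises.
        self.keeper = keeper if keeper is not None else BreakAllKeeper()
        self.promises = []

    def new_promise(self):
        promise = Promise()
        self.promises.append(promise)
        return promise

    def lost(self, problem):
        # The keeper, not the individual references, hears about the failure.
        self.keeper.connection_lost(self, problem)
```

Code that never mentions a keeper sees break-all-the-promises behavior, 
which is why this stays upwards compatible.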

Another intermediate solution is to not break refs declared sturdy, but 
break ephemeral refs.  MarkM:  are there reasons other than the above why 
that wouldn't be as good?


Well that's enough explanation.  It's all obvious now, right?

:-)  dean