[E-Lang] remote comms: Timeouts and Connection Failure
Dean Tribble
tribble@e-dean.com
Mon, 02 Apr 2001 21:37:03 -0700
This took much longer to write than I expected. Please comment! (and I
hope it stayed coherent :-)
------------
There are a few threads floating around about impatience, timeouts on
promises, connection failure, and so forth. This message attempts to
respond to the whole thread by de/reconstructing the issue.
Timeouts are almost always motivated by either bottom-up concerns or
top-down concerns. The primary bottom-up concern on this list is
synthesizing a virtual connection from an unreliable underlying delivery
mechanism, but similar concerns appear wrt other computational resources;
e.g., is my computation going to finish or starve, am I getting enough
bandwidth, etc. For concreteness, I will focus discussion below on
comm-based bottom-up motivations. Top-down concerns are generally driven by
requirements; e.g., if my auction doesn't respond within 5 seconds,
customers will go away (or sometimes worse, call support).
I will now make some strong claims. There are a few exceptions to them,
which I challenge the reader to provide :-)
g) *All* timeouts are either for bottom-up or top-down reasons. Everything
in the middle is confused.
h) Bottom-up timeouts are meta-information about the deployment
environment, and should *never* appear in application code.
i) Top-down timeouts are *unrelated* to comm.
j) The "every ref is a sturdy ref" issue is unrelated to time-outs.
Hopefully the above points are the second kind of obvious: once they are
explained, they will be obvious :-) On to the explanation.
g) *All* timeouts are either for bottom-up or top-down reasons. Everything
in the middle is confused.
This is actually the hardest/longest to explain, but will have nice
examples :-> There are four sources of time issues for any software
system: system requirements, hardware requirements, common knowledge
requirements, and the design.
- System requirements are all top-level (e.g., fire the missile in 5
seconds or kiss...).
- Hardware time requirements are all bottom-up requirements for continued
operation, e.g., refresh memory every N cycles or lose memory. These are
all part of the infrastructure, and given any kind of abstraction support,
should be invisible to the application programmer, with the possible
exception of design rules (e.g., no unbounded loops).
- Common knowledge requirements result from the *fundamental* impossibility
in modern hardware of having a component X connected by a wire to component
Y know a fact F (e.g., the transaction to transfer money from X to Y is
committed, or the connection is still up), and know that: Y knows F, and Y
knows that X knows F, and Y knows that X knows that Y knows F, etc. If you
want more information, this problem is usually discussed as the Two Generals
(coordinated attack) problem, and is often lumped in with the Byzantine
Generals problem. Time is often used as a finesse for this problem; for
example, 2-phase commit uses time-outs for distributed transactions. TCP
and Pluribus use connection keep-alives to determine that the connection is
still up. Because the use of time windows to establish common knowledge is
only to make up for the inability to do so directly, it is always a
bottom-up requirement.
- Design
To address one of the driving sources of time-requirements above, system
designs will often explicitly address time issues in multiple locations in
the design. For example, if A calls B, and A must respond within T
seconds, B could abort if it hasn't finished within T seconds because it
knows that A no longer cares. Other than for optimization, this is always
motivated by one of the other three sources of time issues, and therefore
reduces to the previous categories: it is driven either by bottom-up needs
or by top-down needs. BTW: the design results that appear separately
motivated are usually attempts to establish common knowledge within the
application behavior.
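
As a minimal sketch of the A-calls-B example, in Python for concreteness:
A hands B the deadline that A itself is under, and B checks it between
steps. The names b_work, deadline, and clock are hypothetical, and the
clock is passed in only so the sketch can be exercised without real waiting.

```python
def b_work(steps, deadline, clock):
    """B runs its steps in order but stops once A's deadline has passed.

    steps    -- list of zero-argument callables (B's units of work)
    deadline -- the time after which A no longer cares about the answer
    clock    -- zero-argument callable returning the current time
    """
    results = []
    for step in steps:
        if clock() >= deadline:
            # A has already given up, so finishing would be wasted effort.
            return ("aborted", results)
        results.append(step())
    return ("finished", results)
```

Note that the number T still originates with A's requirement; B only
inherits it, which is the sense in which the design-level timeout reduces
to one of the other categories.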
h) Bottom-up timeouts are meta-information about the deployment
environment, and should *never* appear in application code.
Most of the impatience proposals have been about using time explicitly
within application code to address a common knowledge issue. This is
*always* the wrong thing. The *knowledge* is application/domain
specific. The *establishing* that it is common is always about the
meta-information. That is a separate concern, and should always be
addressed separately. Note that I do not mean that it should not be
addressed; but it is very separate, and code which combines the two is much
more complicated, and substantially less robust. I will illustrate by
continuing the common knowledge discussion above:
The knowledge of what time period to use for finessing common knowledge is
*necessarily* dependent on the deployment environment. For example, 10
seconds could be a perfectly reasonable window for connect keep-alives on
an Ethernet. On the Internet, 10 seconds might be just long enough to pass
cursory testing and really hurt you when you deploy to customers (because
Internet latency is often higher than that). To a satellite it is of
course a ludicrously short time-out. But on that giant cluster-server with
shared-memory streams, it is too *long* by orders-of-magnitude, and could
result in substantially lower system performance and reliability. And yet
we want to use the same code across all of those environments (and indeed
most big Web applications are in at least 3 of the 4). Thus, the
time-window information is meta-information about the comm environment, and
should be used by the virtual connection abstraction (e.g., TCP) to
synthesize connections on the data-gram layer. The application code is in
terms of this abstraction (or rather, the object-layer abstraction above
it). The deployment of the application is where the application rubber
meets the time-dependent road (so BTW the Pluribus time-outs need to be
configurable).
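
One way to picture the separation, as a rough Python sketch rather than
anything taken from Pluribus or E: the keep-alive window lives in a
per-deployment profile, only the connection layer reads it, and the
application code sees just "up" or "down". All names here (CommProfile,
PROFILES, VirtualConnection, application_step) are hypothetical, and apart
from the 10-second Ethernet figure used above, the numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommProfile:
    """Deployment meta-information for one comm environment."""
    name: str
    keepalive_seconds: float


# Placeholder values; real ones come from measuring the deployment.
PROFILES = {
    "shared-memory": CommProfile("shared-memory", 0.01),
    "ethernet": CommProfile("ethernet", 10.0),
    "internet": CommProfile("internet", 60.0),
    "satellite": CommProfile("satellite", 300.0),
}


class VirtualConnection:
    """Connection layer: the only place the keep-alive window is consulted."""

    def __init__(self, profile):
        self._profile = profile
        self._last_heard = 0.0

    def heard_from_peer(self, now):
        self._last_heard = now

    def is_up(self, now):
        # Silence longer than the deployment's window counts as "down".
        return (now - self._last_heard) <= self._profile.keepalive_seconds


def application_step(conn, now):
    """Application code: sees only up/down, never a number of seconds."""
    return "send" if conn.is_up(now) else "handle-disconnect"
```

The same application_step runs unchanged in every environment; five
seconds of silence is harmless under the "ethernet" profile and a
disconnect under the "shared-memory" one.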
i) Top-down timeouts are *unrelated* to comm.
User/system requirements don't care about comm. If you have a trading
application in which the service must react within 20 seconds, it doesn't
matter whether the backend network is slow, the database is slow, the JVM
is garbage collecting, the connection is down, the sun is shining,
etc. Once the server has that request, the user requirements timer is
ticking, and if it gets ticked off, the server should respond with the
appropriate apologies page. Every time a designer wants to put in a
time-sensitive element, the question is "Why?", and the answer is always to
move the time-sensitivity to the place indicated by the answer, and ask the
question again. System time windows should always be implemented as a race
between the request action and the time-delayed recovery action; the winner
gets to happen, and the loser is ignored (or cleaned up).
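
The race can be sketched in a few lines of Python with asyncio. This is
only an illustration under assumed names (race, slow_backend, apology are
hypothetical), not how E expresses it: the request and a timer are started
together, the first to finish wins, and the loser is cancelled.

```python
import asyncio


async def race(request, recovery_delay, recovery):
    """Race `request` (an awaitable) against a time-delayed recovery action."""
    request_task = asyncio.ensure_future(request)
    timer_task = asyncio.ensure_future(asyncio.sleep(recovery_delay))
    done, _pending = await asyncio.wait(
        {request_task, timer_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if request_task in done:
        timer_task.cancel()        # the timer lost; ignore it
        return request_task.result()
    request_task.cancel()          # the request lost; clean it up
    return recovery()


async def slow_backend(delay, answer):
    # Stand-in for whatever is slow: network, database, GC pause, ...
    await asyncio.sleep(delay)
    return answer


def apology():
    return "sorry, please try again"
```

The race does not know or care why the request was slow, which is the
point: the timer belongs to the user requirement, not to the comm layer.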
Load shedding is similarly meta and therefore separable: If X calls Y, and
Y gets too busy, the right thing (in at least Web environments, phone
switches, network routing, plant scheduling...) is for a strategy on X to
notice that, and then shed load *coming in* to X, not start breaking
promises at the boundary between X and Y.
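
A rough Python sketch of such a strategy on X, with hypothetical names
(IngressShedder, try_admit, completed) and a deliberately crude measure of
"Y is too busy" (the count of requests X has forwarded to Y and not yet
heard back on):

```python
class IngressShedder:
    """Strategy on X: refuse new work at X's front door when Y is backed up."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = 0   # forwarded to Y, no answer yet

    def try_admit(self):
        if self.outstanding >= self.max_outstanding:
            # Shed at X's edge; nothing already sent to Y gets broken.
            return False
        self.outstanding += 1
        return True

    def completed(self):
        # Call when Y answers (or fails) a request X forwarded.
        self.outstanding -= 1
```

Requests that were admitted keep their promises; only new arrivals at X
are turned away until Y catches up.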
Note that time-outs associated with specific promises are very much
in-band, and not meta. They only coincidentally have anything to do with
top-down timing requirements.
j) The "every ref is a sturdy ref" issue is unrelated to time-outs.
There are two issues tangled together in the ERiaSR discussions: -2) the
granularity of connectedness and -3) the relation between reference
integrity and the low-level connection.
j -2) Connectedness
This issue is about whether there is value in the notion of a connection
between an object's client and that object (in which said connection can
break).
Zooko said:
"My (perhaps controversial) assertion is that there is no real useful
notion of a "connection" on an untrusted network other than whether or not
you have given up on waiting."
This is an easy bug to slide into. The question is whether you had a
reason to wait in the first place. Or, stated even more confusingly, a
connection is not the uncertainty of being connected, but rather the
absence of certainty that you are not. Even though I may not be sure of
the next message, I can be very sure that I am *not* in the middle of an
SSL session if I never went through the steps to set one up.
More clearly, establishing a connection is about building common knowledge
that you are connected. As above, hardware only lets you establish lesser
conditions that only approximate common knowledge, but like all
cooperation, they are sufficiently robust illusions that billions of people
can succeed with them.
The illusion of connection is so valuable, and the failure of that illusion
so significant, that any system that does not let the programmer
distinguish between the two is fundamentally impairing their ability to
program in a distributed system. All that disbelieving the illusion of
connection does is impose all the uncertainty of communication on all
requests, to no advantage. It is much like requiring that all circuits be
designed as analog, and discarding the digital logic toolkit.
j -3) Disconnect and reference integrity
The remaining question is what happens when the connection goes away. This
is a completely separable issue, and indeed E takes a different stance than
Joule. In Joule, when a connection is broken, the individual remote
references do not find out about it. Instead, the keeper for that
connection finds out about it. A keeper is a KeyKOS concept: an
out-of-band exception handler that might be able to address meta-issues
about the computation and let it proceed; e.g., a page manager for pages
not in main memory. The keeper then implements a recovery strategy if
there is a coherent one, like find where the object moved to, go to a
replica, terminate the entire program, break all the promises, etc.
As I recall the motivating case for breaking all the promises is
amnesia. Amnesia is when a service fails, restarts from an earlier
checkpoint, and believes it is up-to-date even though it has forgotten many
events. You would think it would be easy for it to discover, using
sequence numbers or some such, that it was confused, but what if two communicating
machines *both* had amnesia? The other reason is that many objects are
ephemeral, and it would be difficult to figure out how to get them
reconnected. So the simplest thing that is easy to explain, easy to
implement, and easy to assure the correctness of is breaking all the promises.
The intermediate solution that I prefer is one in which there is a
connection keeper, and the default keeper breaks all the promises. That
would allow experimentation with alternate strategies. This is
conveniently upwards compatible with E's current semantics....
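
To make the shape of that proposal concrete, here is a small Python
sketch. It is not E or Joule code, and every name in it (Promise,
Connection, break_all_promises, on_disconnect) is hypothetical: the
connection delegates disconnect handling to a pluggable keeper, and the
default keeper breaks every outstanding promise.

```python
class Promise:
    def __init__(self):
        self.state = "pending"
        self.problem = None

    def break_with(self, problem):
        if self.state == "pending":
            self.state = "broken"
            self.problem = problem


def break_all_promises(connection, problem):
    """Default keeper: break every outstanding promise on the connection."""
    for promise in connection.promises:
        promise.break_with(problem)


class Connection:
    def __init__(self, keeper=break_all_promises):
        self.keeper = keeper
        self.promises = []

    def new_promise(self):
        promise = Promise()
        self.promises.append(promise)
        return promise

    def on_disconnect(self, problem):
        # The connection does not decide policy; it asks its keeper.
        self.keeper(self, problem)
```

With the default keeper the behavior matches breaking all the promises;
passing a different keeper (say, one that leaves promises pending while it
looks for a replica) is where the experimentation would happen.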
Another intermediate solution is to not break refs declared sturdy, but
break ephemeral refs. MarkM: are there reasons other than the above why
that wouldn't be as good?
Well that's enough explanation. It's all obvious now, right?
:-) dean