[E-Lang] remote comms: Timeouts and Connection Failure

Marc Stiegler marcs@skyhunter.com
Wed, 4 Apr 2001 22:22:18 -0700


> > j) The "every ref is a sturdy ref" issue is unrelated to time-outs.
>
> Whoops -- there are definitely important interactions between ERiaSR,
> impatience policies, ordering guarantees, at-most-once guarantees, etc.
> To wit:
>
> If ERwaSR, then the underlying comms implementation couldn't notify the
> application code that a "connection" were "broken", (i.e. that impatience
> had been triggered) by smashing the LiveRefs.  I consider this to be a
> *feature* of ERiaSR, because I don't want the comms implementation
> signalling *up* like that -- I want impatience to be implemented solely in
> application code (although of course with help from E) and I want the
> underlying comms implementation to be incapable of breaking my promises in
> my application code.  (This is why MarcS's custom impatience policy[1],
> while very cool, isn't good enough, because now *either* the custom
> impatience policy *or* the underlying comms implementation's impatience
> policy can trigger my impatience handler.)

Do you not expect it to be the common case that a broken connection is a
good reason to stop waiting? This surprises me. And if the common case is
that you want to stop waiting when the connection breaks, then there are
ways of handling the uncommon case that are reasonably straightforward,
though not quite as easy as what I presented earlier (though, on second
thought, I may be able to abstract it out with a tool or two, the way basic
impatience abstracts out with a timebomb and a promiseFirstResolved tool).
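
For readers without the earlier post at hand, the basic pattern can be
sketched roughly as follows. This is an illustrative Python asyncio analogue,
not E code: timebomb, first_resolved (standing in for promiseFirstResolved),
ImpatienceError, slow_reply and demo are all hypothetical names chosen for the
sketch. A reply that fails because the connection broke would surface as an
exception from the reply itself, so either the timebomb or the comms layer can
end the wait.

```python
import asyncio


class ImpatienceError(Exception):
    """Raised when the timebomb goes off before any reply arrives."""


async def timebomb(seconds):
    # Hypothetical helper: settles by raising once the fuse runs out.
    await asyncio.sleep(seconds)
    raise ImpatienceError(f"no reply within {seconds} s")


async def first_resolved(*awaitables):
    # Hypothetical stand-in for promiseFirstResolved: return (or raise) the
    # outcome of whichever awaitable settles first, and cancel the others.
    tasks = [asyncio.ensure_future(a) for a in awaitables]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


async def slow_reply(delay, value):
    # Simulated remote answer that arrives after `delay` seconds.
    await asyncio.sleep(delay)
    return value


async def demo():
    # The reply beats the timebomb: we get the reply.
    fast = await first_resolved(slow_reply(0.01, "pong"), timebomb(0.5))
    # The timebomb beats the reply: the impatience handler runs instead.
    try:
        await first_resolved(slow_reply(0.5, "pong"), timebomb(0.01))
        late = "reply"
    except ImpatienceError:
        late = "gave up"
    return fast, late
```

Calling asyncio.run(demo()) exercises both outcomes. Handling the uncommon
case (keep waiting even though the connection broke) would mean catching the
reply's own failure separately from ImpatienceError, which this sketch does
not attempt.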

In all my programs to date, I have only once had an impatience policy that
needed timeout policies other than the "connection-lost" policy (the one
exception was my prototype of the system for which Dean and Chris are
building a bigger better version at the moment). And in all the cases where
comm could be lost, I was interested in knowing that fact for one reason or
another, because I have required many different policies for reacting to
lost "connections".

At moments like these, I yearn to see you write a few E apps with E as it
is, with the sturdy/live separation, and see if your objections-in-theory
really are objections-in-practice. I have not had any experiences to suggest
the objections make it across the theory/practice boundary. Though of
course, all my experiences, while perhaps composing the bulk of all programs
written to date in E, are a pretty paltry test suite :-) Which is why you
should write some too :-)

> ... which means that I've been led, with much reluctance, back to wanting
> a "SturdyRef/LiveRef" abstraction (since those little proxy objects might
> as well be built-in to the language...), but with the important difference
> that LiveRefs are *never* broken by underlying comms implementation, only
> by application-level impatience policy, or maybe even not then.  Maybe
> they are unbreakable.

Somehow, I feel like I just have to be able to find out that the
"connection" has "broken". Connection breakage is one of the basic truths of
distributed computing, and not being able to implement different policies
for different requirements in the presence of breakage just sounds like
madness to me. I don't want to have to create impatience policies for all my
distributed connections when in fact the only thing that breaks in practice
is the "connection"--even though I made it easy, with promiseFirstResolved
etc., to create impatience policies, why make me do it over and over again?
Though I guess you are really arguing for replaceable-default impatience
policies, in which case I find myself saying, okay, but I have found the
current E impatience policy (based on a long-interval keep-alive) to be a
great default-default. I find the case for replaceable defaults positive and
interesting but not compelling for version 1.0 of E.

> I think that this behaviour should be considered as "guaranteeing faster
> discovery of non-responsiveness through the use of unforgeable
> keep-alives", instead of as "determining that the connection is still
> up".  Clearly it does *not* determine that the connection is "still up",
> only that it *was* "still up" before the last keep-alive was received, and
> anyway I would like to avoid using this abstraction of "connections still
> being up" entirely until I've been convinced that we need it.

It not only guarantees faster discovery, but does it automatically and
routinely. This makes me quite fond of it. Once again, I feel like we have
at least a good default-default impatience policy.


> (In Mojo Nation, there are no operations which *require* fast discovery of
> non-responsiveness, although it would possibly be a performance
> improvement if the benefits of initiating replacement operations sooner
> were greater than the costs of doing that discovery.  We do not do fast
> discovery of non-responsiveness, and anyway the speed-up, if any, would be
> dwarfed by our biggest performance problem, which is caused by some code
> that we wrote thinking that we could discover that the underlying
> "connection" was "broken" and then "close" that "connection", thus saving
> some bandwidth.  In fact we can't do that because our underlying comms, in
> order to be able to hop over firewalls, does not rely on TCP's "connection
> status" features.  As an aside, this also makes our protocol safer against
> active attacks, but the motivation was firewall-hopping, not paranoid
> security considerations.)

This makes sufficiently little sense to me that I now have to wonder if you
and I mean the same thing when we say "connection". I'm afraid my definition
of connection is a little too concrete to be helpful, however--for me a
connection is "lost" when E catches a problem in the when-catch block :-)
And a "connection" is the infrastructure that I never really see directly,
which allows me to send an eventual message: I do indeed have a "connection"
to anything I can send to and get a non-problem resolution of the promise
(i.e., a fulfillment :-) Perhaps this definition of connection, while too
concrete in one way, is actually a higher-level concept than the one you are
using?

>
> (In fact, now that I think about it, the next performance improvement in
> Mojo Nation, which will effectively fix the performance losses due to
> accidentally relying on "connections", amounts to doing faster discovery
> of non-responsiveness from the top-down requirements and knowledge rather
> than from the bottom-up knowledge of the comms substrate.  In this planned
> improvement, you remember how long it took this particular counterparty to
> respond to this particular kind of transaction in the past, and if the
> current transaction has already taken longer than 97% of previous such
> transactions, then launch a replacement operation.  The bottom-up solution
> would not be able to do that kind of nuanced solution, and it would also
> suck up network resources and computational resources with its
> unforgeable keep-alives.)

This sounds like it is very easy to do in E. The keep-alives are, as Bill
highlighted recently, so infrequent they hardly constitute a drain on
resources. You just write a smart timebomb for use with the code I posted
earlier, put it in the appropriate when-catches, and poof, it is done. It
would be easier with replaceable-default impatience policies if you wanted
to use the same one everywhere...but it doesn't compel me to feel a need to
eliminate liverefs from E.
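
The fuse length for such a smart timebomb could be computed along these
lines. This is a minimal Python sketch of the 97% rule from the quoted
paragraph, not E code and not anything from Mojo Nation: ResponseHistory,
record, fuse, the 20-sample minimum and the 30-second default are all
assumptions made for illustration.

```python
class ResponseHistory:
    """Past response times, keyed by (counterparty, kind of transaction)."""

    def __init__(self, percent=97, min_samples=20):
        self.percent = percent          # e.g. 97 for the "97%" rule
        self.min_samples = min_samples  # assumed threshold for trusting history
        self.samples = {}

    def record(self, counterparty, kind, seconds):
        # Remember how long this counterparty took for this kind of request.
        self.samples.setdefault((counterparty, kind), []).append(seconds)

    def fuse(self, counterparty, kind, default=30.0):
        # Fuse length: the smallest recorded time that at least `percent`%
        # of past transactions finished within (nearest-rank percentile).
        times = sorted(self.samples.get((counterparty, kind), []))
        if len(times) < self.min_samples:
            return default  # too little history; fall back to a fixed fuse
        rank = (self.percent * len(times) + 99) // 100  # integer ceiling
        return times[rank - 1]
```

Once the current transaction has been outstanding longer than
fuse(counterparty, kind), it has taken longer than 97% of the recorded ones,
which is the point at which the quoted scheme would launch a replacement
operation. The application would call record() from the handler that receives
each successful reply.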

> > The illusion of connection is so valuable, and the failure of that
> > illusion so significant, that any system that does not let the programmer
> > distinguish between the two is fundamentally impairing their ability to
> > program in a distributed system.  All that disbelieving the illusion of
> > connection does is impose all the uncertainty of communication on all
> > requests, to no advantage.  It is much like requiring that all circuits
> > be designed as analog, and discarding the digital logic toolkit.
>
> My experience with Mojo Nation suggests the opposite to me, but I am
> willing to reconsider if you have any arguments more persuasive than
> analogy to the Industry Standard Successful Abstraction.  :-)

There is something very different about your Mojo Nation experience and my E
experiences. I don't even know what questions to ask to figure out how these
experiences could be so different. Certainly, Mojo Nation is a much bigger
undertaking than any of my little E programs, so it is a very important
example. But my limited knowledge of Mojo Nation suggests that the E
programmer's conceptualization of a "connection" would not particularly get
in the way of solving the problem.  Even without replaceable-default
impatience policies.  So either I am not understanding something about Mojo
Nation "connections", or you are not understanding something about E
"connections". In fact, I suspect that we are both not understanding
something (though it wouldn't be the first time I was the only one confused,
and confused about other people's confusion :-)

A place where the rubber meets the road with sturdy refs that no one has
mentioned yet is in persistence. A sturdyref is meaningless unless you are
persisting the ref. So calling for all connections to be sturdy seems to me
to be a call to assume the policy that every object ever spoken to across a
system must be persisted sturdily. I have not had enough experience with
persisting distributed systems to know whether this is a good policy (in
fact, no experience at all, except indirectly as boss of engineering at EC),
but it seems extreme to assume that this is the only policy that makes sense
(indeed, the EC experience argued strongly that this is NOT the only policy
that makes good sense). A sturdy ref that disappears forever when a computer
reboots is a liveref in sturdyref clothing. And if the computers never
reboot and the apps run forever, the liverefs are sturdyrefs in liveref
clothing :-)

--marcs