Anonymous Unspoofable Cyclic Distributed Static Linking: Part 1
Mark S. Miller
markm@caplet.com
Thu, 13 Jan 2000 11:06:12 -0800
Secure mobile code presents many interesting problems. Among them is the
issue of "distributed linking". How does module "a" import module "b"? How
is this import resolved? In what namespace? Can an attacker preload his
own "b" into some target vat's namespace, such that "a" imports that version
rather than the version he desired? Can we protect "a" against such
attacks, but still benefit from distributed replicated caching of code that
someone might import?
It turns out that the design of the old Electric Communities (aka
Communities.com) "Repository System", created mostly by Monica Anderson
(cc'ed on this message) can solve most of these problems, even though it was
built to solve a "different" problem. Actually, I just never noticed before
that the problem the Repository set out to solve is the general case of the
mobile code distributed-linking problem.
Repository Recap
The Repository was built to solve (and, in the beta of EC Habitats, solved)
the problem of distributed unspoofable caching of acyclicly-linked large
data blobs. Engineering observation: cryptographic hashes don't collide.
Therefore, if you store and retrieve data by its hash, you need never worry
about collision management. In other words, data blobs and hashes of those
blobs are in one-to-one correspondence. (Of course, Shannon would have a
fit, but, as Sussman and Abelson might say, this is the difference between
engineering and science.) Therefore, a hash can be used as a compact
designator of any data blob. The data can be looked up from any source,
whether trusted or not, as long as the recipient of the data re-hashes it to
ensure that it is indeed the data designated by the hash.
In Original-E, on which EC Habitats was built, capabilities & symbolic
traffic were transmitted over the equivalent of today's Pluribus protocol.
However, when Alice wanted to send a large bitmap to Bob, she did not
transmit the bitmap over Pluribus. Rather, she "published" the bitmap to
her Repository, and sent to Bob over Pluribus a designator of the bitmap:
the hash of the bitmap + a "location hint" pointing at her Repository as a
plausible place to obtain the bitmap.
Bob, on receiving this bitmap designator, asked his own Repository for the
corresponding data blob, and got back a promise for the answer. If his
Repository already had an entry for this hash, this promise would be
fulfilled promptly. Otherwise, his Repository would contact at least
Alice's Repository to obtain an alleged answer. On receiving data from
Alice's Repository, Bob's Repository would re-hash to ensure it was correct,
and if so, both cache it locally and fulfill the promise handed to Bob.
Bob's Repository is part of Bob's TCB, but Alice's Repository is not.
Therefore Bob's Repository had to rehash the data obtained from Alice's
Repository, but Bob doesn't need to re-hash data obtained from his own
Repository.
Advantages:
* Large data fetches are kept out-of-line, and so can proceed in parallel
with further symbolic traffic. In practice, this is a big deal. One might
eventually want to prioritize the use of bandwidth, so that the Pluribus
traffic can be higher priority than the data traffic. (MarcS: Notice
possible applicability to Edesk.)
* Data can be cached in a replicated manner by untrusted caching services,
since integrity doesn't depend on the source.
* Unlike DNS or a zillion other schemes, there is no namespace or assigned
number space that requires global coordination. Hashes have the same
anonymous first-class designation-with-integrity nature for data that
capabilities/object-references do for objects.
Security Issues:
* Since Bob's cache will re-fetch from Alice's cache on a cache miss, this
gives more opportunity for computation in Bob's vat to wall-bang information
out of Bob's vat. Since E's security premises include giving up on
preventing wall-banging anyway, this isn't a big deal. However, we must be
sure not to lose our ability to do deterministic replay (ie, E must continue
to have loggable non-determinism).
* What crypto should be used to transfer the data between Repositories?
Curiously, we need encryption for privacy, but the protocol doesn't need to
protect our integrity, since the re-hashing on reception is all the
integrity we might need. Why do we need privacy? With privacy, and given
that crypto-hashes are also adequately unguessable (is this generally
true?), we can take the stance than knowledge of the hash of a published
blob, plus access to any Repository, is authorization to obtain the blob.
Why "plus access to any Repository"? Confined computation will normally not
have access to any Repository, just as it normally won't have access to any
Introducer http://www.erights.org/elib/capability/dist-confine.html
http://www.erights.org/elang/concurrency/introducer.html . This potentially
alleviates the above wall-banging issue as well.
Acyclic Linking
Of course, besides pure data per se, the data of a blob can include such
hash-based designators of other blobs. The Repository required no mechanism
to support this -- just as it was up to the client to interpret data as a
bitmap, it was up to the client to interpret nested data designators and
perform further fetches from the Repository.
Of course, during development the source form of each such data blob would
usually need to specify not a particular edition of some other data blob,
but (implicitly) the "current version". In building blobs from
blob-sources, such development-time inter-blob names (in some
development-time namespace) gets turned into namespace-free hash-based
inter-blob designators. (Lightbulb) This can be seen as a build-time static
linking step. Unfortunately, this solution works only for acyclic blob
structures, which was fine for all the data we dealt with then, but doesn't
work at all for code. (The EC Habitats beta didn't do mobile code.)
On to Cycles in Part 2.
Cheers,
--MarkM