Re: "Harprit Kaur Barha": Info on Eros
shapj@us.ibm.com
Mon, 26 Apr 1999 09:52:27 -0400
> I'm interested in knowing more about Eros, but when visiting
> the site i noticed that some of the vital information is unavailable
> since the pages are still under progress. In particular, i am
> interested in learning more about the features rearding the handling
> of deadlocks and distribution, which appear on the page
Harprit:
The page you chanced to look at is one of the very earliest pages on the EROS
web. After a lot of work, I now know how to make EROS distribute wihtin a
cluster, but the current system does not do so. It is not clear that
distributing a single level store beyond a cluster makes sense -- there are too
many places where the resulting rollbacks would present practical difficulties.
The rule of thumb seems to be that if you can afford to treat the collection of
machines in the distributed system as a single point of failure then
distribution will work fine, otherwise you shouldn't distribute.
As to deadlocks, there is nothing to handle because EROS is deadlock-free. This
is accomplished very simply:
- All operations performed by the kernel are atomic. They either execute
completely or not at all. If, for example, a process "blocks" for a page fault
while performing a kernel operation, it gets bounced all the way out of the
kernel. It sleeps while the page is coming in, and then retries the operation
from the beginning. Because of this design, no resources are held in the kernel
by sleeping processes, and there are therefore no cross-process locking
dependencies. This removes one common source of deadlock.
- EROS does not allocate any memory after initialization. It is therefore not
possible for an action in the kernel to become blocked for lack of memory, and
memory starvation is therefore not a possible source of deadlock.
- The EROS kernel contains no unbounded loops or recursions. While there are
algorithms in the kernel that are expressed recursively, *all* of these check
the recursion depth and quit at some known depth. All of these algorithms could,
with sufficient inlining, be expressed as for(;;) loops with some upper bound on
the number of iterations. Because of this all kernel operations have bounded
duration and bounded stack depth. Bounded duration is important, because it
allows us to use a design in which only one process is conceptually in the
kernel at a time. If only one process can be in the kernel, many issues of
deadlock are avoided. Bounded stack is important to make sure that the kernel
stack does not overflow.
All that having been said, much of this would break down in a multiprocessor
implementation. The kernel includes calls in all of the necessary places to pin
objects in memory, but the pinning architecture would need to be modified for
multiprocessors, and it is possible with low likelihood that two processors
executing in the kernel simultaneously on behalf of two processes would arrive
at deadlock situations. A couple of observations about this and how to solve it
in practice:
- First, the fact that the pins() exist solves the main problem -- which is
getting all the right lock() operations to happen in all the right places. The
current calls to pin() would need to be replaced by locking calls, but this is
not hard.
- There exists a design for a locking discipline that does not require unlocks.
It is sketched in my dissertation. This has *dramatic* performance consequences.
- For systems with only two to four processes, a single grand universal kernel
lock would probably suffice, and would certainly be more efficient than
introducing locking overhead pervasively. I'm actually dubious about this, but
it's the obvious first thing to try and would "solve" the problem for commodity
multiprocessors. Keep in mind that the operations performed by the EROS kernel
are very fine grain relative to POSIX system calls. This solution therefore
*should* work rather better than it did under, say, Linux.
- Once you have to do the full multiprocessor design I think that the right
solution to deadlock will prove to be optimistic locking with scheduled
fallback. Basically, you do shared read locks and exclusive write locks. If a
process tries to lock something for writing that was previously locked for
writing (or reading) by someone else, it backs all the way out of the kernel and
blocks on the party who locked that resource. To prevent starvation, the
"retry" in this case grabs the grand unified kernel lock, thus assuring
uncontended access to resources and preventing starvation. Here again, it's a
painfully simple design that ought to go a long way toward solving the problem.
Anyway, I hope that some of these notes are useful to you. Please feel free to
follow up with me.
Jonathan S. Shapiro
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595