EROS checkpointing and memory management
Tue, 22 Jun 1999 13:46:54 -0400

Jim Lewis writes:

>I've been thinking about this since your last post. Seems to me the
>primitive one needs is the ability to say "This object and all its
>dependents need to be written atomically".

Alan Cox has suggested that this is very hard.  It definitely *is* hard.  It all
comes down to a couple of issues.  If you have done the transaction support
necessary to track the causality you have incurred a great deal of expense to
handle a low likelihood case.  If you have to trace out the dependencies after
the fact (in the style of a garbage collection mark pass) you incur a lot of
cache damage.  Worst of all, sometimes the consistency rules force you to cross
subsystem boundaries, and then things get really painful.
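
To make the after-the-fact cost concrete, here is a minimal sketch in Python of tracing an object's dependents in the style of a mark pass. The `refs` mapping and the function name are hypothetical, chosen only for illustration; they are not EROS or KeyKOS structures.

```python
def dependents_closure(root, refs):
    """Return every object reachable from `root`.

    `refs` maps an object to the objects it references (a hypothetical
    stand-in for whatever reference graph the system maintains).
    """
    seen = {root}
    stack = [root]
    while stack:
        obj = stack.pop()
        # Every step visits another object, which is where the
        # cache damage of a mark-style pass comes from.
        for nxt in refs.get(obj, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Note that the walk has to visit every dependent before a single byte can be written, and the references may lead into other subsystems.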

It turns out, however, that simply snapshotting the ENTIRE MACHINE is quite easy
and quite efficient.  Contrary to what you might expect, it does not induce
excess I/O -- there is some evidence that doing so may in practice *reduce*
total I/O in the common case.  There are definitely boundary cases, and they
definitely need to be addressed.

For an example of an implementation that does this in practice and works very
nicely, see "The Checkpoint Mechanism in KeyKOS" at:

[I lost track of who is who here:]

>> Why is having persistence managed by a library that is playing guessing games
>> of your intent a good idea ? It has to know about object relationships,
>> potentially it has to blindly snapshot the entire system.

Hopefully, it's now clear that EROS doesn't try to discover object
relationships.  It does indeed snapshot the entire system.  Several people have
put forward analogy arguments based on garbage collection to suggest that the
performance of this will be inadequate.  Some of these arguments are clever, and
worth examining.  The bottom line, however, is that the performance numbers are
what they are, and that global checkpointing works just fine.

In brief, here is why:

There are really three separate issues:

1. How fast can you build the consistent snapshot?
2. How fast can you write the resulting snapshot to the disk?
3. How much does this write traffic (which will be done again after the snapshot
is stabilized) impede other things that you would rather be doing?

The answer to (1) is that you can do the snapshot by simply marking everything
copy on write.  In our current implementation this takes something on the order
of 50ms every 5 minutes.  We have a design that I think will get us into the low
hundreds of *microseconds*.
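
As a rough illustration of why the snapshot step is cheap, here is a minimal sketch in Python of a copy-on-write snapshot. The `Page` and `Store` classes are hypothetical and exist only for this example; a real kernel would do the marking through the memory mapping hardware rather than a per-page flag.

```python
class Page:
    """A hypothetical in-memory page; `cow` means it is frozen in the snapshot."""
    def __init__(self, data):
        self.data = data
        self.cow = False


class Store:
    def __init__(self, pages):
        self.pages = dict(pages)   # page id -> live Page
        self.snapshot = {}         # page id -> Page as of the last snapshot

    def take_snapshot(self):
        # The snapshot only sets a flag per page; no page data is copied.
        for page in self.pages.values():
            page.cow = True
        self.snapshot = dict(self.pages)

    def write(self, pid, data):
        page = self.pages[pid]
        if page.cow:
            # First write since the snapshot: leave the frozen page alone
            # and redirect the live mapping to a fresh copy.
            page = Page(page.data)
            self.pages[pid] = page
        page.data = data
```

The cost of copying is paid later, and only for pages that are actually modified while the snapshot is being written out.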

Problem (2) isn't hard once the snapshot is built. The snapshot itself is
written sequentially to preallocated space.  This can be done quite fast.  In
actual practice the main problem is making sure that you do not do it so
aggressively that you impede other I/O necessary for the system to make
progress.  In short: the problem is not getting it out fast enough, it's getting
it out *slow* enough.
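
The "slow enough" point can be sketched as a paced sequential writer. This is a minimal Python illustration under assumed names (`write_snapshot`, `max_bytes_per_sec`); it is not how EROS schedules its I/O, only the general idea of capping the rate of a sequential stream.

```python
import time


def write_snapshot(pages, out, max_bytes_per_sec,
                   clock=time.monotonic, sleep=time.sleep):
    """Write `pages` back-to-back to `out`, keeping the average rate under the cap.

    `clock` and `sleep` are injectable so the pacing can be tested
    without real delays.
    """
    start = clock()
    written = 0
    for page in pages:
        out.write(page)            # sequential write into preallocated space
        written += len(page)
        # Earliest time at which this many bytes are allowed to be out.
        earliest = start + written / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)           # yield the disk to other I/O
    return written
```

A real system would presumably adapt the cap to the competing load rather than use a fixed number.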

As to (3), we have previously discussed in another part of this thread the fact
that EROS sustains 2.5x to 3x the write bandwidth of the best UNIX systems.
Given this, I think a strong case can be made that the overhead (and there *is*
a little bit) is coming out of headroom that simply doesn't exist in competing

>For the *exact same* reasons that automatic memory management with
>garbage collection is preferable to slinging your own buffers.

The reason that garbage collection isn't a universal panacea has to do with
performance.  For some applications garbage collection is slower than malloc,
for others it's faster, and for the great majority it doesn't matter.  There are
challenging real-time issues in garbage collection.  While that's important, we
shouldn't get distracted over it here.

These issues do not appear to apply to a well-implemented checkpoint system.  To
get to a place where checkpointing is a bad design, your system must stabilize
essentially no data.  If you stabilize even a fairly small amount of data over a
five minute period (low megabytes), you are better off checkpointing.  If you
stabilize a very large amount of data, the design has so much sustainable I/O
headroom relative to conventional systems that you are better off checkpointing.

So the only case where you *don't* win is the case where you have only a *tiny*
amount of data.  In this case, checkpointing is a lose, but it's so cheap that
you just shouldn't worry about it.  Also, it gives you the ability to kick the
plug out of the wall (literally) without losing everything, and this matters.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Phone: +1 914 784 7085  (Tieline: 863)
Fax: +1 914 784 7595

"James B. Lewis" <> (by way of "Mark S. Miller" <markm on
06/21/99 06:36:59 PM

To:   Jonathan S Shapiro/Watson/IBM@IBMUS
Subject:  Re: Some very thought-provoking ideas about OS architecture.

Email for Mark Miller that defaulted to me because there is no markm alias
at Foresight.
Jim Lewis, Foresight webmaster

Date: Sun, 20 Jun 1999 10:04:54 -0400
From: "Eric S. Raymond" <>
Subject: Re: Some very thought-provoking ideas about OS architecture.
Mime-Version: 1.0
Organization: Eric Conspiracy Secret Labs
X-Eric-Conspiracy: There is no conspiracy

> That depends on if you want to consistency check your object store after a
> crash. Unless you journal the object store - which btw is hard. If you have
> two thousand inter-related objects you need to dump the set of them
> consistently and in a snapshotted state.

I've been thinking about this since your last post. Seems to me the
primitive one needs is the ability to say "This object and all its
dependents need to be written atomically".  Not too hard to imagine
how to do that given you already have enough of a VM system to do
copy-on-write.  OK, you end up having to allocate in two different
spaces, one with atomicity constraints and one without. But it's
solvable.  (See below on why this doesn't mean you end up journaling

> Why is having persistence managed by a library that is playing guessing games
> of your intent a good idea ? It has to know about object relationships,
> potentially it has to blindly snapshot the entire system. It has to do a lot
> of work to know in detail what has changed.

For the *exact same* reasons that automatic memory management with
garbage collection is preferable to slinging your own buffers.  Perl
and Python and Tcl are on the rise because, outside the kernel, accepting
all that complexity and the potential for buffer overruns just doesn't
make any damn sense with clocks and memory as cheap as they are now.

Remember, the name of the game in OS design is really to optimize for
least complexity overhead for the *application programmer* and *user*.
If this means accepting a marginally more complex and less efficient
OS substructure (like the difference between a journaled object store
and a file system with explicit I/O) then that's fine.  But in fact I
think Shapiro makes strong arguments that an object store, done
properly, is *more* efficient.

> So all you have to do is export every object that this object refers to. Like
> the windowing environment, whoops oh dear.

Now you know it's not that bad in practice.  Not all object references are
pointers.  Some are capabilities and cookies that are persistent without
prearrangement.  That's especially likely to be true of OS services, and
especially if you design your API with that in mind.

> Suppose Eros was just a set of persistent object libraries that ran on
> top of numerous other platforms too, could be downloaded off the net and
> pretty well within the limits of the "programmer lazy, do more work than
> worked needed" paradigm.
> And that is demonstrably the right way up. If you put a "lazy programmer"
> system at the bottom of an environment you prevent the smart programmer doing
> smart things. If your bottom layer is fundamentally ignorant of programmer
> provided clues you cripple the smart.

If that's true, why is Perl a success?

That's not intended to be a snarky question.  Your argument here is
essentially the argument for malloc(3) as opposed to unlimited-extent
types and garbage collection.  And the answer is the same: there comes
a point where the value of the optimization you can do with hints no
longer pays for the complexity overhead of having to do the storage
management yourself.

The EROS papers implicitly argue that we've reached that point not
just in memory management but with respect to the entire persistence
problem.  I'm inclined to agree with them.

At the very least, it's something that I think we'd all be better off
doing a little forward thinking about.  As I said at the beginning of
the thread, I'm not after changing the whole architecture of Linux
right away; that would be silly and futile.  But this exchange will
have achieved my purposes if it only plants a few conceptual seeds.
          Eric S. Raymond

The right of self-defense is the first law of nature: in most
governments it has been the study of rulers to confine this right
within the narrowest limits possible.  Wherever standing armies
are kept up, and when the right of the people to keep and bear
arms is, under any color or pretext whatsoever, prohibited,
liberty, if not already annihilated, is on the brink of
     -- Henry St. George Tucker (in Blackstone's Commentaries)

James B. Lewis, Ph.D.          James B. Lewis Enterprises
7527 40th Avenue NE, Seattle, WA 98115-4925