Fw: Some very thought-provoking ideas about OS architecture.

shapj@us.ibm.com
Sun, 20 Jun 1999 16:54:29 -0400

Okay, I promised two notes, but it's going to be three.  These are responses to
specific things that Alan Cox has raised in his various messages.  Please note
that part of the disconnect concerns a misconception on Alan's part about what
an "object" is in the EROS context.

Before we proceed too far down a confusing discussion, might I trouble you to
read the brief Programmer's Introduction to EROS, which you can find at


A bunch of Alan's concerns and objections are valid, but until we agree on what
we mean by object there are too many kinds of misconceptions possible in the
exchange, and too many ways in which the outcomes can be misunderstood.

The bottom line, I suppose, is that the benchmark results contradict Alan's
contention that the EROS object model is too expensive.

>[Alan Cox writes:]
> This [checkpointing] is actually an old idea. The problem that has never
> been solved well is recovery from errors. You lose 1% of your object
> store. How do you tidy up. 20% of your object store dies in a disk crash,
> how do you run an fscobject tool. You can do it, but you end up back
> with file system complexity and all the other fs stuff.

These claims are accurate, but they cannot be addressed without taking specific
examples.  Eric has gone a little far in saying that there is no file system in
EROS.  There is a global object store, and it does indeed have many of the
properties of a file system.  Its structure is greatly simplified by three
facts:

     1. All objects are one of two types (pages or nodes).
     2. All nodes are the same size; all pages are the same size.
     3. The metadata update problem is eliminated by the checkpoint logic.

The last is quite significant, as it largely eliminates write-after-write
hazards (write ordering for FS metadata), which are a major source of performance
issues in conventional file system designs.  There is no magic here; the
leverage is similar to that of log-structured file systems, but the design has
better behavior than either the Ousterhout or the Seltzer designs.
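The two fixed-size object types can be captured in a minimal C sketch. The
sizes, slot count, and field names below are assumptions for illustration, not
the actual EROS definitions:

```c
#include <stdint.h>

/* Illustrative sketch only: sizes, slot count, and field names are
 * assumptions, not the real EROS definitions. */

#define PAGE_SIZE  4096  /* every page is the same size */
#define NODE_SLOTS 32    /* every node holds a fixed number of capabilities */

typedef struct capability {
    uint32_t type;       /* what the capability authorizes */
    uint64_t oid;        /* the object it names */
} capability_t;

typedef struct page {
    uint8_t data[PAGE_SIZE];
} page_t;

typedef struct node {
    capability_t slot[NODE_SLOTS];
} node_t;
```

Because both object types are fixed-size, the store never deals with
variable-length allocation or resizing at the disk level, which is much of why
its structure stays simple.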

With that as preamble, let me take your points in turn.

First, there are the same issues with metadata preservation that arise in
conventional file systems.  In EROS, the resolution is to duplex the node
structures (the ones that hold capabilities), which is about 13% of the store.
There is a small amount of critical data that also needs to be duplexed (e.g.
the system storage allocator).  Beyond that it's up to you: how critical is your
data and how expensive is the disk drive?

Duplexing is built into EROS.  RAID would be perfectly acceptable as an
alternative.

As far as I know, there are only three ways to lose part of an object store:

1. Sector-level failure.  No file system will address this problem in the
absence of duplexing, and in the presence of duplexing it's not an issue.  Note
that EROS actually checks the write result, which is more than most systems do.
2. Inconsistent writes.  This arises from write-after-write hazards, as
mentioned above, which don't occur in a log-structured design.
3. Physical loss of media (e.g. you lose a drive).  If you're not properly
duplexed, you're hosed.

Note, however, that *none* of these are comparable in complexity to the usual
file system cases.  Having eliminated (2), and resolved (or not) (1) by
duplexing, all of the remaining cases correspond to physical loss of data.  No
"recovery" is possible in these cases, and so the usual complexities of fsck do
not apply.

EROS does need to take some care when it comes up to ensure that all duplexes
are consistent with each other.  This is simple, and transparently done with no
administrative intervention.
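That startup pass might look like the following sketch: each duplexed copy
carries a version stamp and a checksum, and the boot path simply prefers the
newest intact copy. The field names and the toy checksum are my assumptions,
not the actual EROS code:

```c
#include <stdint.h>

/* Sketch: pick which copy of a duplexed structure to trust on restart.
 * All names and the checksum scheme are illustrative assumptions. */

typedef struct duplex_hdr {
    uint64_t generation;   /* monotonically increasing write count */
    uint32_t checksum;     /* detects a torn or failed write */
} duplex_hdr_t;

static uint32_t compute_checksum(const duplex_hdr_t *h) {
    /* toy checksum over the generation field alone */
    return (uint32_t)(h->generation ^ (h->generation >> 32)) + 1;
}

static int copy_valid(const duplex_hdr_t *h) {
    return h->checksum == compute_checksum(h);
}

/* Returns 0 or 1 for the copy to use, or -1 if both are damaged. */
int pick_duplex(const duplex_hdr_t *a, const duplex_hdr_t *b) {
    int va = copy_valid(a), vb = copy_valid(b);
    if (va && vb) return (a->generation >= b->generation) ? 0 : 1;
    if (va) return 0;
    if (vb) return 1;
    return -1;
}
```

Note that no repair of content is attempted; the logic only chooses among
intact copies, which is why no fsck-like machinery is needed.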

>> [Alan Cox:] Another peril is that external interfaces don't always like
>> replay of events.
>[Eric Raymond:] A much more serious objection, I agree.

There is a more fundamental issue: replaying events to external interfaces
violates security, because the connections need to be reauthenticated.  EROS
drivers are part of the kernel, which is outside of the checkpoint contract, so
the issue you are getting at doesn't really apply where devices are concerned.
Low-level code (e.g. the net stack) must be written with awareness of the
checkpoint mechanism.  The operating system explicitly rescinds all external
connections on restart in order to ensure that reauthentication occurs.

>>[Alan:]  Moving just some objects between systems is fun too. You then get
>> into cluster checkpointing, which is a field that requires you wear a pointy
>> hat, have a beard and work for SGI or Digital.

Or read one of my early Penn TR's, which describes how to build efficient,
fault-tolerant clusters.  Actually, you've conflated two cases.  One is running
a single system image across a cluster.  That's the SGI/Digital/Snocrash (last
is my TR) problem.  The other is how to move a document from one place to
another.

Transparent checkpointing does not eliminate the need for serialization. There
remains a need for this for document interchange.  Once you have a serialized
capture of the state you want to move, a pointy hat becomes optional. :-)

> [Alan] Their numbers are for a microkernelish core. They are still very
> good, but that includes basically no drivers, no network stack, no
> graphics and apparently no real checkpoint/restart in the face of
> corruption. I may be wrong on the last item.

I don't know how microkernelish the EROS core is -- I don't claim that it's a
microkernel (at least, not any more).  Also, size comparisons are difficult
because even the best C++ compilers introduce a lot of overhead.  In retrospect,
writing the kernel in C++ was a mistake.

The absence of drivers and network stack has no bearing on the benchmarks
published -- such code would not be in the benchmark path in any case.   We
certainly need to construct and benchmark the net stack, but there is no reason
to believe that it will perform any worse than any other network stack.  Ditto
graphics.  Actually, most of the other drivers you might expect are present.

The checkpoint situation is rather embarrassing.  It was working, and I tweaked
something. The current system writes the checkpoint correctly, but the logic
that checks the consistency of the checkpoint image on restart has an arithmetic
error that precludes successful restart in certain cases.  Unfortunately, my
dispensation from IBM to work on this code timed out before I could find and fix
the problem.

>> [Alan] The nature of I/O is no different. If you always do large
>> sequential block writes tell me how it will outperform a
>> conventional OS if only a small number of changes in a
>> small number of objects occur.
>[Eric] No seeks to read inodes, because the map from EROS's
> virtual-space blocks to disk blocks is almost trivial (essentially
> the disks get treated like a honkin' big series of swap volumes).
> So the disk access pattern would be quite different, I think.

Eric hasn't read the code :-)

For large files, the inode read (actually indirection block read) overhead isn't
that big, because the ratio of indirection blocks to content blocks is low.  For
*growing* files it's another story, but to some extent you can make that into a
single long transaction if you are a little careful about write ordering.  Note
that this is an area where Linux could be improved.

The real win, and the answer to Alan's question, is that it's not about large
sequential writes.  It's about bulk block transfers. As you modify small
objects, changes are written to a write-ahead log, which is where the disk head
spends most of its time.  Empirically, you are rarely more than three tracks
from where you want to be.  This significantly reduces seek time.  Also, the log
is logically append-only, which means that few writes (even if they are to
different objects) require a seek at all.  In this sense, there is indeed a
large sequential block write occurring.

Eventually, the log fills up, a checkpoint transaction is declared, and a bulk
transfer of the data to its home locations occurs.  What Eric describes as a
sequential block write isn't really what happens here.  What really happens is
that you reach into the log, pull out an overflowing handful of blocks to be
moved to their home location, and then transfer them in a single arm pass over
the disk.  In practice, many of the mutates are clustered, so you end up doing
seeks only between clusters.  Essentially it's a sorted bulk transfer.  If the
machine fails during transfer you are okay because a complete snapshot exists in
the checkpoint log.
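The log-to-home migration described above amounts to a sort followed by a
single sweep of the arm. Here is a minimal sketch under assumed names (the
entry layout and the `migrate` interface are illustrative, not the EROS code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the sorted bulk transfer: entries accumulated in the
 * write-ahead log are sorted by home location, so migration becomes
 * one monotonic arm pass over the disk. Names are assumptions. */

typedef struct log_entry {
    uint64_t home_block;   /* final location of this object */
    uint64_t log_block;    /* where it currently sits in the log */
} log_entry_t;

static int by_home(const void *pa, const void *pb) {
    const log_entry_t *a = pa, *b = pb;
    return (a->home_block > b->home_block) - (a->home_block < b->home_block);
}

void migrate(log_entry_t *entries, size_t n) {
    /* One sort, then a single sweep; seeks occur only between clusters
     * of adjacent home blocks. */
    qsort(entries, n, sizeof *entries, by_home);
    /* for each entry in order: copy log_block -> home_block ... */
}
```

A crash mid-migration is harmless precisely because the checkpoint log still
holds a complete snapshot, so the sweep can simply be rerun.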

Note that this sort of bulk sorting is not possible if you have to be careful
about metadata update ordering while writing to files.  The absence of such
correct ordering is why fsck is so complex in conventional file systems.

This idea could be adapted to Linux if we went to a virtual disk model, wherein
disks hold logical block ranges and all file system operations are done on
logical blocks.  The same bulk write logic could then be applied between the
buffer cache and the backing store.
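A sketch of that virtual-disk layer, under assumed names (the extent table and
`vdisk_map` function are hypothetical, not existing Linux interfaces): file
systems would address logical blocks, and a thin translation table would map
logical ranges onto physical disks.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical logical-to-physical block translation for the virtual
 * disk model described above. All names are illustrative assumptions. */

typedef struct vdisk_extent {
    uint64_t logical_start;   /* first logical block of the range */
    uint64_t length;          /* blocks in the range */
    int      phys_disk;       /* physical disk holding the range */
    uint64_t phys_start;      /* first physical block on that disk */
} vdisk_extent_t;

/* Translate a logical block to (disk, physical block); -1 if unmapped. */
int vdisk_map(const vdisk_extent_t *tab, size_t n, uint64_t lblock,
              int *disk, uint64_t *pblock)
{
    for (size_t i = 0; i < n; i++) {
        if (lblock >= tab[i].logical_start &&
            lblock < tab[i].logical_start + tab[i].length) {
            *disk = tab[i].phys_disk;
            *pblock = tab[i].phys_start + (lblock - tab[i].logical_start);
            return 0;
        }
    }
    return -1;
}
```

With such a layer in place, the sorted bulk write could be performed on
logical blocks without the file system needing to know the physical layout.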

Part of the disconnect here, I think, is that Alan seems to be using "object" in
the programming language sense rather than the operating system sense.  From the
EROS perspective, the "objects" are pages and nodes.  The addition of
checkpointing to the persistence model has consequences that Alan perhaps hasn't
fully considered, as it causes most object updates to be purely in-memory with
no directly associated writes.

> [Alan] The mapping of objects to disks is complex the
> moment you allow an object to expand or want to
> reclaim space.

Agreed, but pages and nodes don't change size.  There is a disconnect between
what you are calling an object and what the EROS persistence layer calls an
object.  At the disk level, all objects are blocks, period.  The issues you
raise cease to be an object store problem if system-wide checkpointing of blocks
is used.

>1.  You need an indexing object - in a filesystem it is a directory or
>    directory tree
>2.  You need to position objects/files for locality
>3.  You need fragmentation resistant algorithms

In EROS, (2) and (3) are the responsibility of the space bank, which allocates
objects in localized extents.  By making suitable use of these banks,
applications get extent-based allocation.
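In the simplest case a space bank is just a bump allocator over a contiguous
extent, so objects allocated through the same bank land near each other on
disk. This is a toy sketch; the structure and names are my assumptions, not
the EROS space bank interface:

```c
#include <stdint.h>

/* Toy sketch of extent-based allocation by a "space bank": each bank
 * hands out consecutive frames from one contiguous disk extent, so
 * objects allocated together are placed together. Names are assumed. */

typedef struct space_bank {
    uint64_t extent_start;  /* first frame of the bank's extent */
    uint64_t extent_len;    /* number of frames in the extent */
    uint64_t next_free;     /* bump pointer within the extent */
} space_bank_t;

/* Returns the allocated frame number, or UINT64_MAX if exhausted. */
uint64_t bank_alloc(space_bank_t *b) {
    if (b->next_free >= b->extent_len)
        return UINT64_MAX;
    return b->extent_start + b->next_free++;
}
```

An application wanting locality for a group of objects simply allocates them
all from one bank; fragmentation resistance falls out of keeping extents
contiguous.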

By "indexing object", what you really mean is that you need metadata --
indirection blocks, if you prefer.  EROS simply makes the file metadata and the
memory address space metadata into the same object, and lets the user build it
while building the file.  Since the checkpoint mechanism guarantees a consistent
snapshot, there is no write ordering issue.

>Why is having persistence managed by a library that is playing guessing games
>of your intent a good idea ? It has to know about object relationships,
>potentially it has to blindly snapshot the entire system. It has to do a lot
>of work to know in detail what has changed.

The short answer is that snapshotting the entire system is significantly more
efficient than doing it by hand. You are failing to consider interprocess
consistency. The long answer is the subject of my next email.

> Suppose Eros was just a set of persistent object libraries that ran on
> top of numerous other platforms too

Ditto with regard to reading my next note, but how do you maintain any kind of
security under your proposal?

> I don't think so. Nobody has ever demonstrated a working object store based
> OS handling a real evil job load. When Shapiro can run innd under a full load
> faster than a conventional OS can, and can network share the article objects
> to 16 front end servers, then I'll be more convinced.

Actually, Alan just made my case.  The only problem is that he proposes an
uninteresting metric.

Let's take a fundamentally more interesting metric: online VISA transaction
processing, where the transactions rates are an order of magnitude higher and
involve data mutation of small records rather than bulk transfer of news
articles, which are write-once and update-never, and therefore pretty easy to do
well on.  Take a look at "Object Oriented Transaction Processing in the KeyKOS
Microkernel", in particular section 8.1.


Now in fairness I have to point out that this was a KeyKOS benchmark.  While we
haven't got that code running on EROS yet, EROS is significantly faster than
KeyKOS was.

>The "conventional" model of memory allocation requires address space and page
>allocation services. page protection services help. The object based model
>needs address and page allocation services and really wants page protection.
>So they share a common underlying set of needs. Should the OS provide the
>conventional model, the object model, or the underlying services both need? The
>answer is obvious, and it's precisely what Unix does supply.

This is the "object" ambiguity again. Since EROS objects are pages, EROS
basically unifies all of what Alan is after into a single layer, thus delivering
better performance.

Also, this unification largely eliminates his objections on the grounds of smart
vs. dumb programmers.  To put it another way: a whole horde of smart programmers
getting 30% to 40% disk write bandwidth utilization.  EROS, taking the dumb
approach, does much better.  The question isn't whether smarter is better. The
question is whether better is better.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085  (Tieline: 863)
Fax: +1 914 784 7595