Okay, I promised two notes, but it's going to be three. These are responses to specific things that Alan Cox has raised in his various messages. Please note that part of the disconnect concerns a misconception on Alan's part about what an "object" is in the EROS context.
Before we proceed too far down a confusing discussion, might I trouble you to read the brief Programmer's Introduction to EROS, which you can find at
http://www.eros-os.org/devel/intro/ProgrmrIntro.html
A bunch of Alan's concerns and objections are valid, but until we agree on what we mean by object there are too many kinds of misconceptions possible in the exchange, and too many ways in which the outcomes can be misunderstood.
The bottom line, I suppose, is that the benchmark results contradict Alan's contention that the EROS object model is too expensive.
>[Alan Cox writes:]
>
> This [checkpointing] is actually an old idea. The problem that has never
> been solved well is recovery from errors. You lose 1% of your object
> store. How do you tidy up. 20% of your object store dies in a disk crash,
> how do you run an fscobject tool. You can do it, but you end up back
> with file system complexity and all the other fs stuff.
These claims are accurate, but they cannot be addressed without taking specific examples. Eric has gone a little far in saying that there is no file system in EROS. There is a global object store, and it does indeed have many of the properties of a file system. It's structure is greatly simplified by three things:
The last is quite significant, as it largely eliminates write after write hazards (write ordering for FS metadata), which is a major source of performance issues in conventional file system designs. There is no magic here; the leverage is similar to that of log-structured file systems, but the design has better behavior than either the Ousterhout or the Seltzer designs.
With that as preamble, let me take your points in turn.
First, there are the same issues with metadata preservation that arise in conventional file systems. In EROS, the resolution is to duplex the node structures (the ones that hold capabilities), which is about 13% of the store. There is a small amount of critical data that also needs to be duplexed (e.g. the system storage allocator). Beyond that it's up to you: how critical is your data and how expensive is the disk drive?
Duplexing is built in to EROS. RAID would be perfectly acceptable as an alternative.
As far as I know, there are only three ways to lose part of an object store:
Note, however, that *none* of these are comparable in complexity to the usual file system cases. Having eliminated (2), and resolved (or not) (1) by duplexing, all of the remaining cases correspond to physical loss of data. No "recovery" is possible in these cases, and so the usual complexities of fsck do not apply.
>> [Alan Cox:] Another peril is that external interfaces don't always like
replay of events.
>
>[Eric Raymond:] A much more serious objection, I agree.
There is a more fundamental issue: replaying events to external interfaces violates security, because the connections need to be reauthenticated. EROS drivers are part of the kernel, which is outside of the checkpoint contract, so the issue you are getting at doesn't really apply where devices are concerned. Low-level code (e.g. the net stack) must be written with awareness of the checkpoint mechanism. The operating system explicitly rescinds all external connections on restart in order to ensure that reauthentication occurs correctly.
>>[Alan:] Moving just some objects between systems is fun too. You then get
>> into cluster checkpointing, which is a field that requires you wear a pointy
>> hat, have a beard and work for SGI or Digital.
Or read one of my early Penn TR's, which describes how to build efficient, fault-tolerant clusters. Actually, you've conflated two cases. One is running a single system image across a cluster. That's the SGI/Digital/Snocrash (last is my TR) problem. The other is how to move a document from one place to another.
Transparent checkpointing does not eliminate the need for serialization. There remains a need for this for document interchange. Once you have a serialized capture of the state you want to move, a pointy hat becomes optional. :-)
> [Alan] Their numbers are for a microkernelish core. They are still very
> good, but that includes basically no drivers, no network stack, no
> graphics and apparently no real checkpoint/restart in the face of
> corruption. I may be wrong on the last item.
The absence of drivers and network stack has no bearing on the benchmarks published -- such code would not be in the benchmark path in any case. We certainly need to construct and benchmark the net stack, but there is no reason to believe that it will perform any worse than any other network stack. Ditto graphics. Actually, most of the other drivers you might expect are present.
The checkpoint situation is rather embarassing. It was working, and I tweaked something. The current system writes the checkpoint correctly, but the logic that checks the consistency of the checkpoint image on restart has an arithmetic error that precludes successful restart in certain cases. Unfortunately, my dispensation from IBM to work on this code timed out before I could find and fix the problem.
>> [Alan] That nature of I/O is no different. If you always do large
>> sequential block writes tell me how it will outperform a
>> conventional OS if only a small number of changes in a
>> small number of objects occur.
>
>[Eric] No seeks to read inodes, because the map from EROS's
> virtual-space blocks to disk blocks is almost trivial (essentially
> the disks get treated like a honkin' big series of swap volumes).
> So the disk access pattern would be quite different, I think.
Eric hasn't read the code :-)
The real win, and the answer to Alan's question, is that it's not about large sequential writes. It's about bulk block transfers. As you modify small objects, changes are written to a write-ahead log, which is where the disk head spends most of it's time. Empirically, you are rarely more than three tracks from where you want to be. This significantly reduces seek time. Also, the log is logically append-only, which means that few writes (even if they are to different objects) require a seek at all. In this sense, there is indeed a large sequential block write occuring.
Eventually, the log fills up, a checkpoint transaction is declared, and a bulk transfer of the data to it's home locations occurs. What Eric describes as a sequential block write isn't really what happens here. What really happens is that you reach into the log, pull out an overflowing handful of blocks to be moved to their home location, and then transfer them in a single arm pass over the disk. In practice, many of the mutates are clustered, so you end up doing seeks only between clusters. Essentially it's a sorted bulk transfer. If the machine fails during transfer you are okay because a complete snapshot exists in the checkpoint log.
Note that this sort of bulk sorting is not possible if you have to be careful about metadata update ordering while writing to files. The absence of such correct ordering is why fsck is so complex in conventional file systems.
Part of the disconnect here, I think, is that Alan seems to be using "object" in the programming language sense rather than the operating system sense. From the EROS perspective, the "objects" are pages and nodes. The addition of checkpointing to the persistence model has consequences that Alan perhaps hasn't fully considered, as it causes most object updates to be purely in-memory with no directly associated writes.
> [Alan] The mapping of objects to disks is complex the
> moment you allow an object to expand or want to
> reclaim space.
>1. You need an indexing object - in a fileystem it is a directory or
In EROS, (2) and (3) are the responsibility of the space bank, which allocates
objects in localized extents. By making suitable use of these banks,
applications get extent-based allocation.
> directory tree
>2. You need to position objects/files for locality
>3. You need fragmentation resistant algorithms
By "indexing object", what you really mean is that you need metadata -- indirection blocks, if you prefer. EROS simply makes the file metadata and the memory address space metadata into the same object, and let's the user build it while building the file. Since the checkpoint mechanism guarantees a consistent snapshot, there is no write ordering issue.
.[Alan]
>Why is having persistence managed by a library that is playing guessing games
>of your intent a good idea ? It has to know about object relationships,
>potentially it has to blindly snapshot the entire system. It has to do a lot
>of work to know in detail what has changed.
The short answer is that snapshotting the entire system is significantly more efficient than doing it by hand. You are failing to consider interprocess consistency. The long answer is the subject of my next email.
> Suppose Eros was just a set of persistent object libraries that ran on
> top of numerous other platforms too
> I don't think so. Nobody has ever demonstrated a working object store based
> OS handling a real evil job load. When Shapiro can run innd under a full load
> faster than a conventional OS can, and can network share the article objects
> to 16 front end servers, then I'll be more convinced.
Actually, Alan just made my case. The only problem is that he proposes an uninteresting metric.
Let's take a fundamentally more interesting metric: online VISA transaction processing, where the transactions rates are an order of magnitude higher and involve data mutation of small records rather than bulk transfer of news articles, which are write-once and update-never, and therefore pretty easy to do well on. Take a look at "Object Oriented Transaction Processing in the KeyKOS Microkernel:", in particular section 8.1
http://www.cis.upenn.edu/~KeyKOS/KeyTXF/KeyTXF.html
Now in fairness I have to point out that this was a KeyKOS benchmark. While we haven't got that code running on EROS yet, EROS is significantly faster than KeyKOS was.
>The "conventional" model of memory allocation requires address space and page
>allocation services. page protection services help. The object based model
>needs address and page allocation services and really wants page protection
>services.
>
>So they share a common underlying set of needs. Should the OS provide
>conventional model, object model, or the underlying service both needs. The
>answer is obvious and its precisely want unix does supply.
This is the "object" ambiguity again. Since EROS objects are pages, EROS basically unifies all of what Alan is after into a single layer, thus delivering better performance.
Also, this unification largerly eliminates his objections on ground of smart vs. dumb programmers. To put it another way: a whole hoard of smart programmers are getting 30% to 40% disk write bandwidth utilization. EROS, taking the dumb approach, does much better. The question isn't whether smarter is better. The question is whether better is better.
Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595