> 1. All objects are one of two types (pages or nodes).
> 2. All nodes are the same size; all pages are the same size
"All inodes are the same size, all disk blocks are the same size"
> 3. The metadata update problem is eliminated by the checkpoint logic.
As with any file system
> issues in conventional file system designs. There is no magic here; the
> leverage is similar to that of log-structured file systems, but the design has
> better behavior than either the Ousterhout or the Seltzer designs.
Thats one of the big points I am trying to make. You can call it an object store or a file system. It's the same thing.
> As far as I know, there are only three ways to lose part of an object store:
> 3. Physical loss of media (e.g. you lose a drive). If you're not properly
> duplexed, you're hosed.
4. Wanting to take some of it home on a floppy with you and put it in another machine. Your model of multiple storage pools has some of this covered, but leaves dangling cross reference problems. In essence taking a floppy out is a great nasty wack of state.
> EROS does need to take some care when it comes up to ensure that all duplexes
> are consistent with each other. This is simple, and transparently done with no
> administrative intervention.
Yep. Well understood problem
> I don't know how microkernelish the EROS core is -- I don't claim that it's a
> microkernel (at least, not any more). Also, size comparisons are difficult
> because even the best C++ compilers introduce a lot of overhead. In retrospect,
> writing the kernel in C++ was a mistake.
True. I'd not allowed for the C++ mistake.
> certainly need to construct and benchmark the net stack, but there is no reason
> to believe that it will perform any worse than any other network stack. Ditto
> graphics. Actually, most of the other drivers you might expect are present.
THe network stack I will be very interested in seeing figures for. That is one of the biggest tests of real world throughput and latency - the hard bit being the amount of stuff occuring in parallel [rule of thumb - slower connections are harder than fast ones for the same volume of traffic in networking]
> For large files, the inode read (actually indirection block read) overhead isn't
> that big, because the ratio of indirection blocks to content blocks is low. For
> *growing* files it's another story, but to some extent you can make that into a
> single long transaction if you are a little careful about write ordering. Note
> that this is an area where Linux could be improved.
Definitely. There is a log structured around (dtfs), and Stephen is working on journalled fs stuff. We also have XFS coming. Right now we don't have a good log/journalled fs.
> sequential writes. It's about bulk block transfers. As you modify small
> objects, changes are written to a write-ahead log, which is where the disk head
> spends most of it's time. Empirically, you are rarely more than three tracks
> from where you want to be. This significantly reduces seek time. Also, the log
> is logically append-only, which means that few writes (even if they are to
> different objects) require a seek at all. In this sense, there is indeed a
> large sequential block write occuring.
That I would argue is a property of good file system design not a fundamental change in programming models of any kind.
> This idea could be adapted to Linux if we went to a virtual disk model, wherein
> disks hold logical block ranges and all file system operations are done on
> logical blocks. The same bulk write logic could then be applied between the
> buffer cache and the backing store.
The Linux VFS issues requests to device, block pairs. Neither the device nor the block need to be real. A classic example would be striped LVM where the disk and block numbers are both firmly in the imaginary regime but the block device side turns them into real I/O
> Part of the disconnect here, I think, is that Alan seems to be using "object" in
> the programming language sense rather than the operating system sense. From the
Object to me is "anything you can get a handle for". Programming objects, memory objects, filestore objects. Anything you can name in some way is an object.
> Agreed, but pages and nodes don't change size. There is a disconnect between
> what you are calling an object and what the EROS persistence layer calls an
> object. At the disk level, all objects are blocks, period. The issues you
> raise cease to be an object store problem if system-wide checkpointing of blocks
> is used.
They reduce to a multiblock atomic update problem. Which with a log is reasonably sane. Agreed.
> >2. You need to position objects/files for locality
> >3. You need fragmentation resistant algorithms
> In EROS, (2) and (3) are the responsibility of the space bank, which allocates
> objects in localized extents. By making suitable use of these banks,
> applications get extent-based allocation.
Is the user able to degrade the system by choosing to be intentionally dumb about space bank usage ? Is that degradation short or long term
> Ditto with regard to reading my next note, but how to you maintain any kind of
> security under your proposal?.
> Let's take a fundamentally more interesting metric: online VISA transaction
> processing, where the transactions rates are an order of magnitude higher and
> involve data mutation of small records rather than bulk transfer of news
> articles, which are write-once and update-never, and therefore pretty easy to do
> well on. Take a look at "Object Oriented Transaction Processing in the KeyKOS
The biggest problem of Usenet is not the bulk transfer, it is the amount of crap you have to keep a copy of locally indexed, refcounted and expiring.
> dumb programmers. To put it another way: a whole hoard of smart programmers are
> getting 30% to 40% disk write bandwidth utilization. EROS, taking the dumb
> approach, does much better. The question isn't whether smarter is better. The
> question is whether better is better.
The more application useful disk bandwidth achieved the better clearly.