This is the response to the allocation policy question, which can be found below.
First, I want to emphasize that when we talk about "objects" in connection with EROS, we are talking about pages, not about programming language units.
In EROS, what it means to be a short lived object is that the object is allocated from a space bank (the EROS storage allocator), used a bit, and returned to the space bank. On return, the space bank zeroes the object and efficiently rescinds (using a kernel mechanism) all access to that object.
There are several layers of relevant optimizations.
First, if the object existed only within the current checkpoint interval, it will never go to the disk.
Second, if the object is subsequently *reallocated* within the current checkpoint interval, the old one will neither go to the checkpoint log nor to the "home location."
Third, if an object in the checkpoint area has already been re-modified at the time the migrator gets to it, it doesn't get moved to the home location out of the log. The correctness argument here is non-obvious, but this is safe.
Finally, newly zeroed objects are recognized as an important case by the implementation -- in particular zeroed pages. A page full of zeros is never actually written to either the checkpoint log (the write-ahead log) or the disk. In the log case, there is a distinguished value in the log directory entry saying "this object is zero." In the "home location" objects are divided into clusters of roughly 1000 pages. Each cluster has a side page that contains per-object metadata, including a flags field that indicates (among other things) whether the object is zero.
In effect, then, there are four layers of "hotness": stuff that only lives in memory, stuff that never makes it out of the checkpoint area, stuff that never gets copied by the migrator, and stuff that is copied using the efficient zero case. Bear in mind that the checkpoint area covers a small number of colocated tracks, and that the disk head is within this range with high frequency; the usual seek cost models don't entirely apply.
EROS is a single level store. For every object in memory we know exactly where on the disk it came from. The location on the disk is determined at the time the object is allocated rather than at the time the object is written. The key to having good read performance later is to do your layout planning when you allocate rather than when you write. I've hinted in previous notes at how the space bank accomplished this. In conventional designs, the problem with this is that it messes up your write locality. In EROS, the combination of the append-only log plus the bulk-sorted migration largely recovers write locality. Because the checkpoint area provides a transacted log, there is no need to order writes to cover metadata, which reduces the number of writes by nearly 50% relative to ext2fs (last time I looked -- these things definitely change!) for small files, and consequently reduces the number of disk seeks associated with write ordering. Eliminating the disk seeks is a big win -- you might find that your BDD example worked okay.
Rather than cause fragmentation, the memory checkpoint actually tends to force hot things to be accessable within the checkpoint area. Also, it permits bulk motion later, as previously described. Much of the modified memory state that is appended to the checkpoint area never gets written back to home locations, because it is remodified before the migrator can move it. Your BDD example is a great case in point. It heavily modifies a huge pile of stuff. At checkpoint time the whole thing gets marked COW, and then gets written in sequential order to the checkpoint log. By the time the migrator goes to migrate it, it's all been modified again so there is no further work to be done.
The fact that the checkpoint log write is sequential matters quite a lot -- there are very few seeks associated with this I/O, and to first order all of the I/O cost is in the seeks. There is still a rotational delay, and this shouldn't be trivialized, but with write-back track caches on modern drives a lot of that can be hidden.
In fact, if we take your example of 100Mb of data, and use a typical modern IDE drive, where a reasonable average number of sectors/track is 120, or 60 Kbytes/track. We need to write 1667 tracks to get the data out to the disk. On average, we will see 1/2 rotation per track plus one head-to-head or track-to-track seek (the two have roughly the same latency for adjacent tracks on a modern drive). Using the figures for the Quantum Fireball 1280AT, which isn't the newest disk on the planet, we get 5.56ms per full rotation plus 3ms per track-to-track seek, so we can get your data out in roughly 9.7 seconds, which is actually only a 3.2% slowdown in your example, which is THE WORST CASE. On this drive, the bus will keep up, so this is a reasonably good prediction.
Frankly, for the amount of simplification and overall performance that it gets us, I'll take a 3.2% overhead on the corner cases.
All that having been said, I'm not convinced that I've been clear enough about why these I/O's aren't noise relative to the allocator. Perhaps it will help to emphasize that when you allocate a page from a space bank you are allocating a real disk page in a known location (known to the space bank, at least), not a probabilistic page for which it is dearly hoped that a location will be found successfully later (e.g. consider that many UNIX systems oversubscribe the swap area).
So finally, we get to read locality.
There are really two parts to this -- metadata and prefetching. Mostly, you want a decently efficient way to get the metadata in, and then you want a decent way to prefetch when things really go to the wall. The real question is can you prefetch the metadata effectively so as to get a force multiplier on the page prefetching.
The answer, briefly, is yes. The space manager that allocates this object effectively allocates the object in preorder. Every metadata page fetched effectively tells you where the next 224 data pages are going to be. If you are feeling aggressive, you can in principle get all 224 pages loaded in a single pass over the disk, limited by your working set bounds. The kernel doesn't presently do any automatic prefetching, though we think we know how to do that.
Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595
Iain McClatchie <email@example.com> on 06/23/99 08:10:19 PM
To: Jonathan S Shapiro/Watson/IBM@IBMUS
Subject: Questions about EROS
The discussion about EROS has been quite entertaining. I have three questions.
[First two questions deleted for brevity, see previous response]
A few months ago we had a wonderful debate on linux-kernel about the goodness of reiserfs, which is a new filesystem being developed by Hans Reiser. I took away from that debate a notion that any decently designed filesystem can find and read data already on the disk in a minimum of seeks. The number of seeks, and the overall performance of the filesystem is determined by their allocation policy: how they lay the data down on the disk.
The filesystem in a persistent object system has the same problem presented to it as the filesystem in a UNIX system, except (a) it is also expected to store many short-lived objects, and (b) the entire contents of memory are regularly swapped to disk. It seems that both (a) and, to a lesser extent, (b) would form a kind of noise would tend to swamp the signal that your filesystem allocator is looking for. If you think of the allocator as being something like the malloc() algorithm in a garbage collector, the programs in a UNIX system mostly seperate the long-lived objects from the short-lived objects before they get to the allocator.
Does EROS have some sort of system for implicitly or explicitly recognising long-lived versus short-lived objects?
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to firstname.lastname@example.org Please read the FAQ at http://www.tux.org/lkml/