[cap-talk] file I/O request overhead (was: kernel object knowledge)

Jonathan S. Shapiro shap at eros-os.com
Tue Jun 12 03:14:47 EDT 2007


On Mon, 2007-06-11 at 19:51 -0700, Jed Donnelley wrote:
> At 06:22 AM 6/11/2007, Jonathan S. Shapiro wrote:
> >On current-generation Pentiums, it is not uncommon for a trivial domain
> >switch to require 4000 cycles! There are multiple causes for this:
> 
> I assume hardware domain (memory map) switch here.

Yes. A "trivial" switch involves an address space reload, but all payload
is registerizable (thus no cross-space string copy).

> >Hmm. Yes. Orthogonal persistence helps here by hiding the extra I/O
> >threads in many cases.
> 
> As you say, hmmmm.  I'd be interested to know how this "orthogonal
> persistence" works...

I gave a link to the paper in an earlier email, but here are the two
papers to read:

Landau's "Checkpoint Mechanism in KeyKOS":
  http://www.cis.upenn.edu/~KeyKOS/Checkpoint.html
My own "Design Evolution of the EROS Single-Level Store":
  http://www.eros-os.org/papers/storedesign2002.pdf

Oh. I then saw below that you had seen these. Charlie's paper is an
interesting read too, and looking at it first will give a sense of the
evolution history.

>   If 'we' (speaking hopefully) could focus on
> the seemingly simple file I/O case (sorry if I seem to be selfish
> in this regard - it does seem to me the most important case and
> I understand it thoroughly in a previous context) I think the issues
> might be pretty clear.

Unfortunately this is hard to do on an "apples for apples" basis,
because in the idiom of a persistent system one does not use files.
Indeed, one advantage of persistence is that one is not forced to make
data copies at the file/address-space boundary. Often one can simply
memory map and share the state.

While modern UNIX systems implement unified buffer/memory caches,
programs are amazingly stupid about I/O sizes and alignment. One reason
that stdio looks like a reasonable implementation is simply that it gets
its block I/O placement decisions right. The stdio library otherwise has
quite a high overhead.

> When a request comes in for file I/O, the software that handles it
> (what we termed the file server) has to translate the request on
> the file object into a request for a transfer from either disk
> or disk cache...

In EROS/KeyKOS, the closest analogous case is that a memory reference is
made to a page that is not in memory. The object ID (OID) of the page is
provided to the storage manager, which fetches it. In practice, the OID
is a direct encoding of the disk location (rather: it falls within some
OID range known to the storage manager, and that entire object range is
stored sequentially on the disk). In later versions, that range would
have been an extent.
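
A rough sketch of that lookup, purely for illustration. The names
(OidRange, RangeTable, locate) and the figure of 8 sectors per object
are assumptions made up for the example; this is not the EROS storage
manager's actual data structure.

```python
import bisect
from dataclasses import dataclass

SECTORS_PER_OBJECT = 8  # assumption: 4 KB objects on 512-byte sectors


@dataclass(frozen=True)
class OidRange:
    first_oid: int     # first OID in the range (inclusive)
    count: int         # number of objects stored in the range
    start_sector: int  # disk sector where the range begins


class RangeTable:
    """Maps an OID to a disk sector via the range that contains it."""

    def __init__(self, ranges):
        self._ranges = sorted(ranges, key=lambda r: r.first_oid)
        self._starts = [r.first_oid for r in self._ranges]

    def locate(self, oid):
        # Find the last range whose first_oid is <= oid.
        i = bisect.bisect_right(self._starts, oid) - 1
        if i < 0:
            raise KeyError(oid)
        r = self._ranges[i]
        offset = oid - r.first_oid
        if offset >= r.count:
            raise KeyError(oid)
        # Objects in a range are laid out sequentially on the disk,
        # so the location is simple arithmetic with no indirect block.
        return r.start_sector + offset * SECTORS_PER_OBJECT


table = RangeTable([
    OidRange(first_oid=0x1000, count=256, start_sector=2048),
    OidRange(first_oid=0x8000, count=1024, start_sector=65536),
])
sector = table.locate(0x1003)  # 2048 + 3 * 8 = 2072
```

The point of the sketch is that the translation is a table search plus
a multiply, with no disk access needed to find the object.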

  Comment: in a persistent system, processes do not die unless (a)
  they exit voluntarily, or (b) you forcibly reclaim their storage.
  This means that many "fileish" idioms from conventional systems
  turn into "paging" idioms.

  The impact of this simple change is very wide, and it violates
  many assumptions about system behavior. It still trips *me* to
  switch hats, and I've been working with persistence for 15 years
  now. I mean, heck, I wasn't even half-bald when I started doing
  this stuff. :-)

So there is no "file system", and there is nothing that obviously
corresponds to file indirect blocks (which are the major source of
induced seek latency in file I/O). The "structure" of files in our
systems is built out of Nodes or GPTs. These implement roughly the same
sort of hierarchy as file indirection blocks, but they are first-class
objects.

We don't eat the same seek latencies on Nodes/GPTs because (1) we do a
better job localizing them for read than UNIX historically did [this is
probably no longer true - UNIX has improved], and (2) we do batch-sorted
transactional I/O, so there are no ordering requirements on
data/metadata updates. UNIX has ordering issues here in order to ensure
that fsck can be run successfully later. The ordering dependencies
interact with the disk arm seek policy in ways that tend to multiply
latencies. The killer is that you want those blocks co-located for read
purposes, so the disk head really is going to visit the same tracks
twice. Add read-after-write to confirm the I/O and you have 4x the
naively expected latency!
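
A toy illustration of the effect of ordering constraints on arm
movement. The track numbers are invented, and the model counts only arm
travel (no rotational delay, no read-after-write pass); it is not
derived from the KeyKOS or EROS checkpoint code.

```python
def head_travel(tracks, start=0):
    """Total distance the disk arm moves visiting tracks in the given order."""
    travel, pos = 0, start
    for t in tracks:
        travel += abs(t - pos)
        pos = t
    return travel


# Hypothetical pending writes, listed in the order the updates were made.
# Data and metadata for each update live far apart on the disk.
pending = [900, 10, 905, 12, 450, 11]

# Ordering-constrained: writes must reach the disk in the order issued.
in_order = head_travel(pending)

# Batch-sorted: no ordering requirement, so sort by track and sweep once.
batched = head_travel(sorted(pending))
```

With these made-up numbers the ordered sequence moves the arm 4455
tracks, while the sorted batch moves it 905.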

Much of the KeyKOS/EROS IO performance advantage has disappeared as
log-structured file systems have come into wide use. I think we still
have advantages in minor ways, but they *are* minor. The only remaining
major win is that our scheme transitions gracefully from temporal
locality driven placement to spatial locality driven placement in the
store. Among the log-structured systems, only EXT3 has a good story for
this.

> If the data needs to come from rotating
> disk, then it would seem that at that point the software managing
> the request needs to block while waiting for that disk I/O to
> complete with considerable saved state (all the information about
> the request, table pointers, disk pointers, etc., etc.).

In our case the saved state is minimal (the OID and object type that was
requested by the given disk IO operation, plus a pointer to the object's
memory frame). The interface at the disk hardware layer is inherently
asynchronous, so the disk driver doesn't block in the way that you seem
to describe.
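
To make the shape of this concrete, here is a minimal sketch of a
request record plus a completion callback. Everything in it
(PendingRead, ToyDriver, the field names) is hypothetical and stands in
for the real driver interface, which it does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class PendingRead:
    oid: int          # object being fetched
    obj_type: str     # e.g. "page" or "node"
    frame: bytearray  # memory frame the data will land in


class ToyDriver:
    """Stand-in for an asynchronous disk driver (illustrative only)."""

    def __init__(self):
        self._queue = []  # requests submitted but not yet completed

    def submit(self, req, on_done):
        # Returns immediately; no thread blocks waiting for the disk.
        self._queue.append((req, on_done))

    def complete_next(self, data):
        # Plays the role of the disk completion interrupt.
        req, on_done = self._queue.pop(0)
        req.frame[:len(data)] = data
        on_done(req)


ready = []  # objects whose I/O has finished
driver = ToyDriver()
frame = bytearray(4096)
driver.submit(PendingRead(0x1003, "page", frame), ready.append)
driver.complete_next(b"hello")
```

The only state kept across the I/O is the small record in the queue;
there is no stack or register set parked behind it.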

I should shortly be able to point you at a browsable Coyotos source tree
that includes the storage manager upcall so you can see the entire
chain.

> We chose to have that state saved as the context of the thread
> on the stack.

Register sets have grown. A thread context on a modern machine is
~2KBytes or more. The context information for the actual IO is ~64
bytes. Your implementation is very easy to understand, but it doesn't
seem likely to scale well. The thread context size is very close to the
average client request size (typically one page => 4Kbytes on most
systems).
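
The arithmetic, spelled out. The sizes are the rough figures quoted
above; the count of 1000 outstanding requests is an arbitrary number
picked for the example.

```python
THREAD_CONTEXT_BYTES = 2 * 1024  # "~2KBytes or more" per blocked thread
IO_CONTEXT_BYTES = 64            # "~64 bytes" per outstanding I/O
PAGE_BYTES = 4 * 1024            # typical client request: one page


def saved_state(outstanding, per_request_bytes):
    """Bytes of state held while `outstanding` requests wait on the disk."""
    return outstanding * per_request_bytes


outstanding = 1000  # hypothetical number of in-flight requests

per_thread = saved_state(outstanding, THREAD_CONTEXT_BYTES)  # 2,048,000 bytes
per_record = saved_state(outstanding, IO_CONTEXT_BYTES)      # 64,000 bytes

ratio = THREAD_CONTEXT_BYTES // IO_CONTEXT_BYTES             # 32x
context_vs_payload = THREAD_CONTEXT_BYTES / PAGE_BYTES       # 0.5
```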

> >Norm and I also considered a software-implemented
> >"probe" instruction that could be used to direct page fetch-ahead. I
> >haven't gotten around to implementing that.
> 
>  From the above it sounds to me like we were faced with quite different
> choices.  I expect this is likely the result of our dealing with
> a rather limited case of a hardware architecture with no virtual
> memory (base and bounds only), so our I/O was strictly for disk
> I/O.  A "probe" for a page ahead would be no use in our context.

Perhaps. I suspect that the difference is more pervasive. The decision
to use or not use persistence is a fairly foundational decision in the
architecture.

That said, it's probably not possible to do OS-based persistence without
a paging system (though, hmm, software-enforced memory range checking
probably does not actually have more overhead than the hardware variety.
See, e.g., the sandboxing work that Chris Small did for the Vino system,
which has subsequently been improved on).

> However, even in your case I would expect that in the usual situation
> would reduce the essentially the same case - namely either transferring
> data from disk cache or transferring data from disk.  I'd be interested
> to hear if not.  In the former case then any "system overhead" is
> particularly noticeable.  In the later case then it seems to me some
> state must be saved while the disk I/O is completed.

For reads, what you say seems very likely. In our case, the disk cache
is main memory, so no domain overheads really apply. For writes, we are
doing bulk block-sorted I/O that is not directly being driven by any
particular writing domain.

The wildcard is the on-controller disk cache. On modern commodity drives
this is 8 MB, and it is surprisingly hard to model (see the work by Wilkes
on disk drive modeling). One hidden advantage to bulk write is that you
tend to overwhelm the disk cache during write, with the effect that its
behavior becomes predictable. It occurs to me that I should try the
experiment of disabling the on-disk cache and see what the impact is.

> > > Why would a monolithic kernel
> > > system try to do internal "resource isolation".  The whole
> > > premise of a monolithic kernel is that all it's code is
> > > fully trusted.
> >
> >There are sources of failure other than trust failure. Recoverability is
> >a fine reason for resource isolation.
> 
> I don't understand your use of the term "recoverability" here.
> Can you explain how such "resource isolation" can aid recoverability?

Consider a TCP/IP stack. The TCP/IP stack can go completely belly up and
be restarted successfully. Connections will be lost, but the system as a
whole will not crash. In fact, the TCP/IP stack can be multiply
instantiated on a per-client basis, as we do. In that case even most
applications will not get hurt.
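
A toy model of per-client instantiation, in which restarting one
instance leaves the others untouched. StackInstance and restart are
names invented for the example, and each instance is assumed to own its
packet buffers outright.

```python
class StackInstance:
    """One protocol stack instance serving a single client (illustrative)."""

    def __init__(self, client, pool_size=4):
        self.client = client
        self.pool_size = pool_size
        self.packets = []  # buffers owned by this instance only

    def alloc(self, payload):
        if len(self.packets) >= self.pool_size:
            raise MemoryError("pool exhausted for " + self.client)
        self.packets.append(payload)


def restart(stacks, client):
    # Throw away the failed instance and all storage it owned.
    # Nothing belonging to any other client is reachable from it.
    old = stacks[client]
    stacks[client] = StackInstance(client, old.pool_size)


stacks = {"a": StackInstance("a"), "b": StackInstance("b")}
stacks["a"].alloc(b"pkt-a")
stacks["b"].alloc(b"pkt-b")

restart(stacks, "a")  # client a loses its connections; client b does not
```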

But this only works if TCP/IP packets are not allocated from a common
storage pool. If they are, the logical isolation is not reflected in
the implementation, and restart after a pointer error becomes, umm,
"delicate".

> >In modern systems there are also
> >mixed workload (e.g. media) and capacity scheduling motivations.
> 
> You mean media other than disk?

No. Sorry. I meant (e.g.) real-time media applications running
concurrently with non-real-time applications. There is a phenomenon
known as "resource cross-talk" that impedes tight real-time guarantees.
The easiest solution to resource cross-talk is resource accounting,
which entails not sharing things in certain cases.

> One other thing just occurred to me, namely the effect of using
> more flash memory for file I/O - e.g. the laptops that are coming
> out with flash memories instead of disk.  I haven't heard about
> any efforts along these lines for high performance systems (e.g.
> flash caches), but this trend would seem to place even more
> demands on lowering system overhead for I/O.

There has been some work in this area -- including some amazingly
unjustified patents. The benefits are not as obvious as they seem.

For reads, the benefit is pretty clear, but with modern memory sizes the
swap/paging area is almost never touched on single-user systems, so it
is not clear that one would actually observe the benefit in practice.
The only reason I configure a swap area on my Linux laptop at all is
respect for tradition (well, and the occasional *rare* case where I do
some big compile, but I'm not a typical user).

For swap/cache writes it is generally possible to schedule and block out
multiple sequential writes. Further, it is almost never the case that
anything is blocked on the completion of those writes, which means that
the write latency is largely irrelevant. Given memory sizes, the
pressure to free frames just isn't that high, and the associated
interrupts are infrequent. Write *bandwidth* may be a much more
important factor. I haven't looked at the write rates on flash recently,
but I believe that there have been relevant technology changes.

Taking all of this into account, I would expect to see only a modest
performance improvement from flash adoption in single-user systems.
Server systems are another matter entirely. In current laptops, the
flash subsystem is mainly being used to hide the sheer embarrassment
that is Vista boot latency.


shap


