[cap-talk] file I/O request overhead (was: kernel object knowledge)

Jed Donnelley capability at webstart.com
Mon Jun 11 22:51:08 EDT 2007


At 06:22 AM 6/11/2007, Jonathan S. Shapiro wrote:
>On Sun, 2007-06-10 at 23:00 -0700, Jed Donnelley wrote:
>
> > Do you know anything about the sorts of domain separation
> > in L4Linux?  I imagine a lot could depend on the hardware.
>
>So far as I know, there isn't any in the sense that you mean. From my
>understanding, the domains in an L4Linux implementation are:
>
>   1 domain per application
>   1 domain for the hosted linux kernel
>   various domains for the L4 system services, but so far as I know
>     these do not play an active role in the L4Linux behavior after
>     the L4linux system is initialized.

It's the single domain for the hosted Linux kernel, above, that seems
to me to make any microkernel vs. monolithic domain comparison moot in
this case.

>In particular, L4Linux does not attempt to isolate drivers, TCP stack,
>and so forth. There was a follow-on project called SawMill that *did* do
>some amount of domain separation (though not much by our standards).
>Results for that can be found in:
>
>http://i30www.ira.uka.de/research/documents/l4ka/sawmill-multiserver.pdf
>
>Perhaps one of the L4 lurkers can expand on this -- I no longer recall
>the details of those efforts.
>
>One comment: domain separation for the sake of recoverability wasn't
>their goal. The goals were to run multiple "personalities" and/or to
>make the operating systems more configurable for different machine
>classes. This motivates a qualitatively different form of domain
>separation and interface design choices.

Of course on this list it would seem that our focus should be POLA.
I know, for example, that in the NLTSS design, the file server
was quite easy to conceive as having limited authority.  Its
job was 'just' to accept I/O requests, turn them into requests
for the disk driver, and interpret the results from the disk
driver - returning control replies to processes that requested
I/O.

One thing that I'd like to expand on a bit in this message is
the possibility of language managed domain (object) separation
in OS design.  Using a language with strong object separation
would seem to provide an option (which we didn't have in the
late 1970s or through the 1980s) that could support POLA
in parts of the OS that are considered the 'kernel' from the
viewpoint of the hardware.  That is, one heavyweight exchange
to 'the system' could enter a realm of language enforced
separation between active objects that could, in principle,
have much lower overhead for transitions between what I
would call light weight (language) protected objects.

Is that an option that's been exploited in modern OS design?

> > In the hardware we were using for NLTSS an "exchange"
> > (the only hardware means of protection) was very expensive.
> > I don't think the exact numbers are important (though I
> > could probably look them up), but on the order of 500
> > instructions.
>
>Yes. Matters have gotten much worse over time.

Ah.  In that case it would seem that the language enforcement
option might be an even more valuable option.

>For my dissertation, EROS was measured on a 200Mhz Pentium II. Details
>at
>
>   http://www.eros-os.org/papers/shap-thesis.ps
>
>starting at p. 140.
>
>On that machine, it was possible to perform a minimal domain switch
>(i.e. one not involving an indirect string transfer) in EROS in xxx. By
>contrast, the minimum invocation latency for a *kernel* capability was
>3.69us (operation: get capability type. Compare linux getpid() at 3.00us
>on same hardware). One way IPC in general form took 3.58us, while the
>large/small case (very common) took 2.35us.
>
>Overall, I would say that the cycle counts (multiply microseconds by
>200) are roughly comparable to what you report for the NLTSS hardware.
>Of these, the largest single component was the hardware user/supervisor
>crossing mechanism, which took (from memory) 135 cycles.
>
>On current-generation Pentiums, it is not uncommon for a trivial domain
>switch to require 4000 cycles! There are multiple causes for this:

I assume hardware domain (memory map) switch here.
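
As a quick check on the figures quoted above, here is the "multiply
microseconds by 200" conversion written out.  This is only a sketch:
the names (CLOCK_MHZ, latencies_us, to_cycles) are mine, the latencies
are the ones quoted above, and note that the NLTSS figure I gave was in
instructions rather than cycles, so the comparison is rough at best.

```python
# Convert the quoted EROS latencies (microseconds) into cycle counts,
# assuming the 200 MHz clock quoted above, i.e. 200 cycles per microsecond.
CLOCK_MHZ = 200

# Figures as quoted in the message; the labels are illustrative only.
latencies_us = {
    "kernel capability invocation": 3.69,
    "linux getpid()": 3.00,
    "one-way IPC, general form": 3.58,
    "one-way IPC, large/small case": 2.35,
}


def to_cycles(microseconds, clock_mhz=CLOCK_MHZ):
    """Return the whole number of cycles for a latency in microseconds."""
    return round(microseconds * clock_mhz)


cycles = {name: to_cycles(us) for name, us in latencies_us.items()}

for name, count in cycles.items():
    print(f"{name}: {count} cycles")
```

On those numbers the 200 MHz paths come out at several hundred cycles
each, against the 4000 cycles quoted for a trivial domain switch on
current hardware.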

>+ CPU/memory speed ratios are much worse, making TLB reload relatively
>   more expensive. Until very recently, Pentium-family processors did
>   not implement TLB tagging.
>
>+ Pentium IV I-cache is 12k, but it is used for instruction expansion
>   into RISC micro-ops. When comparing to other implementations of IA-32,
>   the "effective" cache size is only 4k, so this hasn't scaled with
>   processor speed.
>
>+ Pentium IV cache is virtually indexed, therefore flushed on TLB
>   reload. Page global ("G") bits do not appear to be honored in the
>   flush.
>
>+ Pipelines are much deeper, aggravating the already large cost of
>   any tightly data dependent control flow (i.e. one for which the
>   interval between load and jump cannot be productively scheduled).
>   Note that in the typical IPC path neither static nor dynamic
>   instruction scheduling really helps with most of these -- they are
>   memory speed dependent.
>
>We'll have to see what the measurements look like for Coyotos. Should be
>able to get a slow path measurement pretty soon.
>
> > >Norm Hardy reports that in KeyKOS they often found that domain
> > >separation was performance-neutral, because each domain could operate
> > >under "single threaded" assumptions that the monolithic version of the
> > >application could not safely make.
> >
> > This was certainly not true in NLTSS.  Every server in our system
> > was fairly heavily multithreaded.  This was mostly because they
> > often had to do I/O and wait for completion while they were processing
> > requests.
>
>Hmm. Yes. Orthogonal persistence helps here by hiding the extra I/O
>threads in many cases.

As you say, hmmmm.  I'd be interested to know how this "orthogonal
persistence" works.  If 'we' (speaking hopefully) could focus on
the seemingly simple file I/O case (sorry if I seem to be selfish
in this regard - it does seem to me the most important case and
I understand it thoroughly in a previous context) I think the issues
might be pretty clear.

When a request comes in for file I/O, the software that handles it
(what we termed the file server) has to translate the request on
the file object into a request for a transfer from either disk
or disk cache.  If it is a copy from disk cache then perhaps it
can be initiated immediately (with a copy that I can't see how
to avoid at this point), but if the data needs to come from rotating
disk, then it would seem that at that point the software managing
the request needs to block while waiting for that disk I/O to
complete with considerable saved state (all the information about
the request, table pointers, disk pointers, etc., etc.).

We chose to have that state saved as the context of the thread
on the stack.  That meant when we got to the point of waiting on
the disk I/O the thread processing a request would block, resulting
in either a switch to another active file server thread or in a
domain switch out of the file server - if there was no other work
for the file server to do at that point.
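
To make the shape of that concrete, here is a minimal sketch in Python
(not the NLTSS code, which looked nothing like this).  Everything named
here (FileServer, read, _disk_driver, serve_requests) is hypothetical,
and a dictionary stands in for the rotating disk.  The point is only
that the per-request state lives in the locals of the blocked thread.

```python
import queue
import threading


class FileServer:
    """Toy file server: one thread per request, state kept on its stack."""

    def __init__(self, disk_blocks):
        self.disk = dict(disk_blocks)      # stand-in for the rotating disk
        self.cache = {}                    # stand-in for the disk cache
        self.cache_lock = threading.Lock()
        self.disk_requests = queue.Queue() # requests handed to the "driver"
        threading.Thread(target=self._disk_driver, daemon=True).start()

    def _disk_driver(self):
        # Stand-in for the disk driver: completes one transfer at a time.
        while True:
            block, done, result = self.disk_requests.get()
            result.append(self.disk.get(block, b""))
            done.set()                     # wake the waiting request thread

    def read(self, block):
        with self.cache_lock:
            data = self.cache.get(block)
        if data is not None:
            return bytes(data)             # cache hit: just the copy

        # Cache miss: queue a disk request and block.  The request state
        # (block, done, result) stays in this thread's local variables.
        done = threading.Event()
        result = []
        self.disk_requests.put((block, done, result))
        done.wait()                        # other request threads may run now

        data = result[0]
        with self.cache_lock:
            self.cache[block] = data
        return bytes(data)


def serve_requests(server, blocks):
    """Run one thread per requested block; return replies in request order."""
    replies = [None] * len(blocks)

    def handle(index, block):
        replies[index] = server.read(block)

    threads = [threading.Thread(target=handle, args=(i, b))
               for i, b in enumerate(blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return replies
```

When every request thread is waiting in done.wait() there is nothing
left for the server to do, which corresponds to the domain switch out
of the file server described above.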

I admit that this approach seems natural to me.  I'd be interested
to hear about alternatives.

>Norm and I also considered a software-implemented
>"probe" instruction that could be used to direct page fetch-ahead. I
>haven't gotten around to implementing that.

From the above it sounds to me like we were faced with quite different
choices.  I expect this is likely the result of our dealing with
a rather limited case of a hardware architecture with no virtual
memory (base and bounds only), so our I/O was strictly for disk
I/O.  A "probe" for a page ahead would be no use in our context.

However, even in your case I would expect that the usual situation
would reduce to essentially the same case - namely either transferring
data from disk cache or transferring data from disk.  I'd be interested
to hear if not.  In the former case any "system overhead" is
particularly noticeable.  In the latter case it seems to me some
state must be saved while the disk I/O is completed.

>Not clear if this may account for the difference.
>
> > When we ran the NLTSS
> > file server as a separate user level process we could tune the
> > scheduling so that under typical timesharing load several
> > I/O requests would be issued before the file server would
> > run.  That meant that when the file server started running,
> > it would have several requests to process before it needed
> > to block.
>
>Yes. The KeyKOS/EROS checkpointing mechanism has similar impact.
>Further, it tends to generate bulk block-sortable I/O. In KeyKOS, the
>challenge was to *slow* the checkpoint I/O so that other work could make
>progress.

It does sound to me like the issues there are similar.

> > >Note that no "apples for apples" comparison need consider these issues,
> > >because successful security isolation isn't (and cannot be) a serious
> > >objective in monolithic (therefore non-isolated) designs.
> >
> > When you say that "security isolation" can't be a serious
> > objective in a monolithic kernel, do you mean simply because
> > there aren't any sub parts of the kernel to protect from each
> > other (security isolation)?
>
>Yes. In particular, there are no copy boundaries where data used by one
>subsystem is cleanly and separably handed off to another. Pointers and
>reference counts are the universal rule.

Then I'm with you.

> > Why would a monolithic kernel
> > system try to do internal "resource isolation"?  The whole
> > premise of a monolithic kernel is that all its code is
> > fully trusted.
>
>There are sources of failure other than trust failure. Recoverability is
>a fine reason for resource isolation.

I don't understand your use of the term "recoverability" here.
Can you explain how such "resource isolation" can aid recoverability?

>In modern systems there are also
>mixed workload (e.g. media) and capacity scheduling motivations.

You mean media other than disk?  With the modern systems that I'm
familiar with (all the Unix systems that I work on from servers
to supercomputers) it seems to me that the only 'media' that
are still relevant for performance issues are disk and network
(though usually internal for message passing, e.g. MPI).  Can
you give an example of some other media element that is important
for performance considerations?  Maybe graphics display output (not
considered a significant issue for our scientific computer benchmarks)?
Are there any others?

> > That's why I'd find a better understanding
> > of the L4Linux case or other examples of this sort -
> > specifically with regard to the file I/O - so interesting.
>
>The case and the performance rationale for checkpoint in the area of I/O
>is probably well known to you, but I tried to lay it out clearly in the
>following paper:
>
>  http://www.eros-os.org/papers/storedesign2002.pdf

Thanks Jonathan.  It looks like there is a lot more going on in the
design reflected there than we had in our file I/O situation.  I'll be
interested to read through that to see if I can make sense of all that's
going on.

>The paper attempts to explain not only what is going on, but why it
>generates competitive I/O performance in spite of moving more bytes in
>absolute terms.

Should be interesting reading, focusing on just the case that I'm
interested in.

One other thing just occurred to me, namely the effect of using
more flash memory for file I/O - e.g. the laptops that are coming
out with flash memories instead of disk.  I haven't heard about
any efforts along these lines for high performance systems (e.g.
flash caches), but this trend would seem to place even more
demands on lowering system overhead for I/O.

--Jed  http://www.webstart.com/jed-signature.html  



