[cap-talk] file I/O request overhead (was: kernel object knowledge)
Jonathan S. Shapiro
shap at eros-os.com
Mon Jun 11 09:22:08 EDT 2007
On Sun, 2007-06-10 at 23:00 -0700, Jed Donnelley wrote:
> Do you know anything about the sorts of domain separation
> in L4Linux? I imagine a lot could depend on the hardware.
So far as I know, there isn't any in the sense that you mean. From my
understanding, the domains in an L4Linux implementation are:
  1 domain per application
  1 domain for the hosted Linux kernel
  various domains for the L4 system services, but so far as I know
    these do not play an active role in the L4Linux behavior after
    the L4Linux system is initialized.
In particular, L4Linux does not attempt to isolate drivers, TCP stack,
and so forth. There was a follow-on project called SawMill that *did* do
some amount of domain separation (though not much by our standards).
Results for that can be found in:
http://i30www.ira.uka.de/research/documents/l4ka/sawmill-multiserver.pdf
Perhaps one of the L4 lurkers can expand on this -- I no longer recall
the details of those efforts.
One comment: domain separation for the sake of recoverability wasn't
their goal. The goals were to run multiple "personalities" and/or to
make the operating systems more configurable for different machine
classes. This motivates a qualitatively different form of domain
separation and different interface design choices.
> In the hardware we were using for NLTSS an "exchange"
> (the only hardware means of protection) was very expensive.
> I don't think the exact numbers are important (though I
> could probably look them up), but on the order of 500
> instructions.
Yes. Matters have gotten much worse over time.
For my dissertation, EROS was measured on a 200MHz Pentium II. Details
at
http://www.eros-os.org/papers/shap-thesis.ps
starting at p. 140.
On that machine, it was possible to perform a minimal domain switch
(i.e. one not involving an indirect string transfer) in EROS in xxx. By
contrast, the minimum invocation latency for a *kernel* capability was
3.69us (operation: get capability type; compare Linux getpid() at 3.00us
on the same hardware). One-way IPC in general form took 3.58us, while the
large/small case (very common) took 2.35us.
Overall, I would say that the cycle counts (multiply microseconds by
200) are roughly comparable to what you report for the NLTSS hardware.
Of these, the largest single component was the hardware user/supervisor
crossing mechanism, which took (from memory) 135 cycles.
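
As a worked version of the conversion above (a sketch, not part of the
measurements themselves): a 200MHz clock gives 200 cycles per
microsecond, so each quoted latency is multiplied by 200. The
dictionary labels and the helper name to_cycles are illustrative only.

```python
# Convert the latencies quoted above into cycle counts at 200 MHz.
CLOCK_MHZ = 200  # 200 MHz == 200 cycles per microsecond

# Figures as quoted in the message (microseconds); labels are illustrative.
latencies_us = {
    "kernel capability invocation (get capability type)": 3.69,
    "Linux getpid()": 3.00,
    "one-way IPC, general form": 3.58,
    "one-way IPC, large/small case": 2.35,
}


def to_cycles(us, clock_mhz=CLOCK_MHZ):
    """Return the cycle count for a latency given in microseconds."""
    return round(us * clock_mhz)


cycles = {name: to_cycles(us) for name, us in latencies_us.items()}

for name, n in cycles.items():
    print(f"{name}: {n} cycles")
```

This gives 738, 600, 716 and 470 cycles respectively, against which the
135-cycle user/supervisor crossing can be compared.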
On current-generation Pentiums, it is not uncommon for a trivial domain
switch to require 4000 cycles! There are multiple causes for this:
+ CPU/memory speed ratios are much worse, making TLB reload relatively
more expensive. Until very recently, Pentium-family processors did
not implement TLB tagging.
+ Pentium IV I-cache is 12k, but it is used for instruction expansion
into RISC micro-ops. When comparing to other implementations of IA-32,
the "effective" cache size is only 4k, so this hasn't scaled with
processor speed.
+ Pentium IV cache is virtually indexed, therefore flushed on TLB
reload. Page global ("G") bits do not appear to be honored in the
flush.
+ Pipelines are much deeper, aggravating the already large cost of
any tightly data dependent control flow (i.e. one for which the
interval between load and jump cannot be productively scheduled).
Note that in the typical IPC path neither static nor dynamic
instruction scheduling really helps with most of these -- they are
memory speed dependent.
We'll have to see what the measurements look like for Coyotos. Should be
able to get a slow path measurement pretty soon.
> >Norm Hardy reports that in KeyKOS they often found that domain
> >separation was performance-neutral, because each domain could operate
> >under "single threaded" assumptions that the monolithic version of the
> >application could not safely make.
>
> This was certainly not true in NLTSS. Every server in our system
> was fairly heavily multithreaded. This was mostly because they
> often had to do I/O and wait for completion while they were processing
> requests.
Hmm. Yes. Orthogonal persistence helps here by hiding the extra I/O
threads in many cases. Norm and I also considered a software-implemented
"probe" instruction that could be used to direct page fetch-ahead. I
haven't gotten around to implementing that.
It is not clear whether this accounts for the difference.
> When we ran the NLTSS
> file server as a separate user level process we could tune the
> scheduling so that under typical timesharing load several
> I/O requests would be issued before the file server would
> run. That meant that when the file server started running,
> it would have several requests to process before it needed
> to block.
Yes. The KeyKOS/EROS checkpointing mechanism has a similar impact.
Further, it tends to generate bulk block-sortable I/O. In KeyKOS, the
challenge was to *slow* the checkpoint I/O so that other work could make
progress.
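
To illustrate why batched, block-sortable I/O is cheap, here is a toy
sketch. It is not the KeyKOS/EROS checkpoint code: the request
representation, the block numbers, and the simple "head movement" cost
model are all assumptions made for the example. It compares issuing
writes in arrival order against issuing the same batch sorted by block
number.

```python
# Toy model: each pending write is a (block_number, payload) tuple.
# Block numbers and the cost model are hypothetical, for illustration only.

def seek_distance(requests, start=0):
    """Total head movement, in blocks, if requests are issued in this order."""
    pos, total = start, 0
    for block, _payload in requests:
        total += abs(block - pos)
        pos = block
    return total


def sort_batch(requests):
    """Order an accumulated batch by block number (one ascending sweep)."""
    return sorted(requests, key=lambda req: req[0])


# Writes in the order they happened to be generated.
arrival = [(b, b"") for b in (900, 12, 455, 13, 901, 456)]
batched = sort_batch(arrival)

print("arrival order:", seek_distance(arrival), "blocks of movement")
print("sorted batch: ", seek_distance(batched), "blocks of movement")
```

Under this model the arrival order costs 4006 blocks of movement and
the sorted batch 901. The larger the batch accumulated before issuing,
the more the sort can help, which is the effect both the tuned NLTSS
scheduling and the checkpoint mechanism exploit.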
> >Note that no "apples for apples" comparison need consider these issues,
> >because successful security isolation isn't (and cannot be) a serious
> >objective in monolithic (therefore non-isolated) designs.
>
> When you say that "security isolation" can't be a serious
> objective in a monolithic kernel, do you mean simply because
> there aren't any sub parts of the kernel to protect from each
> other (security isolation)?
Yes. In particular, there are no copy boundaries where data used by one
subsystem is cleanly and separably handed off to another. Pointers and
reference counts are the universal rule.
> Why would a monolithic kernel
> system try to do internal "resource isolation". The whole
> premise of a monolithic kernel is that all it's code is
> fully trusted.
There are sources of failure other than trust failure. Recoverability is
a fine reason for resource isolation. In modern systems there are also
mixed workload (e.g. media) and capacity scheduling motivations.
> That's why I'd find a better understanding
> of the L4Linux case or other examples of this sort -
> specifically with regard to the file I/O - so interesting.
The case and the performance rationale for checkpoint in the area of I/O
is probably well known to you, but I tried to lay it out clearly in the
following paper:
http://www.eros-os.org/papers/storedesign2002.pdf
The paper attempts to explain not only what is going on, but why it
generates competitive I/O performance in spite of moving more bytes in
absolute terms.
shap