[cap-talk] file I/O request overhead (was: kernel object knowledge)

Jed Donnelley capability at webstart.com
Mon Jun 11 02:00:07 EDT 2007


At 05:32 PM 6/10/2007, Jonathan S. Shapiro wrote:
>On Sun, 2007-06-10 at 15:57 -0700, Jed Donnelley wrote:
> > In an apples to apples
> > comparison I don't see how the microkernel implementation
> > can compete on performance.
>
>The literature tells an interesting tale. It depends very much on what
>we mean by "compete". L4Linux, for example, is within +/- 0.5% relative
>to native Linux on the AIM application benchmarks. That is well within
>the range of variation of normal "native UNIX" vs. "native UNIX"
>comparisons. I think that qualifies as directly competitive by almost
>any definition.

Do you know anything about the sorts of domain separation
in L4Linux?  I imagine a lot could depend on the hardware.

I started looking into their papers, e.g.:

http://os.inf.tu-dresden.de/L4/LinuxOnL4/papers.shtml

Perhaps, if you are familiar with some of them, you could
point me to places where I might find out how they do
their domain separation, and how many such domain changes
happen in the calls they compare in their performance
analysis?

I seem to be getting close in:

http://os.inf.tu-dresden.de/pubs/sosp97/

though the only 'Unix' for comparison seems to be
what they refer to as a "monolithic Unix Kernel,
Linux" that they run on top of their microkernel.

In the hardware we were using for NLTSS, an "exchange"
(the only hardware means of protection) was very expensive.
I don't think the exact numbers are important (though I
could probably look them up), but an exchange cost on the
order of 500 instructions.

In such a situation the cost of those exchanges dominates.
That was how we were able to get a 3x improvement by
"cheating".

> > There is a fundamental trade-off
> > between performance and integrity/modularity/reliability
> > if one achieves the additional integrity/modularity/reliability
> > by using additional domain changes.
>
>This is true, as is your subsequent comment that it is often possible to
>cheat.
>
>However, there is something that you do NOT address in your comments (or
>at least, not the ones that I have seen -- I have not been tracking
>closely) that may offset this.
>
>When a system is modularized, it often turns out that the mere fact of
>domain separation removes the need for certain kinds of defensiveness.
>For example, splitting a large system into domains may remove the need
>for certain locks by removing certain possibilities of interference from
>the design space.
>
>Norm Hardy reports that in KeyKOS they often found that domain
>separation was performance-neutral, because each domain could operate
>under "single threaded" assumptions that the monolithic version of the
>application could not safely make.

This was certainly not true in NLTSS.  Every server in our system
was fairly heavily multithreaded.  This was mostly because they
often had to do I/O and wait for completion while they were processing
requests.

However, I will note one area where we did get some performance
benefit that seemed a bit surprising.  Consider the file server
as a simple example.  The most common sort of "system call" that
would block simple processes (e.g. editors, compilers, loaders,
etc. - typical single-threaded processes, not scientific simulations
optimized for parallelism) was file I/O.  When we ran the NLTSS
file server as a separate user level process we could tune the
scheduling so that under typical timesharing load several
I/O requests would be issued before the file server would
run.  That meant that when the file server started running,
it would have several requests to process before it needed
to block.  These separate requests were handled by separate
lightweight threads (stack switching), which of course had
much less switching overhead than the domain switching.
That meant that when we were under timesharing load we
could get back some of that 3x performance loss due to
the domain exchanges.
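
In rough outline the effect looked like this (a sketch with
made-up primitives and costs, not actual NLTSS code): the server
pays the expensive domain exchange once per wakeup, then drains
every request already queued on cheap stack-switched threads:

    #include <stdio.h>

    #define QUEUE_MAX 64

    /* Assumed costs, in instructions, purely for illustration. */
    enum { EXCHANGE_COST = 500, STACK_SWITCH_COST = 20 };

    struct request { int id; };

    static struct request queue[QUEUE_MAX];
    static int queued = 0;

    /* Stand-in for a lightweight thread: handle one request on its
     * own cheap, stack-switched context. */
    static void handle_request(struct request *r) {
        printf("  served request %d (~%d instr stack switch)\n",
               r->id, STACK_SWITCH_COST);
    }

    /* One server wakeup: pay the domain exchange once, then drain
     * every request that queued up while we were blocked. */
    static void server_run(void) {
        printf("server wakes: one domain exchange, ~%d instr\n",
               EXCHANGE_COST);
        int batch = queued;
        for (int i = 0; i < batch; i++)
            handle_request(&queue[i]);
        queued = 0;
        if (batch > 0)
            printf("per-request cost: ~%d instr amortized\n",
                   EXCHANGE_COST / batch + STACK_SWITCH_COST);
    }

    int main(void) {
        /* Under timesharing load several requests arrive before the
         * server is scheduled. */
        for (int i = 0; i < 5; i++)
            queue[queued++] = (struct request){ .id = i };
        server_run();
        return 0;
    }

The more requests that pile up per wakeup, the more of the
exchange cost gets amortized away.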

Unfortunately our most demanding applications would run
at night when there was negligible time sharing load,
and some of them would bang on the Solid State Disk
I mentioned previously, which had a latency on the
order of a microsecond.  Since their I/O was almost all
that was happening, the domain exchanges ended up accounting
for nearly all our important system overhead.

>In short: *structural* improvements to systems often expose
>opportunities for performance improvements that did not exist in the
>pre-structured system.

I wouldn't consider my example above to fall into that
category - though I guess it's a matter of interpretation.
I don't know of any examples that we had that would seem
to fit that characterization.  We had a very good basis
for comparison in that we were running two systems side
by side, one a microkernel and the other a monolithic
kernel providing the same high-level services.

If you think about simple operations like disk I/O,
there really isn't much to it.  You keep tables of
where the file data is on the disk and you map
file requests into raw disk I/O.  You can do that
pretty simply in a monolithic kernel.  You can also
do it with comparable simplicity in a microkernel.
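
The heart of it is little more than an extent lookup.  Here is a
minimal sketch (the table layout and names are hypothetical, not
NLTSS's actual structures):

    /* Minimal sketch of file-offset -> raw-disk mapping.  The extent
     * layout and names here are hypothetical. */
    #include <stdio.h>
    #include <stddef.h>

    struct extent {
        unsigned long file_block;   /* first file block this extent covers */
        unsigned long disk_block;   /* where that run starts on disk */
        unsigned long length;       /* number of contiguous blocks */
    };

    /* Translate a file block number into a raw disk block number.
     * Returns 0 on success, -1 if the block isn't mapped. */
    static int map_block(const struct extent *tab, size_t n,
                         unsigned long file_block,
                         unsigned long *disk_block)
    {
        for (size_t i = 0; i < n; i++) {
            if (file_block >= tab[i].file_block &&
                file_block < tab[i].file_block + tab[i].length) {
                *disk_block = tab[i].disk_block +
                              (file_block - tab[i].file_block);
                return 0;
            }
        }
        return -1;
    }

    int main(void) {
        struct extent tab[] = {
            { 0, 1000, 8 },   /* file blocks 0..7  -> disk 1000..1007 */
            { 8, 5000, 4 },   /* file blocks 8..11 -> disk 5000..5003 */
        };
        unsigned long d;
        if (map_block(tab, 2, 9, &d) == 0)
            printf("file block 9 -> disk block %lu\n", d);   /* 5001 */
        return 0;
    }

The same lookup serves a monolithic kernel or a user-level file
server equally well; what differs is only how many protection
domain crossings wrap the call.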

>Given this, my sense is that the truth about performance in well
>engineered systems lies somewhere in the middle. My experience is that
>domain separation is not usually the source of performance cost in and
>of itself.

It certainly was for us.

>The source of added cost is data copies that occur on
>high-frequency paths.

Hmmm.  The only relevant "high frequency" data path we
had was file I/O.  In that case we had optimized so
that in the typical case I/O happened directly to the
memory of the application process - in both the monolithic
kernel and in the microkernel.  It was definitely the
cost of the domain exchanges that hurt us in the
microkernel case.
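
In outline the server's side of that zero-copy path looked
something like the following.  This is a hypothetical sketch
building on the extent lookup above; raw_disk_read() stands in
for programming the controller to DMA directly into the caller's
buffer:

    /* Sketch of a read served with no intermediate copy: the file
     * server only translates the request and points the disk at the
     * application's own buffer.  All names here are hypothetical;
     * struct extent and map_block() are from the earlier sketch. */
    #include <stddef.h>

    struct io_request {
        unsigned long file_block;   /* first file block to read */
        unsigned long nblocks;      /* contiguous blocks requested */
        void         *user_buf;     /* caller's buffer: the DMA target */
    };

    /* Hypothetical driver entry: DMA nblocks starting at disk_block
     * straight into buf, bypassing any server-side copy. */
    int raw_disk_read(unsigned long disk_block, unsigned long nblocks,
                      void *buf);

    int serve_read(const struct extent *tab, size_t n,
                   const struct io_request *req)
    {
        unsigned long disk_block;
        if (map_block(tab, n, req->file_block, &disk_block) < 0)
            return -1;              /* unmapped region of the file */
        return raw_disk_read(disk_block, req->nblocks, req->user_buf);
    }

The data never passes through the server's memory, so the only
per-request costs left are the translation and the exchanges.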

> From the cases I have looked at in detail, it
>isn't restructuring that creates performance costs. It's security
>isolation across asymmetric trust boundaries, because the known
>techniques for this involve copies.

This was also a source of some cost for us because we did have
to do data copies for the control information in requests
(e.g. copying capabilities in messages).  However, these
copies were typically very small relative to the size of
file I/Os on our systems (perhaps not as true today, when
systems assume optimized caching for small file I/Os).
Also, as I mentioned, our domain exchange was hugely
expensive - I'm sure much more so than on today's systems.
That's a case where you can well argue that our experience
in the 1980s supercomputers may not be directly relevant
to today's systems.

>Note that no "apples for apples" comparison need consider these issues,
>because successful security isolation isn't (and cannot be) a serious
>objective in monolithic (therefore non-isolated) designs.

When you say that "security isolation" can't be a serious
objective in a monolithic kernel, do you mean simply because
there aren't any separate parts of the kernel to protect from
each other (security isolation)?

I feel that focusing on the file I/O cases provides quite a good
point of comparison.

>My observation
>is that where monolithic designs have attempted to address
>considerations of resource isolation, they end up with exactly the same
>copy costs as microkernel systems.

If you mean internally, then that sounds right to me.

>When these costs are introduced,
>their net performance can be worse than microkernel systems because
>their dynamic locality properties are less well tuned.

That could well be.  However, why would a monolithic kernel
system try to do internal "resource isolation"?  The whole
premise of a monolithic kernel is that all its code is
fully trusted.

In our NLTSS microkernel system, we split out the file
server as a separate domain because it seemed naturally
isolated.  However, from a practical viewpoint, if there was
a bug in the file server code it would quickly bring the
system down - whether in the monolithic kernel or in the
microkernel.  The microkernel might come to a stop a bit more
gracefully (with the file server faulted and the rest of
the processes quickly hung waiting on file I/O), but it
was still fatal (fortunately it almost never happened...).

>No hard and fast rules or conclusions here. Mainly I'm trying to point
>out that the issues here are extremely subtle, and hard to talk about
>without considering specific cases.

For me, focusing on a case like file I/O makes the
considerations facing any system much clearer.  There
just isn't that much to it.  Even if you take into account
the heavy caching that systems do these days (for us
doing the caching in the applications was preferable
for a number of reasons, not the least of which was
again the exchange costs), I think all the pieces
of the costs for file I/O are pretty clear.

For our specific cases (the NLTSS/LTSS comparison) the
situation was quite clear and I was right in there.
Perhaps I'm projecting those issues too much onto other
systems.  That's why I'd find a better understanding
of the L4Linux case, or other examples of this sort -
specifically with regard to file I/O - so interesting.
I know exactly which pieces I would be looking
for; any deviations would be surprising.

In thinking about this a bit, I believe
the heavy file caching that systems do these days
results in essentially the same demand for low-latency
file I/O request processing that we had to deal with
for our solid state disk.  It seems to me that would
suggest at least the sort of cheating we did in our
system, or a monolithic kernel approach - at least
with regard to file I/O.  If the data for a file I/O
is coming from a memory cache, then the performance
of the I/O requests is going to be limited entirely
by system overhead in the request processing.

I'd be quite interested to see some side-by-side
data comparing where the system "overhead" (processing)
goes in various systems for such low-latency file I/O -
e.g. in microkernel efforts vs. monolithic kernels
like the Unix variants.  Something like a timeline
showing the events along the way, from the time a
user-level application issues a library call to read
data from a file until the time that call completes.
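
Lacking that, even a crude end-to-end number is easy to get on a
modern Unix.  Something like the following would do (a minimal
sketch - it measures only the bottom-line latency of a cached
read, not the per-event timeline I'd really want to see):

    /* Crude measurement of cached-read latency on a POSIX system:
     * read the same 4 KB repeatedly so it stays in the page cache,
     * and report the average read() round-trip time. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        enum { N = 100000 };
        struct timespec t0, t1;

        read(fd, buf, sizeof buf);             /* warm the cache */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            lseek(fd, 0, SEEK_SET);
            if (read(fd, buf, sizeof buf) < 0) {
                perror("read");
                return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("avg cached 4 KB read: %.0f ns\n", ns / N);
        close(fd);
        return 0;
    }

Run the same harness on a monolithic kernel and on a microkernel
serving the same file and the difference is, to first order, the
cost of the extra domain crossings.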

--Jed  http://www.webstart.com/jed-signature.html 



