[cap-talk] Object-capability vs. monolithic performance
Jed Donnelley
capability at webstart.com
Thu Jan 4 18:24:04 CST 2007
At 06:29 AM 1/4/2007, Jonathan S. Shapiro wrote:
>[Subject change. Thank God for Google. I'ld *never* find anything on
>this list if Google didn't index us.]
>
>Warning: mild flares ahead, but with humor.
>
>On Wed, 2007-01-03 at 17:07 -0800, Jed Donnelley wrote:
> >At 09:42 AM 1/3/2007, Marcus Brinkmann wrote:
>
> > >It's hard to argue with your personal judgement of "good performance",
> > >but I will admit your point. However, I think there are many
> > >applications where "good" may not quite be "good enough". As test
> > >cases, I suggest Gigabit Ethernet and live audio processing. That's
> > >not particularly high-end, but typical end-user requirements these
> > >days.
> >
> > With appropriate design any "call" overhead from the capability
> > interface is comparable to that for a typical system call and
> > has no impact on bandwidth intensive applications like
> > Gigabit Ethernet or live audio processing.
>
>Jed: since my lab actually *did* this
Did what? Compared two system call interfaces for performance, one a
wrappable/mappable object-capability call interface and another not
(hard to describe, e.g. whatever it comes down to in a Unix system)?
It sounds (below) as though you were comparing two internal implementations,
not the cost of the system call interface. It's the relative cost of the
system call interface that is the issue I'm concerned about. More
details below.
>three years ago, I feel pretty
>confident that your experience from the 1980s is not predictive. The
>necessary latency requirements for Gbit ethernet are sub microsecond and
>the packet rate is many decimal orders of magnitude higher than anything
>a disk can put out even today.
I believe that in our time frame the latency requirements for the solid
state disk had those same properties. Remember, we're talking about
the latency of access to memory as seen through a file system
interface. Of course it was absolutely impossible to get the latency
through a "file" read/write capability invocation down to the level of a memory
fetch/store, but fortunately we didn't have to do that. 'All' we had to do
was to get the latency down to the level that was available through the
competing more traditional system call interface to a monolithic kernel.
What we found (obvious if you think about it) was:
1. The latency for this user level service was bounded below by the
latency of the domain exchange mechanism from the user to the
"system." At some point no amount of work could improve that cost,
whether through the more traditional system call interface to the
monolithic kernel system or through the wrappable/mappable
object-capability interface. The cost of this exchange was
rather high for us - some thousands of instruction times.
2. There were initially substantial additional costs in our micro-kernel
implementation, but these costs were dominated by the domain
change overhead in switching between the protected modules within
the system. In our case our file server ran in user mode. This meant
that to process a file I/O call (independent of the interface) we had to
exchange to the kernel, then back up to the file server, then back
to the kernel, and finally back to the user. By moving the file server
into the kernel (even with the same multithreaded code, same
processing of requests as through a capability invocation, etc., etc.)
we reduced our latency by about a factor of two, bringing it in line
with the latency for the monolithic kernel system - which was deemed
adequate for the solid state disk access application (though I'm not
sure why as higher performance was possible if application level
code could run in the kernel).
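
A rough sketch of the arithmetic behind points 1 and 2. The numbers are
assumptions chosen only to illustrate the shape of the result (an exchange
cost standing in for "some thousands of instruction times", and an arbitrary
amount of real server work); they are not measurements from our system.

```python
# Illustrative latency model. All constants are assumed, not measured.
EXCHANGE_COST = 2000  # assumed instruction times per domain exchange
SERVICE_WORK = 500    # assumed instruction times of actual file server work


def call_latency(domain_exchanges: int, work: int = SERVICE_WORK) -> int:
    """Latency of one file I/O call, in instruction times."""
    return domain_exchanges * EXCHANGE_COST + work


# File server in user mode: user -> kernel -> server -> kernel -> user.
user_mode_server = call_latency(4)
# File server moved into the kernel: user -> kernel -> user.
in_kernel_server = call_latency(2)
# Monolithic kernel: the same two exchanges for the system call itself.
monolithic = call_latency(2)

ratio = user_mode_server / in_kernel_server
print(user_mode_server, in_kernel_server, monolithic, ratio)
```

Under these assumptions the in-kernel server and the monolithic kernel pay
the same two exchanges, and the user mode server pays roughly twice as much.
The floor set by the two remaining exchanges is the same for either interface.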
Perhaps you can imagine how many years (literally) I spent poring
over system traces, cycle analyses, etc. in the effort to improve
this performance and how much work was involved by my whole
team with this "server in the kernel" effort (the way we referred
to it). There was nothing "glib" about it. We were fighting for our
lives. As I recall this performance improvement was one of the
milestones that we had to reach for "Big P" production:
http://www.computer-history.info/Page5.dir/pages/Chronicles.dir/images/big-p.gif
(for your amusement, part of:
http://www.computer-history.info/Page5.dir/pages/Chronicles.dir/index.html
). That's me with the hat at the far right, chasing that big P. The
Condor flying over us is one of our benchmark applications from the time, also
seen in this cartoon:
http://www.computer-history.info/Page5.dir/pages/Chronicles.dir/images/nltss-road-work.gif
where I'm again the one in the hat.
One thing we were particularly proud about in this effort was that
at the end of the process we could basically throw a switch and
build a system with any given service process in the kernel
(lowering overhead but increasing risk) or out of the kernel
(increasing overhead but decreasing risk). To quite a low level
the server code was the same. This "switch" throwing basically
amounted to loading a different set of libraries for the two different
cases.
Of course for our production systems (with their requirement for
low latency access to the solid state disk through the file
I/O interface) we never ended up building another system with
the file server as a user level process. Still, the opportunity
was there and we did build such systems for testing. The users
were content and we were content. From our perspective the user
interface was still wrappable/mappable. This file server in the kernel
could even receive requests directly from remote user applications
on the network and process them appropriately (safely) just as it did
for local "user" or "system" processes.
I guess every generation has some challenges which, if met,
result in a certain amount of pride. Your $500 challenge below
suggests such an area for you. You can see at least one of
ours in the above.
Since I side tracked so far anyway, I'll mention again for emphasis
that all the above work was done without any need for changes in the
user level "system call" interface that was essentially wrappable/mappable
object-capability. While this consistency made life easier for our users,
it wasn't strictly necessary. All our applications used libraries that effectively
hid the base system call interface. We could (at high cost) and occasionally
did change the base "invocation" call (for us it was a "communicate" call,
the only "system" call that our system supported). This required our
codes to be reloaded with new low level libraries (perhaps the equivalent
of glibc), but the applications themselves didn't need to change. I'll note
that the only times we found it worthwhile to do so were for functional
reasons, not for performance reasons - though as you see above performance
was vital to us.
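
As a sketch of that layering, assume a hypothetical communicate function
standing in for the one base call and two hypothetical versions of a low
level library with different message layouts. The names and the JSON
encoding are illustrative assumptions, not our actual formats.

```python
import json


def communicate(message: bytes) -> bytes:
    """Hypothetical stand-in for the single base system call.
    A real kernel would route the message; this one answers locally."""
    request = json.loads(message)
    return json.dumps({"ok": True, "op": request["op"]}).encode()


class LibraryV1:
    """Original low level library: one flat message per request."""

    def read(self, name: str, length: int) -> dict:
        msg = {"op": "read", "name": name, "length": length}
        return json.loads(communicate(json.dumps(msg).encode()))


class LibraryV2:
    """Reloaded library after a change to the base call format:
    arguments now travel in a nested 'args' field."""

    def read(self, name: str, length: int) -> dict:
        msg = {"op": "read", "args": {"name": name, "length": length}}
        return json.loads(communicate(json.dumps(msg).encode()))


def application(lib) -> bool:
    """Application code: identical against either library."""
    return lib.read("results.dat", 4096)["ok"]
```

The application function is the same whichever library it is loaded with;
only the library knows how the base call is encoded.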
>Also (as you acknowledged indirectly
>later in your note), a system designed with proper isolation isn't going
>to get away with one system call per packet, while a monolithic design
>typically gets away with significantly *less* than one system call per
>packet.
When you say "system call" above, I map that to something like
"cross domain call". If one is able (as we were) to turn any such
"cross domain call"s internal to the system's implementation into
a subroutine call, then I argue that the performance can be as
good as with any implementation in a monolithic kernel (see diagram/
discussion below).
>Yes, we can achieve comparable performance with a comparably monolithic
>and unprincipled design (as you apparently did),
Of course I bristle at "unprincipled". I believe we had the highest possible
"principles", including that our users' requirements were paramount.
>but in doing so we
>would lose most of the benefit of object-capability systems.
I guess this is a matter of what "most" means to each of us. For tight
performance requirements and especially with a moderately expensive
domain change (can be dictated by the machine hardware as it was
for our case - though there was a minor additional cost imposed in
our case by the language imposed subroutine convention, basically
caller vs. callee saving of registers) one has no choice but to sacrifice
protection domain exchanges (which can provide their own value in terms
of integrity, reliability, analyzability, etc.) for performance.
What I'm arguing is that even given such sacrifices, the functional
wrappability/mappability and object-capability nature of the "call"
interface need not be sacrificed for performance.
>Getting good gigabit ethernet performance with a properly POLA-based
>design on a modern, fast processor and fast memory system is goddamn
>hard. The architectural challenge lies in the need to simultaneously
>maintain a zero-copy networking design and a set of protection barriers
>between the wire and the application, all while responding with 1
>microsecond latency (or less). For large packets there is no real
>problem. The case that it completely ball busting is floods of small
>packets. It is tempting to think that all of this is slow interactive
>character traffic, but that characterization of small packets hasn't
>been true since at least 1985. Small packets now make up a surprisingly
>large portion of the network packet demographics, and many of the
>protocols involved are high speed and/or low latency.
I understand all this. I've been there many times myself - though in
different contexts I accept. All I'm taking issue with is the dependence
on the user interface. Let's see if I can "draw" a picture. You start with
some user call, generically:
                  /
Net send/receive: | An exchange happens, "system" code runs
                  \
One can implement the "system" code with an internally POLA
architecture with domain crossings between various modules. Fine.
Doing so has a latency cost. It also has a modularity, integrity,
possibly reliability, etc. value. If performance is the ascendant
requirement (as in my experience it usually is), then these internal
values may need to be sacrificed to achieve the needed performance.
One can (at least we were able to) keep the internal modularity but
turn the domain boundary crossings into subroutine calls to meet
performance requirements at the cost of protection that typically
(as it's all internal to the system's implementation) is not vital.
What I argue is that the interface for the base "Net send/receive" can
still be a wrappable/mappable object-capability interface. (In our
case Net send/receive was not an object invocation, but rather
tied directly into the one system call, but I don't think that aspect of
things is relevant here).
I guess I should mention here that if one does wrap that interface
(that is one adds a level of emulation through an extension mechanism),
then of course that will add to the latency of the operation. Is that
an issue in this discussion?
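
To make "wrapping" concrete, here is a minimal sketch assuming a
hypothetical NetCapability object with a single invoke method and a
hypothetical LoggingWrapper that interposes on it. Neither is taken from any
real system; they only illustrate that the caller sees the same interface
either way, and that the wrapper adds one more level of call per operation.

```python
class NetCapability:
    """Hypothetical base capability for the network send operation."""

    def invoke(self, op: str, payload: bytes) -> bytes:
        if op == "send":
            return b"sent %d bytes" % len(payload)
        raise ValueError("unsupported operation: " + op)


class LoggingWrapper:
    """Wraps anything with the same invoke() interface. The caller cannot
    tell it from the base capability, apart from the added latency."""

    def __init__(self, inner):
        self._inner = inner
        self.log = []

    def invoke(self, op: str, payload: bytes) -> bytes:
        self.log.append((op, len(payload)))     # the emulation layer's own work
        return self._inner.invoke(op, payload)  # then forward unchanged


def application(net) -> bytes:
    """Application code: written once against the invoke() interface."""
    return net.invoke("send", b"hello")


plain = application(NetCapability())
wrapper = LoggingWrapper(NetCapability())
wrapped = application(wrapper)
```

Until a wrapper like this is actually installed, the application pays only
for the base call.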
>In the torture case, the best we were able to do in EROS was about 80%
>of wire speed, which was about 10%-15% worse than linux, and at that I
>suspect that our CPU utilization was much higher than Linux. [Aside: we
>still beat the living crap out of Windows.] The primary source of the
>cost was restrictions that arose from a purely synchronous IPC
>mechanism. The problem wasn't the call overhead per se, but the
>blocking, unblocking, and context switching entailed. Based on
>measurement, the context switch overheads are a substantial proportion
>of the total processing time.
If they are internal to your implementation then I argue that you had/have
the same opportunity to optimize them away while still keeping your
same basic modular structure - just with the barriers lowered.
> > To meet these performance goals in some cases
> > we needed to push some of the service processes (e.g. the
> > "file" server) into what amounted to the kernel (cutting out
> > two of four domain changes), but we were still able to keep
> > the interface the same, faithful to the object-capability
> > model.
>
>Ah. So you cheated. :-)
There you go. I argue that opportunity was also available to you, and
without sacrificing the wrappable/mappable object-capability
interface for the user network send/receive call.
> > The real issue (from my experience) is the number
> > of domain transitions required for a given call. This can
> > be reduced as much as necessary at the cost of
> > weaker integrity, reliability, analyzability, etc. in the
> > resulting monolith.
> >
> > That is, the trade-off isn't in the interface, it's in the
> > service implementation.
>
>Yes, the issue is domain switches, but much of this in the ethernet case
>is driven by choices in the IPC mechanism.
Are you referring to the IPC mechanism within the implementation
or as seen by the user level application? In the former case,
"cheat" away as needed. If the latter then we have some details
to get into.
>Good. Now I will switch sides.
>
>Given what we know today about high-performance microkernel
>implementation, I claim that we can now achieve performance comparable
>to monolithic systems *without* accepting weaker integrity, reliability,
>or analyzability when measured on an application benchmark basis. We
>cannot afford gratuitous domain switches, but we can afford enough to
>preserve these properties.
OK, I'll also switch sides (on this narrow issue, not on my fundamental
argument for wrappable/mappable object-capability "user" interfaces).
I believe that no matter what efforts you make to achieve performance
comparable to a monolithic kernel and still preserve your internal
domain changes for stronger integrity, etc., I can come up with
a monolithic implementation that beats it by something like the
cost of those internal domain changes. Doesn't this seem obvious?
In fact I can use your own implementation with the barriers reduced
to subroutine calls to compete against you. Why fight it?
>I assume, for this claim, a decently modern TLB implementation (as is
>presently available on all modern processor implementations -- even
>later Pentiums) and a consistent cache (which rules out the widely
>deployed ARM parts, but not the later core designs).
>
>I'm tempted to require a finite set of registers on the processor, but
>I'ld only be saying that to pull Alan's nose a bit. Itanium does a
>surprisingly fast switch if the implementation is careful.
The register business is where you can get into fixed domain
change costs being greater than what amounts to a subroutine
call cost. It seems to me that the trade-offs are pretty obvious
in this area. I don't feel we're gaining much from the discussion
at this point.
What is most important to me is hearing any argument about
the user level interface. That is anything to the effect that there
is some reason that this can't be a wrappable/mappable
object-capability interface because (abc - including performance
considerations). Remember, I'm not arguing that there won't
be an inevitable cost for doing actual wrapping, just that until
such wrapping is done, the cost of the interface need not be
significantly (on the order of a single subroutine call vs. any
alternative monolithic kernel call - the call, not the internal
implementation) different from that for the same service in
a system like Unix or Windows.
>I set application benchmarks as the basis because there will always be
>some small number of uninteresting operations where one system is slower
>than another, and people use these to pick nits.
>
>Jed: You're a very talented guy. You really need to stop setting such
>low bars for yourself and the community. It's embarrassing. :-)
What "low bar" do you believe I've set?
> > >Given that
> > >even a single level of indirection makes these systems non-competitive
> > >with traditional monolithic kernels without object/capability system
> > >should make one careful about any exaggerated claims.
> >
> > I described my approach to dealing with a "single level of indirection"
> > above. Namely, eliminate it where needed, smash the mechanism
> > into a monolithic kernel (which may be needed to provide the required
> > performance), but leave the interface the same.
>
>Can't be done. The level of indirection in question is the indirection
>necessary to the implementation of protection in the microkernel.
I hope the discussion above has clarified this point. Of course I
bristle a bit when I hear somebody say that something that we
demonstrably did "Can't be done." There is a protection boundary
inevitably crossed for the initial user level call. There may be
additional domain crossings internal to the implementation.
I don't argue that such crossings don't provide protection value,
just that they can be given up to achieve user level performance,
while still maintaining a wrappable/mappable object-capability
system call interface for the user.
>No. Marcus is quite right here. The state of modern microkernel
>engineering is at the point where a difference of 3-5 cache misses in
>the total IPC path is the difference between winning and losing, and
>this stuff is engineered *way* better in modern systems than your glib
>"smash and eliminate" implies.
Of course there was nothing "glib" about the two years of work that
we put into improving the performance of our production operating
system by eliminating two of the four domain changes internal to
the file read/write calls for the solid state disk case (it didn't matter
for the rotating disk case where the latencies were in the millisecond
range).
I'm delighted if the costs for domain crossings (IPC interfaces)
are lower on today's systems. Great. That gives you an opportunity
to keep more of those internal protection boundaries and still
meet your performance requirements (e.g. your Gigabit Ethernet
example).
I still suggest that if you find yourself in a situation where the
difference between an exchange mandated by an IPC interface
vs. the presumed lower cost of a subroutine call makes a difference,
then you have the option of turning that IPC interface into a
subroutine call - While Still Maintaining the functional value of
the wrappable/mappable object-capability interface for the application
level user.
>I will pay you $500 if you can eliminate 80 cycles (just two L1 cache
>misses) from the production L4 implementation's IPC subsystem in less
>than a year's effort. You get to pick the architecture. It has to be a
>mainstream, modern L4 production implementation (i.e. not one of the
>research experiments). You don't get to alter the interface or
>compromise the L4 protection model (such as it is). You *do* need to
>explain how you did it to collect.
Not my area of expertise or interest. $500 is less than one day's
consulting fee. It would have to be honor and not reward that would
motivate me, but I don't have any stake in that area.
>When Coyotos is up and running, I'll put out a higher bounty for that.
>I'ld put out the higher bounty on EROS, but the production
>implementation of that is a decade stale and uses an obsolete trap
>interface, so it's hardly a challenge.
>
>[Aside: I'ld rather you wait and take the Coyotos challenge. More money
>in it for both of us.:-)]
>
> > Why is it that the comparison is between a
> > "system S" built on a capability model and a native implementation
> > of S, rather than comparable application run on a capability
> > operating system, O? Of course I understand the legacy issues,
> > but if this is always the comparison then of course it will be
> > impossible to compete.
>
>If you believe this, it's time to shut down cap-talk, because we are all
>wasting our time. I don't believe this.
By "compete" I meant win. The cost of emulation in one direction will
always be there. You may be able to make it small enough to compete
in the sense that the value that you add is sufficient to gain market
share, but I believe it will always be possible for the monolithic kernel
to perform better without its cost for emulation than your microkernel
will be able to do with its emulation - even with "cheating" by turning
the internals of your micro-kernel into a monolith to eliminate internal
domain exchange costs.
When stated as above the analogy to the old debates about higher
level languages vs. machine (assembly) level languages comes to
mind. The parallel I see there is the occasional
need to "cheat" by writing assembly level subroutines for the performance
critical elements of common use (e.g. vector operations for scientific
systems). Even with this occasional "cheating" (indeed partly because
of it), higher level languages dominate for programming. I hope to see
the day when micro kernel architectures dominate in much the same
way.
I personally hope that they do so with a "user" interface (recognizing
that what level a "user" is at is a matter of interpretation) that is
an object-capability interface and that is wrappable and has been mapped
into a network interoperable object-capability standard. My writing to
this list is with that goal in mind.
--Jed http://www.webstart.com/jed/