IPC performance
Jonathan S. Shapiro
shap@eros.cis.upenn.edu
Thu, 29 May 1997 16:24:12 -0400
[ Following is a segment of mail I sent to Charlie Landau concerning issues
in IPC performance. This portion may have more general interest. The
context is that we're starting to think about the IPC path for
the MIPS chip.]
Here's a quick summary of my recollections about the optimizations [in
the C++ path], which I will expand on in a tech note with the code:
The saving grace on all this is that there just aren't that many
choices in the IPC path -- unless the IPC has gobs of options the
control-flow is of the form:
    check all the reasons why this won't work and bail
    do any inline options (e.g. in EROS substituting keeper key
        for red seg key or DK(0) for key to malformed domain)
    verify send/rcv buffers (assumes send/rcv keys/ports are in
        kernel memory and not pageable, or are in core exactly
        if process is in core)
    last chance -- verify that you didn't zap any PTE's on the way
        to validating others (e.g. effects of core pressure).
        not always necessary
    COMMIT
    proceed with xfer
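The shape of that path can be sketched in C roughly as follows. This is only an illustration under assumptions: the names (ipc_msg, ipc_send, buffer_ok, the status codes) are hypothetical and are not EROS code, and the checks are stubs. The point is that every test that can fail runs before the commit, and nothing is modified until then.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message descriptor; not the EROS layout. */
typedef struct {
    const char *snd_buf;  size_t snd_len;
    char       *rcv_buf;  size_t rcv_len;
    int target_ready;    /* stand-in for "receiver is available"      */
    int mappings_valid;  /* stand-in for "no PTE got zapped meanwhile" */
} ipc_msg;

enum { IPC_OK = 0, IPC_RETRY = 1, IPC_FAULT = 2 };

/* Stub for the real buffer validation (walking mapping tables etc.). */
static int buffer_ok(const void *p, size_t len)
{
    return len == 0 || p != NULL;
}

int ipc_send(ipc_msg *m, size_t *copied)
{
    size_t n;

    assert(m != NULL && copied != NULL);

    /* 1. Check the reasons this won't work, and bail. */
    if (!m->target_ready)
        return IPC_RETRY;

    /* 2. Inline options (key substitution and the like) would go here. */

    /* 3. Verify send/rcv buffers. */
    if (!buffer_ok(m->snd_buf, m->snd_len) ||
        !buffer_ok(m->rcv_buf, m->rcv_len))
        return IPC_FAULT;

    /* 4. Last chance: did validating one thing invalidate another? */
    if (!m->mappings_valid)
        return IPC_RETRY;

    /* COMMIT: no failure exits past this point. */
    n = m->snd_len < m->rcv_len ? m->snd_len : m->rcv_len;
    if (n != 0)
        memcpy(m->rcv_buf, m->snd_buf, n);
    *copied = n;
    return IPC_OK;
}
```

Truncating to the shorter of the two lengths is a choice made for the sketch, not something the path described above prescribes.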
Cache localization is relatively immune to the number of cache sets. Fine-grain
analysis is really only necessary if you are stuck with a 1-set
D-cache. The whole path will fit comfortably in a modern 1-set
I-cache.
The critical optimizations are as follows:
    I-cache residency
    Data contiguity -- get all the kernel variables referenced to
        be adjacent. There shouldn't be many.
    address validity probe -- it pays to walk the hardware tables
        (if any) in SW first on X86, and only then appeal to
        the official mapping structures for pagein. The rcv
        and send buffers are usually valid and in-core.
    eschew branch misprediction
    gather all the code to adjacent addresses by controlling order
        of object files on link line (I-cache conflicts)
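As an illustration of the data contiguity item, one way to get the handful of kernel variables adjacent is to put them in a single aligned structure. The field names and the 64-byte line size below are assumptions made for the sketch, not EROS definitions, and the alignment specifier is C11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed D-cache line size; check your CPU */

/* Hypothetical: everything the IPC fast path reads, kept together. */
struct ipc_hot {
    void     *current_process;
    void     *invokee;
    uint32_t  preempt_pending;
    uint32_t  ipc_count;
};

/* Fail the build if someone grows the structure past one line. */
static_assert(sizeof(struct ipc_hot) <= CACHE_LINE,
              "IPC hot variables no longer fit in one cache line");

/* Aligning the single instance keeps it from straddling two lines. */
static _Alignas(CACHE_LINE) struct ipc_hot ipc_hot_vars;

/* Runtime check that first and last byte land in the same line. */
int hot_vars_share_a_line(void)
{
    uintptr_t first = (uintptr_t)&ipc_hot_vars;
    uintptr_t last  = first + sizeof ipc_hot_vars - 1;
    return first / CACHE_LINE == last / CACHE_LINE;
}
```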
The rest is interrupt handler tricks. (FYI, writing that code to take
delay slots into account can be challenging and matters a lot. It's
harder on X86 due to the lack of enough scratch registers.)
The big item in I-cache residency is careful use of the INLINE
compiler directive and recognizing that all the failure cases are low
likelihood or expensive in any event. Unfortunately, C lacks a way to
annotate this, so compilers cannot organize code accordingly. As a
result, one ends up with code of the form:
    if (abort_condition_1)
        goto handle_abort_condition_1;
    if (unusual_inline_condition_1)
        goto handle_unusual_inline_condition_1;
  return_from_unusual_inline_condition_1:
    /* ... main path continues ... */
This amounts to using C as an assembler. It's ugly as sin, but makes
a factor of 5-10 difference in I-cache misses, which definitely adds
up. It's especially bad in C++, because such gotos are not permitted
to span variable instantiations, which perverts things. In the face
of constructors this can cause unneeded computation to get hoisted
into the main path due to language restrictions. In your situation I
would strongly advise use of C for such mission-critical code.
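A complete, compilable instance of that pattern might look like the sketch below. The function, its arguments, and the "fixup" it performs are hypothetical and chosen only to show the layout: the common case runs straight through, and the unlikely handlers sit after the return. All locals are declared at the top so that no goto crosses an initialization, which is the restriction that bites in C++. A compiler is free to reorder blocks, so check the generated code.

```c
#include <stddef.h>

enum { XFER_OK = 0, XFER_ABORT = -1 };

/* Hypothetical transfer routine; only the control-flow shape matters. */
int transfer(const int *src, int *dst, size_t n, int needs_fixup)
{
    size_t i;   /* declared up front: no goto jumps over an initializer */

    if (src == NULL || dst == NULL)
        goto handle_abort;
    if (needs_fixup)
        goto handle_fixup;
resume_after_fixup:

    /* Main path: falls straight through in the common case. */
    for (i = 0; i < n; i++)
        dst[i] = src[i];
    return XFER_OK;

    /* Cold code, kept out of the main path's cache lines. */
handle_fixup:
    if (n > 4)
        n = 4;          /* stand-in for some unusual inline adjustment */
    goto resume_after_fixup;

handle_abort:
    return XFER_ABORT;
}
```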
Ugly things like this got my earlier C++ path from 45+us down to
6.9us, where the hand path gets about 2.4us. Numbers for the current
one will be different, since it gets more stuff right :-) I actually
found that given the inlining feature it was NOT necessary to alter
the (reasonably) nicely decomposed functions I had written in this
path.
Finally, the common-case key encodings (send no keys, rcv only rsm
key) were worth special-casing -- mispredicted branches hurt a lot.
Note that this particular optimization did not interact well with
inlining, because the goto tricks above do not inline expand in the
desired ways. I did not bother to fix this case, as my I-cache miss
rate was already zero in the benchmark.
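To make the special-casing concrete, here is a sketch with an invented encoding: the low nibble says which of four key slots the sender sends, the high nibble says which slots the receiver accepts, and slot 3 carries the resume key. None of this is the actual EROS encoding. The common case is recognized with one compare, and the general loop handles everything else.

```c
#include <stdint.h>

/* Hypothetical key-transfer encoding, for illustration only. */
#define SND_KEYS(enc)   ((enc) & 0x0Fu)
#define RCV_KEYS(enc)   (((enc) >> 4) & 0x0Fu)
#define RESUME_SLOT     3
#define COMMON_ENCODING 0x80u   /* send no keys, receive only slot 3 */
#define NULL_KEY        0u

typedef struct { uint32_t key[4]; } key_regs;

void move_keys(uint8_t enc, const key_regs *snd, key_regs *rcv,
               uint32_t resume_key)
{
    unsigned i;

    /* Common case: one compare, one store, no loop. */
    if (enc == COMMON_ENCODING) {
        rcv->key[RESUME_SLOT] = resume_key;
        return;
    }

    /* General case: fill each slot the receiver accepts. */
    for (i = 0; i < 4; i++) {
        if (!(RCV_KEYS(enc) & (1u << i)))
            continue;                       /* receiver ignores slot */
        if (i == RESUME_SLOT)
            rcv->key[i] = resume_key;
        else if (SND_KEYS(enc) & (1u << i))
            rcv->key[i] = snd->key[i];      /* sender supplied a key */
        else
            rcv->key[i] = NULL_KEY;         /* accepted but not sent */
    }
}
```

The general loop produces the same result for the common encoding, so the fast path is purely an optimization and can be removed when debugging.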
If you are *not* doing an object-style interface to the kernel it
simplifies things a lot, because you know that the inbound stuff and
the outbound stuff are going to distinct places (i.e. no worry about
permute keys operation), so you don't have to work around overlap
issues. It's reasonable to say that if the inbound data buffer and
the outbound data buffer overlap the user gets scrambled eggs and
tough on them. If it's undefined in the first implementation you can
safely define it in the second round if it's a problem, but getting
this right is a pain in the butt, especially if doing scatter/gather.
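If you do decide in a later round to detect the overlap case rather than leave it undefined, the test for one contiguous pair of buffers is cheap. The helper below is a hypothetical sketch, not part of any kernel interface; with scatter/gather it would have to be applied to every pair of fragments, which is where the pain comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if [a, a+a_len) and [b, b+b_len) share at least one byte.
 * Comparing differences instead of sums avoids address overflow. */
int buffers_overlap(const void *a, size_t a_len,
                    const void *b, size_t b_len)
{
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return 0;                   /* an empty buffer overlaps nothing */
    if (a0 < b0)
        return b0 - a0 < a_len;     /* b starts inside a */
    return a0 - b0 < b_len;         /* a starts inside b */
}
```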