Jonathan S. Shapiro
Thu, 29 May 1997 19:43:16 -0400
[Charlie -- I copied the chunk of previous mail on IPC performance to
the eros-arch list since that is of more general interest.]
> When you say "branch misprediction", do you mean prediction by the
> programmer (making the main path straight-line), or prediction by
> the hardware (which I don't think the MIPS has, and if it did, I
> wouldn't know how to influence it)?
Both. Here is a rundown:
This amounts to a code locality issue. Smart compilers, including
those from MIPS, will take dynamic branch data and reorganize the
code, but collecting such data from a kernel is quite difficult. The
next best solution is to hand-structure the code. In the absence of a
reason (such as branch-taken data) to reorganize the code, most
compilers will generally lay the code out according to the flow
expressed by the source code. Conditionals tend to present barriers
to code motion (not in all cases, but in enough), so the end result is
that the flow you get is the flow you wrote. This is why the goto
hacks matter; you are effectively directing the macro assembler
(i.e. the C compiler) concerning how to place the code.
The faster MIPS chips (R10000, in particular) certainly have fairly
sophisticated branch prediction mechanisms. Some generations of the
R4x00 do as well. Even the simplest chips tend to implement a static
policy, which is either
    bet not-taken if positive displacement (forward branch)
    bet taken if negative displacement (backward branch)
The way to take advantage of this (or at least not work against it) is
to place your code accordingly. Micro-optimizing the conditional
expressions probably doesn't lend itself to code maintenance, but
reorganizing the major blocks may be worth it. Careful documentation
is obviously essential for maintainability.
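To make the placement point concrete, here is a minimal sketch of the
goto hack in plain C (not EROS code; the queue is invented for
illustration). The error test compiles to a forward branch that a
static predictor bets against taking, and the cold code sits below
the hot return path:

    #include <stddef.h>

    struct msg;                         /* opaque payload type */
    struct queue {
        struct msg **slot;
        size_t tail, count, capacity;
        unsigned long overruns;
    };

    int
    queue_put(struct queue *q, struct msg *m)
    {
        if (q->count == q->capacity)    /* rare case: forward branch, */
            goto overflow;              /* statically predicted not-taken */

        q->slot[q->tail] = m;           /* hot path falls straight through */
        q->tail = (q->tail + 1) % q->capacity;
        q->count++;
        return 0;

    overflow:                           /* cold code placed after the */
        q->overruns++;                  /* hot return path */
        return -1;
    }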
> Speaking of scatter/gather, do you think it's possible to do with
> good performance? I assume EROS doesn't do it.
Mixed reactions. First, it's worth asking why scatter/gather is
necessary. I've observed two real situations in which people seem to
think they want it. In both cases, scatter/gather is the easy thing
to do, and in both cases it's the wrong thing to do if you are really
concerned about performance. The cases are large block transfers and
network data buffers.
Some observations about these applications:
First, in both cases the action is invariably between the kernel and
the application. Note that such a kernel/application exchange is very
cheap on all architectures except the x86, and it's even cheap there
relative to the overhead of DMA setup. If the data buffer
synchronization is handled properly, such an exchange requires no TLB
flush and no scheduling decisions.
Second, scatter/gather almost always seems to be used to work around a
data placement problem for the sake of some low-level device driver
(or packet processing code); very rarely is it between two user-level
agents. The harsh reality is that if the hardware is fast enough for
the scatter/gather to help, it's fast enough that doing the copy is
the wrong decision regardless. For such devices, in my opinion, a
shared memory interface to the driver may be called for.
We are experiencing precisely this problem on our 100Mbit enet driver.
Silly shap said "hey, I got a fast kernel, let's see if I can get
away with not buffering the enet input in kernel memory." Problem is
that the card has a small input silo, with the result that one sees
massive numbers of packet overruns. We're soon to redesign this,
after which I should have some solid data for you. The problem, and
the associated data management, are quite different for routers than
for clients, but I'll certainly let everyone know what we find out.
In the extreme case, I may decide to move IP reassembly into the
kernel. I really do not want to do that if I can avoid it, but the
ethernet transmission unit is too bloody small relative to the
traffic, and doing reassembly in the kernel lets routing be handled
without ever moving the data (at all).
Again, I'll let everyone know.
Third: the exception to all of the above statements is what I would
characterize as "metadata". Consider a UNIX system call such as
'read'. This syscall must transmit the following across the privilege
boundary:

    syscall # (read -- order code)
    file descriptor
    buffer length
    <out of line buffer>
In the KeyKOS IPC this doesn't work well, because the file descriptor
has no place to go. In the EROS IPC, the payload includes four data
registers along with the keys and the data string, so the file
descriptor has somewhere to go.
The order code is simply one of the registers. This made a tremendous
difference for app code, and it actually turned out to simplify the
IPC path. We did not find, conversely, that increasing the payload
size to 64K altered things for small transfers; since the original
spec allowed a non-page-aligned payload, we already had a
PTE-validating loop in the path anyway.
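For what it's worth, here is roughly what such a loop looks like;
this is a hedged sketch, not the EROS code, and pte_is_valid() and
PAGE_SIZE stand in for whatever the real kernel provides:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* hypothetical helper: does the current map have a valid PTE
     * for the page containing va? */
    bool pte_is_valid(uintptr_t va);

    /* Returns true iff every page overlapped by [va, va + len) is
     * mapped. Because the payload need not be page aligned, round
     * the start down to a page boundary and walk page by page. */
    bool
    validate_payload(uintptr_t va, size_t len)
    {
        uintptr_t page, end;

        if (len == 0)
            return true;

        page = va & ~(uintptr_t)(PAGE_SIZE - 1);
        end  = va + len;

        for (; page < end; page += PAGE_SIZE)
            if (!pte_is_valid(page))
                return false;
        return true;
    }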
Having said all that, here are the issues I see in scatter/gather:
1. More page faults/opportunity for same. With S the scatter
descriptor and G the gather descriptor, the data transfer alone can
touch

    2 * min(G.len, S.len)/pgsize + 2

pages, each of which is a potential fault (see the sketch after this
list). In principle, I could scatter/gather in 1-byte increments.
2. Temptation to increase the bound on the IPC payload, with the
result that the transfer begins to take a significant amount of time.
This is non-interruptible time, and increasing the size of your
non-interruptible blocks has significantly negative implications for
interrupt latency.
3. Complication of the simple case or loss of locality. This either
becomes an entirely separate transfer path or your common case path is
burdened with complexity it rarely needs.
4. Temptation to make the transfer non-atomic. A significant speedup
could be realized in EROS if I permitted the data to be moved
optimistically. The impact would be that in the event of a page fault
a partially transferred data buffer would be visible. I chose not to
permit this, but I could make a case either way, and I will probably
try it the other way when I retune the path this time. From a
compatibility standpoint, it is easier to go from

    rcv data buffer content can be changed prior to completion

to

    transfers are really atomic

than the other way.
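Here is the sketch promised under (1): a hedged illustration, using a
hypothetical iovec-style descriptor rather than any real EROS
interface, of how quickly the potential fault count grows:

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* hypothetical iovec-style descriptor, for illustration only */
    struct seg { uintptr_t base; size_t len; };

    /* pages overlapped by one segment: the full pages plus up to
     * one partial page at each unaligned end */
    static size_t
    pages_touched(struct seg s)
    {
        uintptr_t first, last;

        if (s.len == 0)
            return 0;
        first = s.base & ~(uintptr_t)(PAGE_SIZE - 1);
        last  = (s.base + s.len - 1) & ~(uintptr_t)(PAGE_SIZE - 1);
        return (size_t)((last - first) / PAGE_SIZE) + 1;
    }

    /* Worst case: every source and destination page is a potential
     * fault.  For one unaligned gather/scatter segment pair of
     * equal length this is 2 * min(G.len, S.len)/pgsize + 2. */
    size_t
    worst_case_faults(const struct seg *gather, size_t ng,
                      const struct seg *scatter, size_t ns)
    {
        size_t i, n = 0;

        for (i = 0; i < ng; i++)
            n += pages_touched(gather[i]);
        for (i = 0; i < ns; i++)
            n += pages_touched(scatter[i]);
        return n;
    }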
Can you shed some light on what sorts of problems are motivating
scatter/gather? Such info is useful to eros-arch, but if you/Tandem
don't want it to circulate that far please feel free not to cc
eros-arch and I shall respond in kind.
> The optimal layout of parameters in registers was one of my big questions;
> I see it's not a simple question for you either. Isn't it fair, though,
> to count the performance of the stubs? Shouldn't one have fast-path
> stubs for the common cases?
It's more subtle than I first thought -- the part I found less than
obvious was that bundling values to reduce register saves is in
fact a net loss.
It's tempting, for example, to say: "I'll bundle the send keys and rcv
keys into a single register, so as to need to save one fewer
register." It turns out to be the wrong call. Here's how it impacts
    bundled                          unbundled
    -------                          ---------
    1 store in caller                2 stores in caller
    1 store in receiver              2 stores in receiver
    1 shift in IPC path
    2 mask operations in IPC path
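For concreteness, the bundling being costed looks like this in C (the
field widths are invented for illustration):

    #include <stdint.h>

    /* packing the two counts into one word saves one register
     * save/restore on each side of the call... */
    #define BUNDLE(snd, rcv) \
        (((uint32_t)(snd) << 16) | ((uint32_t)(rcv) & 0xffffu))

    /* ...but the IPC path must take the word apart again: */
    static inline uint32_t
    snd_keys_of(uint32_t b) { return (b >> 16) & 0xffffu; } /* shift + mask */

    static inline uint32_t
    rcv_keys_of(uint32_t b) { return b & 0xffffu; }         /* mask */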
The stores in either column are to the stack, so with almost perfect
likelihood they hit in the cache; if not, you will certainly already
have stalled the d-cache store buffer on all but the very newest
chips, and the marginal stores aren't really the problem. On the x86,
due to the dearth of registers, it is actually faster NOT to use a
register-based convention for some parts of the IPC descriptor. For
values that do not change -- sent keys, rcv keys, send len, send ptr,
rcv ptr -- it proves cheaper to write these to the stack and then copy
them in the IPC path into pseudo-registers. If you hit in the cache,
the respective costs are:
    Stack convention (per value):
      1 load (or load of a constant) to a scratch register, if the
        value is not already in a register
      1 store per value to write to the stack
      1 load per value to reload from the stack
      1 store per value to stash in a pseudo-register

      ... 5 repetitions of the above sequence, one per value ...

    Register convention (per register used):
      1 store of the designated register for later reload
      1 load (or load of a constant) to the designated register
      1 load on the way back to recover the saved register from the
        user stack
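As a sketch of the pseudo-register trick (with an invented descriptor
layout; the real EROS descriptor differs):

    #include <stdint.h>

    /* hypothetical layout, for illustration only */
    struct ipc_desc {               /* written by the caller to its stack */
        uint32_t  snd_keys;
        uint32_t  rcv_keys;
        uint32_t  snd_len;
        uintptr_t snd_ptr;
        uintptr_t rcv_ptr;
    };

    /* kernel-private copy: the "pseudo-register" file */
    static struct ipc_desc cur;

    void
    ipc_path_enter(const struct ipc_desc *caller_desc)
    {
        /* one load and one store per value; the caller has just
         * written these words to its stack, so the loads almost
         * certainly hit in the cache */
        cur = *caller_desc;

        /* ... the rest of the IPC path reads 'cur' as if it were
         * a register file ... */
    }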
The difference is that on the x86 the IPC requires that something be
expressed indirectly anyway for lack of registers. Once you have to
bite the bullet, you might as well plan it that way from the beginning.
As to choice of registers, this depends on user-level usability and
how tightly you want to be bound to the compiler calling conventions.
It is preferable to use caller-saved registers for those values that
are read-only within the IPC path (i.e. not the 4 transferred regs, the
length, or the key info [key data] value). For a similar reason, it
is quite helpful for transferred register zero, which is by convention
the return code register, to coincide with the register used in the
calling convention to return function values.
Calling conventions, however, vary across compilers even within an
architecture, and even across system revisions. This may later raise
compatibility issues. I believe that the Tandem (MIPS)
compiler, for example, does NOT use the same calling convention as the
GNU MIPS compiler.