dope-vector based IPC interface
Jonathan S. Shapiro
shap@eros.cis.upenn.edu
Tue, 21 Sep 1999 14:41:18 -0400
What follows should be viewed as a design speculation, not as a
commitment to change the EROS kernel.
I've spent the last week getting EROS to run better in support of the
SOSP paper that will (hopefully) be appearing in December. In the
course of this, I've also had a chance to run some minor performance
tests on the Lava nucleus (a branch of L4 done at IBM). The results
are (temporarily, at least) depressing.
The problem, fundamentally, is that EROS capability invocations
transmit more "stuff" than Lava IPCs do. Some of the difference is
truly essential. If you don't have an indirection in the invocation,
then there is no place to stand to authenticate the invocation.
Recent versions of Lava are moving in this direction (see the paper on
IBM's Lava site at http://www.research.ibm.com/lava). If you use
capabilities rather than redirection tables, you need to check the
capability type. Finally, the specification of EROS's CALL
operation requires that a resume capability be generated, and in the
current design this requires that link chains be updated.
All that having been said, it's not enough to account for the current
performance difference between the two systems. On a 266 MHz Pentium
II, the EROS "null call" IPC between large spaces is currently taking
2.46us, while the equivalent Lava operation takes only 1.56us (these
are one-way times). The 0.90us gap amounts to about 239 cycles, which
is far more than the above differences should take.
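
As a quick check of the cycle figure, the gap can be converted from microseconds to cycles. This is only a sketch: the function name is made up for illustration, and the inputs are the numbers quoted above.

```c
/* Convert a one-way latency gap (in microseconds) to CPU cycles.
 * clock_mhz is the clock rate in MHz, i.e. cycles per microsecond.
 * The function name is illustrative only. */
static double latency_gap_cycles(double slow_us, double fast_us,
                                 double clock_mhz)
{
    return (slow_us - fast_us) * clock_mhz;
}
```

With the figures above, latency_gap_cycles(2.46, 1.56, 266.0) comes to roughly 239.4, which matches the 239 cycles quoted.
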
One of the major differences between the two implementations is that
the Lava/L4 trap interface uses a dope vector. If no string is to be
transmitted, then the dope vector pointer register is empty and the
trap interface can skip the work necessary to do a bounds check on the
string buffer and move the data.
In older versions of EROS (including the one presented at IWOOOS) this
was unnecessary. That system had a smaller capability register set
(16 vs 32), a smaller string bound (4096 vs 65536), and fewer
transmitted registers (1 vs 4). Collectively, these differences allowed
the entire IPC description to fit in the x86 register set (albeit
barely). In more recent versions of EROS this is no longer true.
I am therefore contemplating moving to a dope-vector based design.
Here are some observations about dynamic behavior:
1. In terms of dynamic frequency, SEND (KeyKOS FORK) operations might
as well not exist, so I'll ignore them here.
2. Of the rest, CALL and RETURN invocations have roughly equal
frequency.
3. All invocations specify a key to invoke. This should be
registerized in the trap interface.
4. Aside from the resume capability associated with the CALL
instruction, most invocations neither transmit nor accept
capabilities. The resume capability destination should either be
registerized or must default to some conventional register; I am
presently inclined toward the latter.
5. Of the invocations that move data, most are one way. That is,
either a string is transmitted and a status returned or a request
is transmitted and a string is returned. It is not clear if this
can be leveraged.
6. It remains desirable to have an interface that can be used by a
process whose address space is read-only, but I no longer consider
this compelling given the costs above. Process creation is a
relatively low-frequency event, and more than fast enough in the
current design. If the read-only ability is retained, then the
keyData field and the rcvLen field must be received in registers.
The available registers at the trap interface are:
%EAX, %EBX, %ECX, %EDX
%ESI, %EDI
%EBP -- but only under extreme duress!
Given all of which, here was my first thought for a dope-vector-based
interface:
dope vector (32 bit):
  OFFSET  DESCRIPTION
   0      xmit capabilities
   4      rcv capabilities
   8      xmit string bound
  12      xmit string ptr
  16      rcv string bound
  20      rcv string ptr
  24      xmit word 2
  28      xmit word 3
As far as I can tell, order is not significant.
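
For concreteness, the first-draft layout can be written as a C structure. This is a minimal sketch, not EROS source: the struct and field names are invented here, and it assumes each slot (including the string pointers) is a 32-bit word, as the "32 bit" label and the 4-byte offsets suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First-draft dope vector: eight 32-bit words, 32 bytes in all.
 * Names are illustrative; only the offsets and meanings come from
 * the table above. Pointers are user-space addresses held as
 * 32-bit values. */
struct dope_vec_v1 {
    uint32_t xmit_caps;       /* offset  0: xmit capabilities */
    uint32_t rcv_caps;        /* offset  4: rcv capabilities  */
    uint32_t xmit_str_bound;  /* offset  8: xmit string bound */
    uint32_t xmit_str_ptr;    /* offset 12: xmit string ptr   */
    uint32_t rcv_str_bound;   /* offset 16: rcv string bound  */
    uint32_t rcv_str_ptr;     /* offset 20: rcv string ptr    */
    uint32_t xmit_word2;      /* offset 24: xmit word 2       */
    uint32_t xmit_word3;      /* offset 28: xmit word 3       */
};

/* Compile-time checks that the struct matches the offset table. */
static_assert(offsetof(struct dope_vec_v1, xmit_str_ptr) == 12,
              "xmit string ptr must sit at offset 12");
static_assert(offsetof(struct dope_vec_v1, xmit_word3) == 28,
              "xmit word 3 must sit at offset 28");
static_assert(sizeof(struct dope_vec_v1) == 32,
              "dope vector must be exactly 32 bytes");
```
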
register uses ('in' and 'out' as seen by kernel):
  REGISTER  IN           OUT           Note
  %EAX      xmit word 0  rcv word 0    Order/Result code
  %EBX      xmit word 1  rcv word 1
  %ECX      dopevec ptr  rcv word 2
  %EDX      invoked cap  rcv word 3
  %ESI      <unused>     key data
  %EDI      <unused>     true rcv len  # of bytes received
Invocation type is determined from the trap type. The current
system uses a single trap; the proposed design uses distinct
traps for call, return, and send. The invocation type is
therefore implicitly encoded in the trap number.
On the x86, once you are saving 4 registers you might as well use
PUSHA and save them all. Given this, staring at the two unused
registers got me to thinking, and I concluded that they should be used
as follows:
  REGISTER  IN                 OUT           Note
  %ESI      xmit string ptr    key data
  %EDI      xmit string bound  true rcv len  # of bytes received
The corresponding entries are removed from the dope vector, giving:
  OFFSET  DESCRIPTION
   0      xmit capabilities
   4      rcv capabilities
   8      rcv string bound
  12      rcv string ptr
  16      xmit word 2
  20      xmit word 3
Rationale:
1. If you are receiving a string at all, you are not running in a
read-only address space. Given this, it is possible to specify the
receive-string in a dope vector.
2. In practice, the receive buffer often has constant length and
location, while the transmitted string (if any) does not. Keeping the
xmit string information in the dope vector will force the dope vector
to be updated in more cases.
3. The registers are clobbered anyway, so there's no point to NOT
using them.
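
The revised layout and the register assignments can be put side by side in C. This is a sketch under assumptions: all type, field, and function names are hypothetical, and the register structure models only the 'in' column of the tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Revised dope vector: the xmit string fields have moved to
 * %ESI/%EDI, leaving six 32-bit words. Names are illustrative. */
struct dope_vec_v2 {
    uint32_t xmit_caps;      /* offset  0: xmit capabilities */
    uint32_t rcv_caps;       /* offset  4: rcv capabilities  */
    uint32_t rcv_str_bound;  /* offset  8: rcv string bound  */
    uint32_t rcv_str_ptr;    /* offset 12: rcv string ptr    */
    uint32_t xmit_word2;     /* offset 16: xmit word 2       */
    uint32_t xmit_word3;     /* offset 20: xmit word 3       */
};

/* Register contents as the kernel sees them on entry ('in'). */
struct trap_regs_in {
    uint32_t eax;  /* xmit word 0 (order code)   */
    uint32_t ebx;  /* xmit word 1                */
    uint32_t ecx;  /* dope vector pointer        */
    uint32_t edx;  /* invoked capability         */
    uint32_t esi;  /* xmit string ptr            */
    uint32_t edi;  /* xmit string bound          */
};

static_assert(sizeof(struct dope_vec_v2) == 24,
              "revised dope vector must be 24 bytes");
static_assert(offsetof(struct dope_vec_v2, rcv_str_ptr) == 12,
              "rcv string ptr must sit at offset 12");

/* Build the registers for an invocation that sends no string and
 * supplies no dope vector: only the order code and the invoked
 * capability are filled in. */
static struct trap_regs_in make_null_call(uint32_t cap, uint32_t order)
{
    struct trap_regs_in r = {0};
    r.eax = order;
    r.edx = cap;
    return r;  /* ecx == 0: no dope vector; edi == 0: no string */
}
```

The point of the sketch is that a string-free invocation touches registers only, so nothing in user memory has to be read at trap time.
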
IF THE DOPE VECTOR POINTER IS ZERO, its values are taken to be:
  ELEMENT            VALUE       MEANING
  xmit capabilities  0x0         (none)
  rcv capabilities   0x1f000000  key 3 to last cap register
  rcv string bound   0x0         no string accepted
  rcv string ptr     0x0
  xmit word 2        0x0
  xmit word 3        0x0
The purpose of having the rcv capabilities value be as stated is to
allow a resume capability to be accepted by a process that does not
supply a dope vector. This design prefers capability register 31 as
the resume capability register, but in practice most current
applications seem to prefer that in any case.
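
The zero-pointer defaults can be sketched in C as follows. All names here are hypothetical. The byte-per-key decoding of the rcv capabilities word is an ASSUMPTION inferred from the default value (0x1f in the top byte corresponding to key 3 and register 31); the text does not spell out the encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Revised dope vector fields; names are illustrative. */
struct dope_vec_fields {
    uint32_t xmit_caps;
    uint32_t rcv_caps;
    uint32_t rcv_str_bound;
    uint32_t rcv_str_ptr;
    uint32_t xmit_word2;
    uint32_t xmit_word3;
};

static_assert(sizeof(struct dope_vec_fields) == 24,
              "six 32-bit words expected");

/* Values used when the dope vector pointer is zero. */
static const struct dope_vec_fields default_dope_vec = {
    .xmit_caps     = 0x0,          /* no capabilities sent        */
    .rcv_caps      = 0x1f000000u,  /* key 3 to last cap register  */
    .rcv_str_bound = 0x0,          /* no string accepted          */
    .rcv_str_ptr   = 0x0,
    .xmit_word2    = 0x0,
    .xmit_word3    = 0x0,
};

/* Pick the caller's dope vector, or the defaults if none given. */
static const struct dope_vec_fields *
effective_dope_vec(const struct dope_vec_fields *dv)
{
    return dv != NULL ? dv : &default_dope_vec;
}

/* ASSUMPTION: one byte per received key, key 0 in the low byte
 * and key 3 in the high byte. Returns the destination capability
 * register for the given key slot (0..3). */
static unsigned rcv_cap_reg(uint32_t rcv_caps, unsigned key)
{
    return (rcv_caps >> (8u * key)) & 0xffu;
}
```

Under that assumed encoding, the default word decodes to capability register 31 for key 3, which is the resume capability register preferred above.
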
A good many IPC cases should now be doable without any dope vector at
all, and a good many more using a dope vector that is constant.
FINE PRINT:
If having a register preserved across the trap is useful,
we can assign meaning to the low-order bit of the dope vector
pointer. A value of '1' would signal that the key data field is
desired and should overwrite %ESI. Note that when the xmit string
bound is zero the xmit string ptr is uninterpreted.
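
A minimal sketch of the low-order-bit idea, in C. The macro and function names are invented for illustration, and the sketch ASSUMES the dope vector is at least 2-byte aligned, so that bit 0 of its address is always free to carry the flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: bit 0 of the dope vector pointer (%ECX).
 * Set   -> key data is wanted and will overwrite %ESI.
 * Clear -> %ESI is preserved across the trap. */
#define DV_WANT_KEYDATA 0x1u

/* True if the caller asked for the key data field. */
static bool dv_wants_key_data(uint32_t ecx)
{
    return (ecx & DV_WANT_KEYDATA) != 0;
}

/* Strip the flag bit to recover the dope vector address. */
static uint32_t dv_address(uint32_t ecx)
{
    return ecx & ~DV_WANT_KEYDATA;
}
```

For example, a pointer register value of 0x1001 would mean "dope vector at 0x1000, key data wanted", while 0x1000 would leave %ESI untouched.
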
MORE FINE PRINT:
There is the usual semantic question about order of operation here.
If a process performs an invocation, and the memory area containing
the dope vector is modified before the next request comes in, which of
the values for the 'rcv' fields prevail -- the ones before or the ones
after?
The answer in the proposed design is that the ones BEFORE prevail.
That is, we consider this a situation in which the processor has
logically sourced all operands from memory and moved them (as needed)
to scratch registers. The fixed-point register get/set structure will
include these values and allow them to be modified, much as it does
for the current pseudo registers.
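
The "values BEFORE prevail" rule amounts to latching the receive fields at invocation time. The following C sketch shows the idea; the type and function names are hypothetical, and a real kernel would also have to validate the user pointer, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dope vector as it sits in user memory; names are illustrative. */
struct user_dope_vec {
    uint32_t xmit_caps;
    uint32_t rcv_caps;
    uint32_t rcv_str_bound;
    uint32_t rcv_str_ptr;
    uint32_t xmit_word2;
    uint32_t xmit_word3;
};

static_assert(sizeof(struct user_dope_vec) == 24,
              "six 32-bit words expected");

/* Kernel-side copy of the 'rcv' fields, held like scratch
 * registers as part of the process state. */
struct latched_rcv {
    uint32_t rcv_caps;
    uint32_t rcv_str_bound;
    uint32_t rcv_str_ptr;
};

/* Copy the receive fields once, when the invocation is made.
 * Later writes to the user's dope vector do not affect the copy.
 * A zero pointer selects the defaults given earlier. */
static struct latched_rcv latch_rcv_fields(const struct user_dope_vec *dv)
{
    struct latched_rcv l = { 0x1f000000u, 0u, 0u };
    if (dv != NULL) {
        l.rcv_caps      = dv->rcv_caps;
        l.rcv_str_bound = dv->rcv_str_bound;
        l.rcv_str_ptr   = dv->rcv_str_ptr;
    }
    return l;
}
```

If the process overwrites its dope vector after invoking, the next incoming request is still received according to the latched copy, unless that copy is changed through the register get/set structure.
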