dope-vector based IPC interface

Jonathan S. Shapiro shap@eros.cis.upenn.edu
Tue, 21 Sep 1999 14:41:18 -0400


What follows should be viewed as a design speculation, not as a
commitment to change the EROS kernel.


I've spent the last week getting EROS to run better in support of the
SOSP paper that will (hopefully) be appearing in December.  In the
course of this, I've also had a chance to run some minor performance
tests on the Lava nucleus (a branch of L4 done at IBM).  The results
are (temporarily, at least) depressing.

The problem, fundamentally, is that EROS capability invocations
transmit more "stuff" than Lava IPCs do. Some of the difference is
truly essential.  If you don't have an indirection in the invocation,
then there is no place to stand to authenticate the invocation.
Recent versions of Lava are moving in this direction (see the paper on
IBM's Lava site at http://www.research.ibm.com/lava).  If you use
capabilities rather than redirection tables, you need to check the
capability type.  Finally, the specification of EROS's CALL
operation requires that a resume capability be generated, and in the
current design this requires that link chains be updated.

All that having been said, it's not enough to account for the current
performance difference between the two systems.  On a 266MHz Pentium
II, the EROS "null call" IPC between large spaces is currently taking
2.46us, while the equivalent Lava operation takes only 1.56us (these
are one-way times).  The 0.90us gap amounts to about 239 cycles, which
is far more than the above differences should take.
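
The cycle figure follows directly from the two one-way times and the
clock rate.  A small sketch of that arithmetic (the function names are
illustrative, not taken from either kernel):

```c
/* At N MHz the clock delivers N cycles per microsecond, so a duration
 * in microseconds times the clock rate in MHz is a cycle count. */
double cycles_for(double microseconds, double clock_mhz)
{
    return microseconds * clock_mhz;
}

/* Gap between the EROS and Lava null-call times quoted above. */
double null_call_gap_cycles(void)
{
    const double eros_us   = 2.46;   /* EROS one-way null call */
    const double lava_us   = 1.56;   /* Lava one-way null call */
    const double clock_mhz = 266.0;  /* 266MHz Pentium II      */

    return cycles_for(eros_us - lava_us, clock_mhz);
}
```

The result is a little over 239 cycles per one-way invocation.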

One of the major differences between the two implementations is that
the Lava/L4 trap interface uses a dope vector.  If no string is to be
transmitted, then the dope vector register pointer is empty and the
trap interface can skip the work necessary to do a bounds check on the
string buffer and move the data.

In older versions of EROS (including the one presented at IWOOOS) this
was unnecessary.  That system had a smaller capability register set
(16 vs 32), a smaller string bound (4096 vs 65536), and fewer
transmitted registers (1 vs 4). Collectively, these differences allowed
the entire IPC description to fit in the x86 register set (albeit
barely). In more recent versions of EROS this is no longer true.

I am therefore contemplating moving to a dope-vector based design.


Here are some observations about dynamic behavior:

    In terms of dynamic frequency, SEND (KeyKOS FORK) operations might
    as well not exist, so I'll ignore them here.  

    Of the rest, CALL and RETURN invocations have roughly equal
    frequency.

    All invocations specify a key to invoke.  This should be
    registerized in the trap interface.

    Aside from the resume capability associated with the CALL
    instruction, most invocations neither transmit nor accept
    capabilities.  The resume capability destination should either be
    registerized or must default to some conventional register; I am
    presently inclined toward the latter.

    Of the invocations that move data, most are one way.  That is,
    either a string is transmitted and a status returned or a request
    is transmitted and a string is returned.  It is not clear if this
    can be leveraged.

    It remains desirable to have an interface that can be used by a
    process whose address space is read-only, but I no longer consider
    this compelling given the costs above.  Process creation is
    a relatively low-frequency event, and more than fast enough in the
    current design.  If the read-only ability is retained, then the
    keyData field and the rcvLen field must be received to registers.

The available registers at the trap interface are:

    %EAX, %EBX, %ECX, %EDX
    %ESI, %EDI
    %EBP -- but only under extreme duress!


Given all of which, here was my first thought for a dope-vector-based
interface:

    dope vector (32 bit):

       OFFSET      DESCRIPTION
       0           xmit capabilities
       4           rcv capabilities
       8           xmit string bound
       12          xmit string ptr
       16          rcv string bound
       20          rcv string ptr
       24          xmit word 2
       28          xmit word 3

       as far as I can tell, order is not significant

    register uses ('in' and 'out' as seen by kernel)

       REGISTER     IN                 OUT           Note
       %EAX         xmit word 0        rcv word 0    Order/Result code
       %EBX         xmit word 1        rcv word 1
       %ECX         dopevec ptr        rcv word 2
       %EDX         invoked cap        rcv word 3
       %ESI         <unused>           key data
       %EDI         <unused>           true rcv len  # of bytes received

    Invocation type is determined from the trap type.  In the current
    system a single trap is used; in the proposed design distinct
    traps are used for call, return, and send.  The invocation type is
    therefore implicitly encoded in the trap number.


On the x86, once you are saving 4 registers you might as well use
PUSHA and save them all.  Given this, staring at the two unused
registers got me to thinking, and I concluded that they should be used
as follows:

       REGISTER     IN                 OUT           Note
       %ESI         xmit string ptr    key data
       %EDI         xmit string bound  true rcv len  # of bytes received

The corresponding entries are removed from the dope vector, giving:

       OFFSET      DESCRIPTION
       0           xmit capabilities
       4           rcv capabilities
       8           rcv string bound
       12          rcv string ptr
       16          xmit word 2
       20          xmit word 3
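
A minimal sketch of how the revised layout might look as a C structure.
The structure and field names are hypothetical; only the offsets and
the 32-bit field width come from the table above.  The static
assertions simply check that the compiler lays the fields out at the
listed offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the revised dope vector (all names invented). */
struct dope_vector {
    uint32_t xmit_caps;      /* offset  0: xmit capabilities         */
    uint32_t rcv_caps;       /* offset  4: rcv capabilities          */
    uint32_t rcv_str_bound;  /* offset  8: rcv string bound          */
    uint32_t rcv_str_ptr;    /* offset 12: rcv string ptr (user VA)  */
    uint32_t xmit_word2;     /* offset 16: xmit word 2               */
    uint32_t xmit_word3;     /* offset 20: xmit word 3               */
};

static_assert(offsetof(struct dope_vector, xmit_caps)     ==  0, "xmit caps");
static_assert(offsetof(struct dope_vector, rcv_caps)      ==  4, "rcv caps");
static_assert(offsetof(struct dope_vector, rcv_str_bound) ==  8, "rcv bound");
static_assert(offsetof(struct dope_vector, rcv_str_ptr)   == 12, "rcv ptr");
static_assert(offsetof(struct dope_vector, xmit_word2)    == 16, "word 2");
static_assert(offsetof(struct dope_vector, xmit_word3)    == 20, "word 3");
static_assert(sizeof(struct dope_vector) == 24, "six 32-bit words");
```

The xmit string ptr and bound do not appear because they travel in
%ESI and %EDI.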


Rationale:

1. If you are receiving a string at all, you are not running in a
read-only address space.  Given this, it is possible to specify the
receive-string in a dope vector.

2. In practice, the receive buffer often has constant length and
location, while the transmitted string (if any) does not.  Keeping the
xmit string information in the dope vector will force the dope vector
to be updated in more cases.

3. The registers are clobbered anyway, so there's no point to NOT
using them.

IF THE DOPE VECTOR POINTER IS ZERO, its values are taken to be:

       ELEMENT              VALUE       MEANING
       xmit capabilities    0x0         (none)
       rcv capabilities     0x1f000000  key 3 to last cap register
       rcv string bound     0x0         no string accepted
       rcv string ptr       0x0
       xmit word 2          0x0
       xmit word 3          0x0

The purpose of having the rcv capabilities value be as stated is to
allow a resume capability to be accepted by a process that does not
supply a dope vector.  This design prefers capability register 31 as
the resume capability register, but in practice most current
applications seem to prefer that in any case.
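
The zero-pointer rule can be sketched as a lookup that substitutes a
constant default vector.  All names here are hypothetical.  The helper
rcv_cap_register() additionally ASSUMES that the rcv capabilities word
holds one byte per received key, with byte k naming the destination
register for key k; the value 0x1f000000 suggests this encoding, but
the text does not spell it out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the revised dope vector (names invented). */
struct dope_vector {
    uint32_t xmit_caps;
    uint32_t rcv_caps;
    uint32_t rcv_str_bound;
    uint32_t rcv_str_ptr;
    uint32_t xmit_word2;
    uint32_t xmit_word3;
};

/* Values assumed when the caller passes a zero dope vector pointer. */
static const struct dope_vector default_dope_vector = {
    .xmit_caps     = 0x0,          /* no capabilities sent           */
    .rcv_caps      = 0x1f000000u,  /* key 3 to last cap register     */
    .rcv_str_bound = 0x0,          /* no string accepted             */
    .rcv_str_ptr   = 0x0,
    .xmit_word2    = 0x0,
    .xmit_word3    = 0x0,
};

/* Return the vector to act on: the caller's, or the defaults. */
const struct dope_vector *
effective_dope_vector(const struct dope_vector *user_dv)
{
    return user_dv != NULL ? user_dv : &default_dope_vector;
}

/* ASSUMPTION: byte k of rcv_caps is the register for received key k. */
unsigned rcv_cap_register(uint32_t rcv_caps, unsigned key_index)
{
    assert(key_index < 4u);
    return (rcv_caps >> (8u * key_index)) & 0xffu;
}
```

Under that assumed encoding, a process that supplies no dope vector
receives key 3 (the resume capability slot) in capability register 31.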

A good many IPC cases should now be doable without any dope vector at
all, and a good many more using a dope vector that is constant.

FINE PRINT:

If having a register preserved across the trap is useful, we can
assign meaning to the low-order bit of the dope vector pointer. A
value of '1' would signal that the key data field is desired and
should overwrite %ESI.  Note that when the xmit string bound is zero
the xmit string ptr is not interpreted.
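
A sketch of how the low-order bit could be separated from the pointer
value.  The flag name, structure, and function are hypothetical, and
the sketch assumes the dope vector is at least 2-byte aligned so that
bit 0 of a genuine address is always zero.

```c
#include <stdint.h>

#define DV_WANT_KEYDATA 0x1u  /* hypothetical name for the flag bit */

/* Decoded form of the value passed in the dope vector register. */
struct dv_arg {
    uint32_t dv_addr;        /* user address of dope vector, 0 if none  */
    int      want_key_data;  /* nonzero: key data should overwrite %ESI */
};

/* Split the incoming register value into address and flag. */
struct dv_arg decode_dv_register(uint32_t reg_value)
{
    struct dv_arg a;

    a.want_key_data = (reg_value & DV_WANT_KEYDATA) != 0;
    a.dv_addr       = reg_value & ~(uint32_t)DV_WANT_KEYDATA;
    return a;
}
```

A register value of exactly 1 then decodes to "no dope vector, but
deliver key data", and a value of 0 leaves %ESI preserved.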


MORE FINE PRINT:

There is the usual semantic question about order of operation here.
If a process performs an invocation, and the memory area containing
the dope vector is modified before the next request comes in, which of
the values for the 'rcv' fields prevail -- the ones before or the ones
after?

The answer in the proposed design is that the ones BEFORE prevail.
That is, we consider this a situation in which the processor has
logically sourced all operands from memory and moved them (as needed)
to scratch registers.  The fixed-point register get/set structure will
include these values and allow them to be modified, much as it does
for the current pseudo registers.
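
The BEFORE rule can be sketched as a copy made at invocation time.
The names below are hypothetical, and the kernel-side snapshot
structure merely stands in for the "scratch registers" described
above.  Once the rcv fields are latched, later stores to the user's
dope vector have no effect on the pending receive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the revised dope vector (names invented). */
struct dope_vector {
    uint32_t xmit_caps;
    uint32_t rcv_caps;
    uint32_t rcv_str_bound;
    uint32_t rcv_str_ptr;
    uint32_t xmit_word2;
    uint32_t xmit_word3;
};

/* Hypothetical per-process kernel state holding the latched values. */
struct rcv_snapshot {
    uint32_t rcv_caps;
    uint32_t rcv_str_bound;
    uint32_t rcv_str_ptr;
};

/* At invocation time, source the rcv operands from memory once. */
void latch_rcv_fields(struct rcv_snapshot *snap,
                      const struct dope_vector *dv)
{
    assert(snap != NULL && dv != NULL);
    snap->rcv_caps      = dv->rcv_caps;
    snap->rcv_str_bound = dv->rcv_str_bound;
    snap->rcv_str_ptr   = dv->rcv_str_ptr;
}
```

The incoming request is then delivered using the snapshot only; a
register get/set interface would read and write the snapshot rather
than the user's memory.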