Revised IPC Design

This note describes the second iteration of the EROS IPC interface. The revised IPC iterates on the first design to improve the performance of some common cases. The change itself isn't especially earth-shattering, but thinking through the implementation on a register-starved architecture was tricky.

1. The Original IPC

The first-round EROS IPC mechanism had the following interface:

Invocation: Call, Return, or Fork + key space address of key to be invoked
Entry Block: Opcode, slot numbers of the four keys to be sent, xmit data buffer pointer, xmit data buffer length.
Exit Block: Opcode, slot numbers of the four key slots to be overwritten, rcv data buffer pointer, rcv data limit.

On return, the entry receive limit field is overwritten with min(send_len,rcv_len), and the 16-bit key data field of the invoked key is provided to the receiver. The send length is limited to 65535 bytes.
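The length rule above can be sketched in a few lines of C. This is only an illustration: the names IPC_MAX_XFER and ipc_xfer_len are hypothetical, and the only facts taken from the text are the 16-bit field width and the min(send_len,rcv_len) rule.

```c
#include <stdint.h>

/* Hypothetical name: the largest value a 16-bit length field can hold. */
#define IPC_MAX_XFER 65535u

/* Bytes actually moved by the transfer: min(send_len, rcv_limit).
 * Both arguments are 16-bit, so neither can exceed IPC_MAX_XFER. */
static uint16_t ipc_xfer_len(uint16_t send_len, uint16_t rcv_limit)
{
    return send_len < rcv_limit ? send_len : rcv_limit;
}
```

Because both fields are 16 bits wide, a full 65536-byte transfer cannot be expressed, which is the first limitation discussed next.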

This interface suffers from a number of unfortunate limitations:

  • The 65535 limit is awkward. It would be preferable to permit a 65536 (or more generally some bounded power of 2) limit. On register-starved architectures the send and receive length fields were regrettably limited to 16 bits.

  • In many cases the buffer wants to be accompanied by a word or two of descriptive information in addition to the opcode, such as an object descriptor. The lack of data words in addition to the buffer was forcing a marginal copy of the data buffer.

  • OS emulation was complicated by the need to contiguize (if that's a word) the data. Many UNIX and Win32 system calls consist of a few arguments plus a single buffer. The marginal copy was hurting (this is really the previous case in disguise).

  • Several performance-critical invocations, most notably of the space bank and the segment keeper, required a data buffer to transmit the values of one or two registers. Unfortunately, the overhead of validating the receive and send areas greatly increases the message transfer cost. On a 120 MHz Pentium, the cost goes from 2.4 us (no buffer) to 3.5 us. While this may not seem like a lot, the marginal cost is entirely memory delay, and will not scale with processor speed.
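To put the last point in numbers, here is a small back-of-envelope sketch in C. The two timings and the clock rate come from the text. The assumption that the no-buffer path speeds up perfectly with the processor while the buffer overhead stays fixed is mine, purely for illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Figures from the text, in nanoseconds, measured on a 120 MHz Pentium. */
enum { NO_BUF_NS = 2400, WITH_BUF_NS = 3500, CLOCK_MHZ = 120 };

/* Marginal cost of validating the send and receive areas, in ns. */
static int buffer_overhead_ns(void)
{
    return WITH_BUF_NS - NO_BUF_NS;
}

/* Convert a duration in ns to cycles at the given clock rate. */
static int ns_to_cycles(int ns, int clock_mhz)
{
    return ns * clock_mhz / 1000;
}

/* Percentage of the total cost taken by the buffer overhead if the
 * processor gets `speedup` times faster.
 * ASSUMPTION: the no-buffer path scales perfectly and the overhead
 * (memory delay) does not scale at all. Result is truncated. */
static int overhead_share_pct(int speedup)
{
    assert(speedup >= 1);
    int cpu_ns = NO_BUF_NS / speedup;
    int mem_ns = buffer_overhead_ns();
    return 100 * mem_ns / (cpu_ns + mem_ns);
}
```

Under that assumption the overhead is 1100 ns, or 132 cycles at 120 MHz, and its share of the total grows from roughly a third today to roughly half on a processor twice as fast.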

In consequence, we have redesigned the IPC interface. The new interface transmits more information for essentially the same cost. For the most part, we have achieved this by trading off the cost of field extraction for cache-resident data motion. The resulting interface is a bit slower in low frequency cases, but considerably faster in the cases that prove to matter.

2. The Revised Interface

The revised EROS IPC mechanism has the following interface:

Invocation: Call, Return, or Fork + key space address of key to be invoked, number of registers sent and received.
Entry Block: Four registers to be sent, slot numbers of the four keys to be sent, xmit data buffer pointer, xmit data buffer length.
Exit Block: Four registers to be received, slot numbers of the four key slots to be overwritten, rcv data buffer pointer, rcv data limit.

In the new interface, the receive buffer length has been extended to a 32-bit field. While the current interface restricts the transfer to 64K, this limit could be compatibly increased in future implementations.
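For readers who prefer to see the interface as data structures, the following is a rough C rendering of the three parts listed above. Every type and field name is hypothetical. Only the list of fields comes from the text, and the field widths (other than the 32-bit receive limit) are assumptions, as is treating the 64K limit as an inclusive bound.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names are hypothetical; only the field list comes from the text. */
enum { MSG_REGS = 4, MSG_KEYS = 4 };

/* "64K" from the text. ASSUMPTION: the bound is inclusive. */
#define XFER_LIMIT (64u * 1024u)

typedef enum { INV_CALL, INV_RETURN, INV_FORK } InvKind;

typedef struct {
    InvKind  kind;         /* Call, Return, or Fork */
    uint32_t invokedKey;   /* key space address of the key to be invoked */
    uint8_t  nRegsSent;    /* number of registers sent */
    uint8_t  nRegsRcvd;    /* number of registers received */
} Invocation;

typedef struct {
    uint32_t    regs[MSG_REGS];      /* four registers to be sent */
    uint8_t     keySlots[MSG_KEYS];  /* slots of the four keys to be sent */
    const void *xmitPtr;             /* xmit data buffer pointer */
    uint32_t    xmitLen;             /* xmit length (width is an assumption) */
} EntryBlock;

typedef struct {
    uint32_t regs[MSG_REGS];         /* four registers to be received */
    uint8_t  keySlots[MSG_KEYS];     /* key slots to be overwritten */
    void    *rcvPtr;                 /* rcv data buffer pointer */
    uint32_t rcvLimit;               /* rcv data limit, now a 32-bit field */
} ExitBlock;

/* Sanity check a sender might apply before trapping to the kernel. */
static bool entry_block_ok(const Invocation *inv, const EntryBlock *e)
{
    return inv->nRegsSent <= MSG_REGS && e->xmitLen <= XFER_LIMIT;
}
```

Keeping the limit in a single constant reflects the point made above: the field is already wide enough, so raising the limit later would not change the layout.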

On register-rich architectures, including most RISC machines, there is no problem with taking up more registers. While the sender must save, zero, and later restore any of the four message registers that it does not wish to transmit, there are several ameliorating considerations:

  • The save and restore are cache-resident with high likelihood.
  • Given cache residence, the marginal cost of the save and restore is balanced by a marginal savings obtained by eliminating some bitfield extraction instructions in the kernel IPC path.
  • Many performance-critical cases that previously required a data buffer no longer require one. Memory latency is bad, so this is a good tradeoff.
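The sender-side bookkeeping described above can be modeled in plain C. This is a sketch only: real code would do this in the assembly stub around the trap, and the names regs_prepare and regs_restore are hypothetical. The array `live` stands in for the four message registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_MSG_REGS = 4 };

/* Save the current contents of the four message registers, load the
 * first nSend message words, and zero the registers that are not
 * being transmitted so that no stale values leak to the receiver. */
static void regs_prepare(uint32_t live[NUM_MSG_REGS],
                         uint32_t saved[NUM_MSG_REGS],
                         const uint32_t *msg, unsigned nSend)
{
    assert(nSend <= NUM_MSG_REGS);
    memcpy(saved, live, sizeof(uint32_t) * NUM_MSG_REGS);
    for (unsigned i = 0; i < NUM_MSG_REGS; i++)
        live[i] = (i < nSend) ? msg[i] : 0;
}

/* Put the saved values back once the invocation has completed. */
static void regs_restore(uint32_t live[NUM_MSG_REGS],
                         const uint32_t saved[NUM_MSG_REGS])
{
    memcpy(live, saved, sizeof(uint32_t) * NUM_MSG_REGS);
}
```

The save area is four words that are touched immediately before and after the trap, which is why the save and restore are expected to stay cache-resident.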

3. Impact on Register-Poor Architectures

The main problem with this change is the impact on register-poor architectures. Since the most widely deployed processor (x86) is register poor, this bears thinking on. Because the x86 is probably the most register-impoverished architecture, we will describe the implementation for the x86 and leave it to the reader to adapt to similarly deficient architectures.

The main complication is that those invocations that do not require a data buffer should be executable from a read-only address space. While this is not the common case, there is some bootstrap code in the EROS system that must be able to run in an immutable space to support certain confinement and security arguments. This means that any values that are mutated as a side-effect of the IPC (receive length, key data) must reside in registers. In addition, the opcode and return code should reside in registers, though it is acceptable for them to be mapped to the same register.

The x86 provides only eight user-mode registers: %EAX, %EBX, %ECX, %EDX, %ESI, %EDI, %EBP, and %ESP. With the exception of %ESP, all of these registers can be modified as part of the calling convention, and even %ESP can safely be used as a general-purpose register in an immutable address space. Modifying %EBP presents a challenge to debuggers. This is extremely unfortunate, but in practice the IPC trap is a readily identifiable leaf procedure, and can be recognized by the debugger as a special case (as a recovering debugger architect, I confess that this makes me shudder).

On the x86, the hardware will copy values off of the stack on entry into the kernel when performing a privilege crossing CALL instruction. This allows us to place the send data pointer, send data length, sent key slots, receive key slots, and receive data pointer on the stack. The remaining values are mapped to registers as follows:

%EAX .. %EDX: Copied registers. By convention, %EAX holds the opcode on entry and the result code on exit.
%EBP: Key register numbers of invoked key and received, assembled by concatenation:
 <zeros>:InvokedKey:rk3:rk2:rk1:rk0
Note that rk0 appears at the least significant bit position. Because the received keys of the sender are not used in the fast path, this layout has performance advantages. Once the register is saved to the save area it can be shifted in place.
%ESI: Send key registers, assembled by concatenation:
 <zeros>:sk3:sk2:sk1:sk0
On exit, this register holds the key info field of the invoked gate key (if any) or zero.
%EDI: Receive length limit on entry, actual length on exit.
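The concatenated layouts for %EBP and %ESI can be illustrated with a little shift-and-mask code. The text does not state how wide each key register number is, so the 4-bit field width below (KEY_BITS) is an assumption, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: each key register number fits in 4 bits. The text does
 * not give the field width; five fields must fit in 32 bits. */
#define KEY_BITS 4u
#define KEY_MASK ((1u << KEY_BITS) - 1u)

/* Pack four slot numbers as <zeros>:k3:k2:k1:k0, with k0 in the
 * least significant position (the %ESI layout for sent keys). */
static uint32_t pack_keys(const uint8_t k[4])
{
    uint32_t w = 0;
    for (int i = 3; i >= 0; i--) {
        assert(k[i] <= KEY_MASK);
        w = (w << KEY_BITS) | k[i];
    }
    return w;
}

/* Build the %EBP value: <zeros>:InvokedKey:rk3:rk2:rk1:rk0. */
static uint32_t pack_ebp(uint8_t invoked, const uint8_t rk[4])
{
    assert(invoked <= KEY_MASK);
    return ((uint32_t)invoked << (4u * KEY_BITS)) | pack_keys(rk);
}

/* One shift discards the receive slots and leaves the invoked key,
 * because everything above it is zero. */
static uint32_t ebp_invoked_key(uint32_t ebp)
{
    return ebp >> (4u * KEY_BITS);
}

/* Extract field i (0 = least significant) from a packed word. */
static uint32_t unpack_key(uint32_t w, unsigned i)
{
    return (w >> (i * KEY_BITS)) & KEY_MASK;
}
```

With the invoked key in the topmost non-zero field, the fast path needs a single shift and no mask to recover it, which is one way to read the remark about shifting the saved register in place.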

The invocation type (call, return, send) is determined by the kernel entry point. The stack pointer (%ESP) points to a structure of the form:

      struct InvInfo {
        Word invType;   /* IT_CALL, IT_SEND, or IT_RECEIVE */
        Byte *sndPtr;
        Word sndLen;
        Byte *rcvPtr;
      };
      

In an immutable address space, this data structure must be suitably pre-allocated, and the stack pointer must be set to point to the instance appropriate to the invocation in question. While the sndKeys and rcvKeys fields could be combined, the cost of the marginal shift and mask makes this a wash. Better to keep it simple.
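As an illustration of pre-allocation, the sketch below places one constant InvInfo per invocation site in read-only data. The struct is copied from above. The typedefs for Word and Byte, the numeric values of the IT_ constants, the site names, and the lookup helper are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t     Word;   /* ASSUMPTION: Word is a machine word */
typedef unsigned char Byte;

enum { IT_CALL, IT_SEND, IT_RECEIVE };   /* hypothetical values */

struct InvInfo {
    Word  invType;   /* IT_CALL, IT_SEND, or IT_RECEIVE */
    Byte *sndPtr;
    Word  sndLen;
    Byte *rcvPtr;
};

/* Hypothetical invocation sites in the bootstrap code. */
enum { SITE_ALLOC, SITE_NOTIFY, N_SITES };

/* One pre-built descriptor per site. None of them names a data
 * buffer, so nothing here is ever written and the whole table can
 * live in an immutable address space. */
static const struct InvInfo invInfoTable[N_SITES] = {
    [SITE_ALLOC]  = { IT_CALL, NULL, 0, NULL },
    [SITE_NOTIFY] = { IT_SEND, NULL, 0, NULL },
};

/* Address to load into %ESP before trapping from the given site. */
static const struct InvInfo *inv_info_for(unsigned site)
{
    return site < N_SITES ? &invInfoTable[site] : NULL;
}
```

Since every value the kernel mutates lives in a register, the descriptor itself is only ever read, which is what allows it to be constant.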

Originally, I thought it worthwhile to have a limit on the number of registers transferred or received so as to give the process at least one register that is not mutated by IPC. For the most part this is only a problem in immutable address spaces. If it proves necessary, the applicable limits should be appended to the InvInfo structure. In practice, it turned out that only the process construction/destruction code needs to do even as much as a loop structure while making calls. This loop can simply be unwound, so I decided to stick with the current interface and avoid the additional test in the IPC path.


Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License