Revised^2 IPC Design

This note describes the third iteration of the EROS IPC interface, which incorporates a design for scatter/gather implementation.

1. Motivation

The first round of serious benchmarking for EROS included a file emulation benchmark. These measurements were done without a fast IPC implementation. In spite of this, file emulation proved to be the only benchmark on which EROS performance was disappointing. Bulk data organization nonetheless remains a vital operation, and some analysis leads me to be concerned that similar problems will arise in the networking implementation as well:

 Benchmark              Linux (kernel 2.0.34)   EROS (Slow IPC)   Notes
 Create 0 Kbyte file     74.00 us                91.84 us         This is mostly due to the cost of the unoptimized IPC implementation.
 Create 10 Kbyte file   105.00 us               250.50 us         Due to combination of slow IPC and lack of scatter/gather.

The file system implementation must perform a bcopy operation internally to move the data from the receive buffer to its final resting place. To learn the cost of various operations, I turned off the internal bcopy and also the internal logic to determine the destination of the bcopy (i.e. the indirection block traversal). This let me get a breakdown of the timings for various file sizes. At 10k, the breakdown was:

 Basic IPC cost   Indirection block processing   Extra bcopy   Total
 137.86 us        34.38 us                       75.76 us      249.00 us

The marginal bcopy translates into memory bandwidth of 135 Mbytes/sec, which is close enough to L1 in-cache bandwidth that we can conclude the bcopy isn't broken. This also suggests that if the implementation does absolutely nothing else, we still can't afford to move the data twice (once across the boundary, once to final location) if matching Linux is even a goal. Setting aside the merits of Linux, an awful lot of programs out there manipulate files, so it's worthwhile for this to go fast.
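The 135 Mbytes/sec figure can be reproduced from the "Extra bcopy" column above. The sketch below assumes that "10 Kbyte" means 10240 bytes and that "Mbytes" means 10^6 bytes; the note states neither, and the helper name is hypothetical.

```c
/* Bandwidth implied by copying `bytes` bytes in `usec` microseconds,
 * expressed in units of 10^6 bytes per second.
 * bytes per microsecond is numerically equal to 10^6 bytes per second,
 * so no further scaling is required. */
double copy_bandwidth_mbytes_per_sec(double bytes, double usec)
{
    return bytes / usec;
}
```

Under those assumptions, `copy_bandwidth_mbytes_per_sec(10240.0, 75.76)` comes out at roughly 135, matching the figure quoted in the text.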

All of which pretty well dictates a scatter/gather IPC implementation.

Side observation: the need for a file-like abstraction doesn't go away just because EROS is persistent. There is still a need for an interchange format. Database programs probably get off the hook, but most applications do not.

2. Interface Revision for Scatter/Gather

While I plan to rearrange the x86 IPC layout again, the basic interface isn't changed in a major way by introducing scatter/gather. The common case is still a contiguous buffer -- especially on the sender side -- and it makes sense to avoid validating an indirection descriptor block when none is necessary.

Revised Interface

In the IPCv2 specification, the rcv len and snd len fields of the IPC descriptor are limited to a maximum value of 65536. The payload must be bounded to ensure that all of the data can be in memory at once, and in practical terms it's not useful to make it much larger than this. At 64k transfer sizes the limiting factor is the data cache, and at anything much above 32k it's cost-acceptable for the programs to set up a shared memory buffer.

Given all of which, I plan to redefine the rcv len, rcv ptr, snd len, and snd ptr fields as follows:

  • On entry, positive values of rcv len and snd len will continue to describe a contiguous region of received (transmitted) data. In this case, the rcv ptr and snd ptr fields point to the data exactly as they do now.

  • On entry, negative values of rcv len and snd len indicate that the rcv ptr (respectively: snd ptr) field points to a scatter (gather) descriptor block, and that the absolute value of the rcv len or snd len value is the number of bytes in the scatter (gather) descriptor block. If the length given is not a multiple of the descriptor size, the last (partial) descriptor will not be used. At most 64 descriptors can be included in the scatter/gather descriptor block.

  • On exit, rcv len will continue to hold the total number of bytes transferred.
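The sign convention in the list above can be sketched as a small decoding step. This is only an illustration: the type and function names are hypothetical, the 8-byte descriptor size is taken from the (length, pointer) layout described in this note, and treating a length of zero as an empty contiguous buffer is an assumption.

```c
#include <stdint.h>

#define SG_DESC_SIZE 8u  /* assumed: 4-byte length plus 4-byte pointer */

typedef struct {
    int      is_indirect;  /* nonzero: the ptr field names a descriptor block */
    uint32_t contig_len;   /* byte count, meaningful when is_indirect == 0 */
    uint32_t n_desc;       /* whole descriptors, meaningful when is_indirect != 0 */
} len_decode_t;

/* Interpret an on-entry rcv len or snd len value. */
len_decode_t decode_len(int32_t len)
{
    len_decode_t d = { 0, 0, 0 };

    if (len >= 0) {
        /* Contiguous case: ptr points directly at the data. */
        d.contig_len = (uint32_t)len;
    } else {
        /* Negate in 64 bits so that INT32_MIN cannot overflow. */
        uint64_t block_bytes = (uint64_t)(-(int64_t)len);

        d.is_indirect = 1;
        /* Integer division drops a trailing partial descriptor. */
        d.n_desc = (uint32_t)(block_bytes / SG_DESC_SIZE);
    }
    return d;
}
```

For example, `decode_len(-20)` reports an indirect transfer with two usable descriptors; the trailing four bytes are ignored. Enforcing the limit on the number of descriptors is left to the caller in this sketch.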

In the short term, I may choose to implement the scatter/gather logic only in the slow (i.e. general) IPC path. If you're using this logic at all you're almost certainly dominated by transfer cost, so optimizing the associated context switch seems a bit silly.

Scatter/Gather Descriptors

The scatter (gather) descriptor blocks consist of (length, pointer) pairs:

    +-----------------+  offset 0
    | length in bytes |
    +-----------------+  offset 4
    | pointer to data |
    +-----------------+  offset 8
    | length in bytes |
    +-----------------+  offset 12
    | pointer to data |
    +-----------------+
         ...
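
The diagram above might be written as a C structure along the following lines. The type and field names are hypothetical, and the 32-bit field widths are inferred from the 4-byte offsets in the diagram rather than taken from any EROS header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (length, pointer) pair of a scatter/gather descriptor block. */
typedef struct {
    uint32_t len;  /* length in bytes of this piece */
    uint32_t ptr;  /* 32-bit user address of the data */
} sg_desc_t;

/* Compile-time checks that the layout matches the diagram (C11). */
static_assert(offsetof(sg_desc_t, len) == 0, "length at offset 0");
static_assert(offsetof(sg_desc_t, ptr) == 4, "pointer at offset 4");
static_assert(sizeof(sg_desc_t) == 8, "next pair starts at offset 8");
```

A descriptor block is then simply an array of `sg_desc_t`, so the second pair begins at offset 8 as drawn.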
    	    

To limit the number of pages that must be in memory for the IPC to proceed, the total number of descriptor pairs may not exceed 32. This is substantially too large for comfort, as it allows up to 128 pages to be implicated on each side of a transfer. On the other hand, it appears to be necessary in support of certain applications.

Regardless of the lengths specified in the scatter/gather descriptor blocks, no more than 65536 bytes will be transferred in a single IPC operation.
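The 65536-byte cap can be illustrated with a user-level gather loop. This is a sketch only: it reads the rule as truncation at the cap (the note does not say whether an oversized request is truncated or rejected), it uses ordinary host pointers instead of 32-bit user addresses, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IPC_MAX_XFER 65536u  /* per-IPC payload limit stated in the text */

typedef struct {
    uint32_t    len;  /* length in bytes of this piece */
    const void *ptr;  /* host pointer; the real descriptor holds an address */
} gather_desc_t;

/* Copy the pieces named by desc[0..n_desc) into dst, stopping once
 * IPC_MAX_XFER bytes have been copied. dst must have room for
 * IPC_MAX_XFER bytes. Returns the number of bytes copied. */
size_t gather_copy(void *dst, const gather_desc_t *desc, size_t n_desc)
{
    unsigned char *out = dst;
    size_t total = 0;

    for (size_t i = 0; i < n_desc && total < IPC_MAX_XFER; i++) {
        size_t take = desc[i].len;

        /* Clamp the final piece so the running total never exceeds the cap. */
        if (take > IPC_MAX_XFER - total)
            take = IPC_MAX_XFER - total;

        memcpy(out + total, desc[i].ptr, take);
        total += take;
    }
    return total;
}
```

With three 40000-byte pieces, the loop copies the first piece whole, 25536 bytes of the second, and nothing from the third.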

It is the responsibility of the recipient to ensure that the receive scatter descriptor block, if any, does not overlap the receive buffer areas. If the scatter descriptor block overlaps any of the specified receive areas, no bytes will be received, and the recipient will receive an FC_ParmLack fault code.
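One way the overlap rule might be checked is shown below. This is a hedged sketch, not the kernel's actual logic: the function names are hypothetical, and `SKETCH_FC_PARMLACK` is a placeholder whose numeric value is invented here to stand in for the real FC_ParmLack code.

```c
#include <stdint.h>

#define SKETCH_FC_PARMLACK 1  /* placeholder value, not the real fault code */

typedef struct {
    uint32_t len;  /* length in bytes of the receive area */
    uint32_t ptr;  /* 32-bit address of the receive area */
} sg_desc_t;

/* Do the half-open ranges [a, a+alen) and [b, b+blen) intersect?
 * End addresses are computed in 64 bits to avoid 32-bit wraparound. */
static int ranges_overlap(uint32_t a, uint32_t alen, uint32_t b, uint32_t blen)
{
    uint64_t a_end = (uint64_t)a + alen;
    uint64_t b_end = (uint64_t)b + blen;

    return alen != 0 && blen != 0 && a < b_end && b < a_end;
}

/* block_addr is the address of the descriptor block itself, and desc is
 * a readable copy of its n_desc entries. Returns 0 if no receive area
 * overlaps the block, otherwise the placeholder fault code. */
int check_scatter_block(uint32_t block_addr, const sg_desc_t *desc,
                        uint32_t n_desc)
{
    uint32_t block_len = n_desc * (uint32_t)sizeof(sg_desc_t);

    for (uint32_t i = 0; i < n_desc; i++) {
        if (ranges_overlap(block_addr, block_len, desc[i].ptr, desc[i].len))
            return SKETCH_FC_PARMLACK;
    }
    return 0;
}
```

Performing the check before any byte is moved matches the stated behavior that no bytes are received when an overlap exists.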

3. Incidental Modifications: x86 Family

In the IPCv2 specification, I changed the interface specification so that several of the IPC parameters were passed on the stack. In doing so, I failed to take advantage of a number of opportunities that this presents.

Fast IPC aside, the practice in EROS has proven to be that real capability invocations use stub procedures. On the x86, for lack of registers, these procedures build an on-stack data structure describing the IPC. This structure is in turn passed to the IPC assembly stub routine, which selectively loads some of the values into memory. Once into the kernel, values are lifted off of the stack and placed in pseudo-registers.

For receive-related parameters, the pseudo-registers are necessary and appropriate, as it is desirable to avoid revalidating the IPC block every time a recipient is invoked. For send-related parameters, the pseudo-registers are well and truly silly; either the IPC will fail and be restarted, or it will proceed and these parameters will become dead values. The send descriptor block must be validated in any case, so we might as well take these parameters from their original locations on the stack.

Given which, I now plan to reconcentrate the send-side parameters on the stack and lift them directly from there. The revised IPC register layout on the x86 is as follows:

          Old IPC Interface                     New IPC Interface
 Parameter          Old location       Parameter          New location
 r0                 %eax               r0                 %eax
 r1                 %ebx               r1                 %ebx
 r2                 %ecx               r2                 %ecx
 r3                 %edx               r3                 %edx
 rcv ptr            12(%esp)           inv type           (%ebp)
 snd len            8(%esp)            invoked key        4(%ebp)
 snd ptr            4(%esp)            snd len            8(%ebp)
 inv type           (%esp)             snd ptr            12(%ebp)
 rcv keys           %ebp (lo)          snd keys           16(%ebp)
 snd keys           %esi               rcv len (in/out)   %esi
 key info           %esi               rcv ptr (in)       %edi
 invoked key        %ebp               rcv keys           28(%ebp) [rcvKeys]
 rcv len (in/out)   %edi               key info (out)     %edi

The layout shown here has the additional advantage that it does not clobber %ebp, which is useful in speeding up the IPC path.

An incidental benefit to this layout is that it allows me to realign the send/receive key fields on byte boundaries, making them a good bit easier to validate and extract.
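
The %ebp-relative slots in the new layout could be pictured as a C structure like the one below. This is a guess at a shape, not the actual EROS definition: the struct and field names are hypothetical, every field is assumed to be 4 bytes wide, and the table does not say what occupies offsets 20 through 27, so they appear here as an opaque gap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On-stack IPC block addressed through %ebp in the new layout. */
typedef struct {
    uint32_t inv_type;        /*  0(%ebp) */
    uint32_t invoked_key;     /*  4(%ebp) */
    int32_t  snd_len;         /*  8(%ebp); negative selects a gather block */
    uint32_t snd_ptr;         /* 12(%ebp) */
    uint32_t snd_keys;        /* 16(%ebp) */
    uint32_t unspecified[2];  /* 20..27: not described in the table */
    uint32_t rcv_keys;        /* 28(%ebp) */
} ipc_stack_block_t;

/* Compile-time checks against the offsets listed in the table (C11). */
static_assert(offsetof(ipc_stack_block_t, inv_type) == 0, "inv type");
static_assert(offsetof(ipc_stack_block_t, invoked_key) == 4, "invoked key");
static_assert(offsetof(ipc_stack_block_t, snd_len) == 8, "snd len");
static_assert(offsetof(ipc_stack_block_t, snd_ptr) == 12, "snd ptr");
static_assert(offsetof(ipc_stack_block_t, snd_keys) == 16, "snd keys");
static_assert(offsetof(ipc_stack_block_t, rcv_keys) == 28, "rcv keys");
```

Keeping the send-side fields contiguous in one block is what lets them be read in place from the stack instead of being copied into pseudo-registers.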


Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License