Message passing

Bryan Ford baford@schirf.cs.utah.edu
Tue, 13 Dec 94 12:22:40 MST


>Mach:
>
>If I understand things, Mach implements no upper bound on message data
>size,

Correct.  (Or, at any rate, the "upper bound" is _very_ big. :-) )

>and is prepared to do essentially arbitrary scatter/gather
>logic, all implemented by the kernel.

Actually, "scatter/gather" under traditional Mach IPC is quite limited:
there is a single "primary" message buffer that always gets copied eagerly,
and it may contain pointers to out-of-line memory regions
that always get virtual-copied (using COW).
When a message is received, the "primary message buffer" part of it
always overwrites some preexisting part of the receiver's address space,
but any out-of-line regions that the message contains
cause new address ranges in the receiver's address space to be allocated:
the receiver doesn't get to choose where they go,
and they can't overwrite existing data.
So for example you can't use "scatter/gather" to direct incoming data
into circular buffers like you can in QNX or L3.

On the other hand, the presentation/interface IPC system we're working on now
will support much more flexible scatter/gather support, so for purposes of
discussion let's assume the Mach "philosophy" allows arbitrary scatter/gather.

>Mach also has both by-value and
>by-reference transfer mechanisms, and these appear to be handled in
>the message system [would this be simplified if Mach had a system-wide
>way of naming memory objects?].

If you're referring to out-of-line memory regions as "pass-by-reference",
you're right, and it's currently a big kludge.

On the other hand, if you're referring to ports, note that port passing
in Mach is basically the same as key passing in KeyKOS.
(It's currently unnecessarily complicated in certain respects,
but we're working on fixing that.)

>It looks to me like a client app can
>construct a descriptor for an arbitrarily complicated message,
>potentially including overlapping buffers (semantics of this is
>curious, and I can't find where it is specified).

No, it's not "arbitrarily complicated", but it's "complicated enough"
for our purposes. :-)

>If there really is an unbounded number of resources involved in a
>message transfer, it strikes me that this opens us up to both
>potential deadlock (which get turned into out of memory faults) and
>denial of service attacks.

You'll have to be more specific on the exact problems you expect to see.
For one thing, you keep referring to the ability to pass messages
"atomically", but you haven't defined what "atomic" means.

In KeyKOS, I presume it might mean that a message transfer can't be
interrupted by a checkpoint operation.  But that doesn't apply directly
to Mach, because Mach doesn't do checkpointing.

In more general terms, maybe you mean that a message transfer can't
be interrupted by a page fault - you seem to make the assumption that
all message resources must be pinned while they are being transferred.
That doesn't apply in Mach - page faults occur all the time during
Mach message transfers.  It's even OK if such a page fault requires
that the kernel make an upcall to an untrusted external pager in order
to fetch the page, because only the relevant thread will be blocked
waiting for that page, not the entire kernel.  If the page-in fails
for some reason, the message transfer will fail cleanly.

Also, note that, _in general_ (not all the time though), message passing
in Mach is done by first copying the message from the sender into kernel
memory, and then copying the message out from the kernel to the receiver's
memory as a separate operation.  Thus, the "send side" of a message
transfer depends only on access to the sender's address space (and the
kernel's trusted address space), and the "receive side" only depends
on the receiver's address space.  A failed page fault on one side will
not adversely affect the other side.

Note, however, that I said "in general": for specific common cases -
e.g. simple, short messages - it's possible to arrange things so you
can copy directly from sender to receiver for best performance.
Traditional Mach doesn't do this, but we did in migrating threads RPC.
However, that's only a common-case optimization: in general, the
model is that you transfer messages in two stages.

Of course, KeyKOS's "no kernel memory allocation" design philosophy
at first seems to be incompatible with two-stage message transfer,
because of the temporary message buffer required in the middle.
However, the real issue here is accountability: the appropriate
user-space process(es) must be "charged" appropriately for any kernel
memory that they use, including the kernel memory consumed by that buffer.
Mach currently doesn't handle this well, but it could do so
and still retain the two-stage message transfer model.

Similarly, the KeyKOS kernel could be adapted to support transfer of
"large" messages in a Mach-like two-stage process.  For example, the
kernel might require any process that sends a "large" message to supply
to the kernel (explicitly or implicitly) a space bank key from which
the kernel can allocate nodes.  During the "send phase", the kernel
allocates nodes from that space bank and stuffs the contents of the
message into those nodes; during the "receive phase", the kernel
unpacks the nodes and copies the contents into the receiver's address
space.  The kernel could even keep the nodes chained together and attached
to the appropriate process's domain node so that message transfer could be
cleanly interrupted by page faults and checkpoints and such.

Of course, in KeyKOS you probably wouldn't want to add something like this
because it already has other standard facilities that can be easily used
to implement transfer of large messages.  I'm just pointing out that
Mach's two-stage large message model is not necessarily at odds with the
KeyKOS full-accountability model.

>I'm also aware that scatter/gather can be important.  I don't really
>understand how this relates to IPC, and I'ld like to.  Is the
>complexity of kernel-implemented scatter/gather really outweighed by
>the advantage?

QNX and L3 are the two OS's I know of that do heavy scatter/gather;
they each claim that it's "very important" for good performance, and
give a few obvious examples like dropping message data into ring buffers.
But since both of those OS's are proprietary, and I haven't seen any
in-depth papers on this issue, it's hard to glean much hard, practical
information on just how important or valuable it is.

However, our presentation/interface IPC system is designed to be able
to cover either situation as needed: scatter/gather can be supported,
but if neither side of an IPC connection needs it, then you don't
pay the price for it.  Determining just how useful scatter/gather is
will probably be one of our future projects.  Similarly with some even
more agressive IPC optimizations we're planning to try out under the
same model.

				Bryan