EROS retrospective -- Device memory interface
shapj@us.ibm.com
Sat, 25 Dec 1999 17:41:32 -0500
In KeyKOS, devices have one holder, and when large buffers are needed the
device driver presents a node containing pages to the device. These pages
are pinned by the device, and can be used for (e.g.) packet transfer. One
consequence is that holding such a device key conveys the authority to pin a
limited amount of storage.
In thinking about modern network architectures, I have come to the
conclusion that this is insufficient. First, there is no inherent reason
to believe that a single node of pages is sufficient. Expanding the size is
easily done on a per-device basis, so this should be considered a nit.
More compelling, however, is the problem that the memory contract between
the low-level driver and the upper-level driver is not a private contract
in modern networking implementations. In support of copy suppression, it
is now increasingly common for the upper-level driver to directly share
mappings into the packet buffer with the recipient processes, and for much
of the traditional network processing to be moved up the food chain into
the client. An example of a design to support this can be seen in
Zwaenepoel et al.'s "IO-Lite" system, where data streams in general are
implemented using memory sharing techniques, and the shared buffer manager
selectively provides mappings to clients that are consistent with their
protection domains.
Sharing all the way to the client means that the lower-level driver cannot
place its data at any likely location in the buffer, because there are
protection implications in the choice of placement. An enhanced interface
would provide the lowest-level driver with a packet filter. The filter
would embed knowledge of the protection contracts, and would be called by
the lowest-level driver to make the placement decision. The filter would
be written in a safe language and compiled for kernel
execution/interpretation by a trusted agent. Considering that the
complexity of the demultiplexing decision at this level is small, it is
reasonable to consider simply downloading such a filter into the kernel.
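
To make the placement decision concrete, here is a minimal sketch in C of what
such a filter might compute. Everything in it is an assumption made for
illustration: the names (filter_rule, placement_filter, filter_place), the idea
of one buffer region per protection domain, and the choice of the TCP/UDP
destination port as the demultiplexing key. It is not an EROS or KeyKOS
interface, and a real filter would be expressed in a safe language rather than
as native C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: packets for dst_port belong in buffer region `region`,
 * i.e. the region that is mapped into the owning client's protection domain. */
typedef struct {
    uint16_t dst_port;
    size_t   region;
} filter_rule;

/* Hypothetical filter: a rule table plus a fallback region that is private
 * to the upper-level driver and shared with no client. */
typedef struct {
    const filter_rule *rules;
    size_t             nrules;
    size_t             default_region;
} placement_filter;

/* Called by the lowest-level driver before it places a packet.
 * `hdr` points at a transport header in which bytes 2..3 hold the
 * destination port in network byte order (true of both TCP and UDP).
 * Returns the index of the region the packet should be placed in. */
size_t filter_place(const placement_filter *f, const uint8_t *hdr, size_t len)
{
    assert(f != NULL && hdr != NULL);

    /* Too short to classify: keep it away from any client. */
    if (len < 4)
        return f->default_region;

    uint16_t port = (uint16_t)((hdr[2] << 8) | hdr[3]);

    for (size_t i = 0; i < f->nrules; i++) {
        if (f->rules[i].dst_port == port)
            return f->rules[i].region;
    }
    return f->default_region;
}
```

The point of the sketch is that the decision is a bounded table lookup with a
safe default, which is the kind of small, analyzable computation that seems
reasonable to run inside the kernel.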
It also means that we must begin to consider the implications of checkpoint
inconsistency. KeyKOS/EROS generally assume that the behavior of device
drivers with respect to checkpointing is "outside the contract." This is
tolerable as long as the upper/lower driver contract is not exposed, but
becomes insupportable if other programs begin to see the buffer directly.
Consider an application TCP stack that wakes up after checkpoint believing
it is in a well-defined interim state in processing a packet, only to
discover that the underlying buffer was excluded from the checkpoint and is
therefore stale.
Shadow pages suggest themselves as a means to resolve this, but at some
point we must ask whether double buffering by all drivers is worth it. Can
anyone suggest a criterion for deciding? Note that it's an issue both for
input devices and for output devices. What matters is that the buffer gets
modified, not who modifies it.
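
For reference, the double buffering in question might look roughly like the
following C sketch. The structure and function names (shadowed_page,
shadow_fill, shadow_publish) are hypothetical, and a memcpy stands in for the
device's DMA; the assumption is that only the stable copy is visible to clients
and included in checkpoints, while the shadow copy is pinned and excluded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical double-buffered page. */
typedef struct {
    uint8_t stable[PAGE_SIZE];  /* seen by clients, covered by checkpoints */
    uint8_t shadow[PAGE_SIZE];  /* pinned device target, not checkpointed  */
    size_t  pending_len;        /* bytes written to shadow, not yet public */
    size_t  valid_len;          /* bytes currently valid in stable         */
} shadowed_page;

/* Stands in for the device writing into the pinned shadow page. */
void shadow_fill(shadowed_page *p, const uint8_t *data, size_t len)
{
    assert(p != NULL && data != NULL && len <= PAGE_SIZE);
    memcpy(p->shadow, data, len);
    p->pending_len = len;
}

/* The extra copy: make a completed transfer visible in the stable page.
 * This is the per-transfer cost that every driver would pay. */
void shadow_publish(shadowed_page *p)
{
    assert(p != NULL);
    memcpy(p->stable, p->shadow, p->pending_len);
    p->valid_len = p->pending_len;
    p->pending_len = 0;
}
```

The stable page changes only inside shadow_publish, so a checkpoint never
captures a half-written buffer, at the price of one copy per transfer.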
A possible design is to introduce a means of obtaining "real" pages, and
ensuring that their mappings disappear suitably on restart. The TCP stack
can then handle this with suitable exception handling logic. Opinions?
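
One way to picture this, purely as a sketch: tag each mapping of a "real" page
with the restart epoch in which it was created, and treat any access from a
later epoch as a fault. The names below (restart_epoch, real_page_map,
real_page_acquire, real_page_read, simulate_restart) are invented for this
example, and a boolean return value models what would in practice be a memory
fault delivered to the TCP stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter, advanced once each time the system restarts
 * from a checkpoint. */
static uint32_t restart_epoch = 1;

/* Hypothetical handle for a mapping of a non-checkpointed "real" page. */
typedef struct {
    uint8_t *base;
    size_t   len;
    uint32_t epoch;  /* epoch in which the mapping was established */
} real_page_map;

real_page_map real_page_acquire(uint8_t *backing, size_t len)
{
    assert(backing != NULL);
    real_page_map m = { backing, len, restart_epoch };
    return m;
}

/* Models a restart: every mapping made in an earlier epoch is now void. */
void simulate_restart(void)
{
    restart_epoch++;
}

/* Returns false where the real system would deliver a fault: either the
 * mapping predates the last restart or the offset is out of range. */
bool real_page_read(const real_page_map *m, size_t off, uint8_t *out)
{
    assert(m != NULL && out != NULL);
    if (m->epoch != restart_epoch || off >= m->len)
        return false;
    *out = m->base[off];
    return true;
}
```

A stack built on this would treat a false return as "the buffer is gone,
discard in-flight state and resynchronize" rather than trusting stale
contents after a restart.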
Jonathan S. Shapiro, Ph. D.
Research Staff Member
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 6576