Object Servers in Dionysix

Revision 0.2

1. Introduction

One of the more challenging issues in the design of Dionysix has been deciding what the mechanism should be for handling files, pipes, devices and the like. The principal difficulty lies in the need to support asynchronous signalling, which EROS really isn't designed to support.

For several months, we adopted the view that since I/O interruption is rare, the basic design should be synchronous. This was a dead end. Eventually, we settled on the asynchronous interface described here. The interface is designed to allow non-blocking I/O to complete with a single CALL/RETURN in the usual case, but supports both early abort of I/O's and callbacks for blocking I/O. The design grew out of the 9P protocol of Plan 9 (see: The Use of Name Spaces in Plan 9 and the Plan 9 Reference Manual).

9P lacks overt support for three features of UNIX:

  1. There is no support for any equivalent to a superuser who has the authority to act as any other user.
  2. There is no support for the setuid mechanism, which allows a program to operate with authorities different from the person running it.
  3. There is no direct support for a client program that operates with the simultaneous authority of two (or more) users (as in the setuid/setgid system calls).

The essential reason for these omissions is that the 9p server does not trust the client.

The omission of the superuser is probably a good thing. The UNIX superuser carries far more authority than is good for it, and as a result is often abused.

The omission of the setuid bit and the setuid/setgid mechanism is more problematic. The setuid bit is frequently used to ensure that generally accessible objects are accessed only through programs that maintain their consistency. In Plan 9, such enforcement policies can only be implemented by file servers, while we would like them to be implementable by users in general. For example, it is unclear to us how one would implement a revision control mechanism such as RCS as a general user.

Similarly, the omission of any mechanism for union authorities is significant. In spite of the issues raised by The Confused Deputy, it is sometimes useful to act with multiple authorities. A clean mechanism for doing so seems too useful to give up.

Given these problems, we have adopted a subset of the 9p protocol, dp (for Dionysix Protocol), which retains the facilities of UNIX within a protocol similar to 9p. This is the protocol used in the EROS POSIX environment. dp retains some important features of 9p:

  • dp can be implemented with a fixed amount of server memory per connection.
  • dp is asynchronous. While most I/O transactions can be completed in a single EROS call and return, the protocol does not require synchronous RPC.

Differences between dp and 9p include:

  • A single dp session can act on behalf of multiple authorities.
  • The dp server (at least in Version 0) trusts the client rather more than it does in 9p.
  • dp provides a mechanism for early abort of an I/O operation, the Tabort request. This is necessary in support of POSIX signals and other asynchronous signalling mechanisms.

These differences enable support of the full POSIX semantics on the client, provided that the client is suitably authorized to support setuid.

The version of the protocol described here is preliminary, and we expect that it will soon be discarded. In particular, we are dissatisfied with the need to trust the client, and believe that this is basically a bad idea. Still, this version of the protocol is sufficient to get us started.

2. The Protocol

A dp client holds a capability to the dp server. The capability encapsulates a session. Given this, there is no real need for session level authentication as in 9p. Possession of a key to the server is a necessary and sufficient condition to speak to that server. The design assumes that distinct clients will hold distinct keys to the server.

Clients and servers communicate using a byte-stream protocol rather than the EROS call/return model. This allows the server to operate asynchronously. Requests and responses are encoded into a sequence of bytes, and sent to the server using the call/return mechanism as a transport.

Each time the client invokes the object server, it sends an EROS message of the form:

    W Size of response buffer (must be greater than or equal to 4)
    B* Server requests, expressed as a stream of bytes.

The server, in its turn, responds with a message of the form:

    W Number of request bytes accepted
    B* Server responses, expressed as a byte stream.

Requests are byte-formatted sequences having the following components:

    request A 16 bit field indicating the request.
    tag A 16 bit field whose content is specified by the client, and will be contained in the reply message from the server.
    arguments Arguments specific to a particular request.

Each transmitted data buffer should be thought of as appended to the stream of messages already sent. Depending on how much space the server has available, it may elect to receive less than the full transmission. In this event, it is the client's responsibility to drain the response buffer and try again.

Most arguments have a fixed length, which is indicated next to the name of the argument in the specification. Some arguments are shown with the notation [*N], indicating a variable-length argument of length less than or equal to N. In the byte stream, such arguments are preceded by their byte count. If the maximum will fit within a single byte, the length is encoded in a single byte. Where the maximum will not fit within a single byte, it is encoded as a two byte size.

All numbers are encoded in the byte stream in little endian format, regardless of the encoding of the native hardware.
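
The encoding rules above can be sketched in a few lines of Python. This is only an illustration under our reading of the rules: the helper names encode_varlen and encode_request are hypothetical, and the Twalk layout is taken from the message list later in this note.

```python
import struct

def encode_varlen(data: bytes, max_len: int) -> bytes:
    """Encode a [*N] argument: a byte count followed by the bytes themselves."""
    if len(data) > max_len:
        raise ValueError("argument exceeds its maximum length")
    # One length byte if the maximum fits in a byte, otherwise two (little endian).
    if max_len <= 0xFF:
        prefix = struct.pack("<B", len(data))
    else:
        prefix = struct.pack("<H", len(data))
    return prefix + data

def encode_request(request: int, tag: int, args: bytes = b"") -> bytes:
    """Request number and tag are 16-bit little-endian fields, then the arguments."""
    return struct.pack("<HH", request, tag) + args

# Twalk (request 5): tag[2] cid[2] name[*255]
TWALK = 5
msg = encode_request(TWALK, tag=7,
                     args=struct.pack("<H", 3) + encode_varlen(b"etc", 255))
```

Here msg is ten bytes long: four bytes of request number and tag, two bytes of cid, one length byte, and the three bytes of the name.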

Most object servers respond to requests promptly. Some object types do not guarantee that reads and writes will complete without blocking. Read and write requests on such objects will eventually complete, but no timing guarantees are made by the server. A mechanism is provided to allow the server to signal completion of blocking I/O.

3. The Transport

The remainder of this note describes the protocol between the object server and the client. It describes how to set up and tear down a connection, the transmit and receive mechanisms, and the formats of each individual message.

3.1 Connection Setup and Teardown

The first step in establishing communication between client and server is for the client to invoke the server with a connect invocation. The connection is terminated by the disconnect invocation.

Connect (OC = 1)

Initializes a connection between client and object server.
Reply W If nonzero, indicates that the server implements asynchronous I/O. The client should be prepared to perform the IoWait invocation to learn when such I/O is completed, or arrange to do periodic transfers.
Result RC_OK Connection is established

Disconnect (OC = 2)

Causes the connection to be torn down. The client's key to the server is rescinded.
Result RC_OK Connection is torn down

3.2 The Byte Stream

Requests and responses between client and server are transmitted as a sequence of byte-formatted messages. The request and response byte streams are transmitted between client and server using the call and return mechanisms as a transport.

The basic invocation of this protocol is the transfer invocation, which moves bytes back and forth between client and server.

Transfer (OC = 3)

Transmits requests to the server and collects responses. The server may not accept all of the transmitted byte string, but is guaranteed to accept an integral number (possibly 0) of requests.
Request W The space available in the response buffer. It is preferable for this to be as large as possible. If the receive buffer size provided is less than 4, no requests will be accepted by the server.
B* Server requests, formatted as a byte string.
Response W The number of bytes accepted from the sent string. The server will accept requests only if there is available space in the output buffer to place the results, so a failure to accept requests generally means that the output buffer should be drained and the request should be reattempted.
B* Server replies, formatted as a byte string.
Result RC_OK Transmission succeeded, reply has emptied the response buffer.
RC_OK+1 Transmission succeeded, response buffer contained more responses than would fit in the response buffer provided by the client. Client should transmit again (typically with a null request string) to collect remaining responses after processing the responses received.
KT+2 Request format error. Either the caller did not provide for a returned acceptance count, or the transmitted message contained a malformed message sequence.
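
A client typically wraps the Transfer invocation in a loop that resends whatever the server did not accept and keeps collecting responses until nothing remains. The following Python sketch shows one way to do this. It rests on several assumptions: transfer is a hypothetical wrapper around the Transfer invocation that returns (result code, bytes accepted, response bytes), and RC_OK is taken to be 0, which this note does not state.

```python
RC_OK = 0            # assumed value; not defined in this note
RC_MORE = RC_OK + 1  # more responses remain queued on the server

def exchange(transfer, requests: bytes, bufsize: int = 4096) -> bytes:
    """Send every request byte and collect every available response byte.

    `transfer(bufsize, data)` is a hypothetical stand-in for the Transfer
    invocation, returning (result_code, bytes_accepted, response_bytes).
    """
    pending = requests
    responses = bytearray()
    while True:
        rc, accepted, reply = transfer(bufsize, pending)
        if rc not in (RC_OK, RC_MORE):
            raise IOError(f"transfer failed with result {rc}")
        if pending and accepted == 0 and not reply and rc == RC_OK:
            # Nothing accepted and nothing to drain: retrying would spin forever.
            raise IOError("server accepted no requests and returned no responses")
        pending = pending[accepted:]   # resend only what was not accepted
        responses += reply
        if not pending and rc == RC_OK:
            return bytes(responses)
```

Once all requests have been accepted, the loop continues with an empty request string until the server reports RC_OK, which matches the "transmit again with a null request string" advice above.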

IoWait (OC = 4)

Blocks until an asynchronous I/O completes and the client is not expecting a response. Used to indicate that the client should perform a transfer operation to obtain the result of the asynchronous I/O.

The usual mechanism for handling asynchronous I/O is for the client to establish a helper domain. The helper domain loops performing an IoWait invocation and then calling the client to inform it that a transfer operation should be performed.
Result RC_OK An asynchronous I/O has completed.
KT+1 Connection has been torn down.

The use of the IoWait invocation is optional. Clients may use periodic wakeup to accomplish the same result without an additional domain.
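
The helper domain's loop is simple enough to sketch directly. In the Python below, io_wait and notify_client are hypothetical callables standing in for the IoWait invocation and for the call back to the client, and RC_OK is assumed to be 0. Any other result (such as the torn-down result above) ends the loop.

```python
RC_OK = 0  # assumed value; not defined in this note

def helper_loop(io_wait, notify_client):
    """Body of a helper domain: wait for I/O completion, then prod the client.

    `io_wait` stands in for the IoWait invocation and returns its result code.
    `notify_client` tells the client to perform a transfer operation.
    """
    while True:
        rc = io_wait()
        if rc != RC_OK:
            return rc          # e.g. the connection has been torn down
        notify_client()
```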

4. The Protocol

Each dp message consists of a sequence of bytes containing the request number, a client-supplied tag, and the arguments to the request. Each message is described in detail below. One area that requires further specification is the possible error values that can be returned by reply messages.

Every message has a corresponding reply message. The reply message has a message number that is one higher than the request number, and contains the same tag as was specified by the client in the request. The tag, in effect, acts as a sequence number for outstanding requests. A message will either receive a corresponding reply with the identical tag or an Rerror response indicating that the request failed.
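
The pairing rule can be illustrated with a small client-side routine. This is a sketch only: the function name match_reply and the idea of keeping outstanding requests in a dictionary keyed by tag are our assumptions, not part of the protocol.

```python
import struct

RERROR = 0

def match_reply(outstanding: dict, reply: bytes):
    """Pair a reply with the outstanding request that carries the same tag.

    `outstanding` maps tag -> request number for requests still in flight.
    Returns (request_number, err), where err is None for a normal reply.
    """
    number, tag = struct.unpack_from("<HH", reply, 0)
    if tag not in outstanding:
        raise KeyError(f"no outstanding request with tag {tag}")
    request = outstanding.pop(tag)
    if number == RERROR:
        # Rerror replaces the normal reply: tag[2] err[2]
        (err,) = struct.unpack_from("<H", reply, 4)
        return request, err
    if number != request + 1:
        raise ValueError(f"reply {number} does not answer request {request}")
    return request, None
```

For example, with a Twalk (5) outstanding under tag 7, a reply numbered 6 with tag 7 is accepted as its Rwalk, while a reply numbered 0 with tag 7 is reported as an error for that same request.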

(0) Rerror tag[2] err[2]

This response is issued by the server to indicate that an error occurred on the request with the corresponding tag. This message replaces the normal response message. err is an integer (values to be defined) indicating the error that occurred. ename is a text representation of the corresponding error number.

(1) Tattach tag[2] cid[2] credentials[8] aname[*255]

Requests that the object server bind the root of the hierarchy named by aname to the given cid.

(2) Rattach tag[2] cid[2] uoid[8]

Response to the Tattach message. The returned cid should be the same as the cid named by the client. The uoid is an 8-byte field that is unique among all objects in this hierarchy.

(3) Tclone tag[2] oldcid[2] newcid[2]

Requests that the object server make a duplicate of oldcid in newcid. This message is usually issued prior to performing a walk operation.

(4) Rclone tag[2] cid[2]

Response to the Tclone message. The returned cid should be the same as the newcid named by the client.

(5) Twalk tag[2] cid[2] name[*255]

Performs a directory traversal. The cid supplied by the client must identify a directory. On completion, it identifies the component object named by name (if any).

(6) Rwalk tag[2] cid[2] uoid[8]

Response to the Twalk message. The returned cid should be the same as the cid named by the client. The uoid is an 8-byte field that is unique among all objects in this hierarchy.
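
Since Tclone usually precedes a walk, resolving a multi-component path amounts to one Tclone followed by one Twalk per component. The Python sketch below builds that request sequence. The function name walk_requests, the use of "/" as a separator, the UTF-8 encoding of names, and the consecutive tag numbering are all our assumptions.

```python
import struct

TCLONE, TWALK = 3, 5

def walk_requests(root_cid: int, new_cid: int, path: str, first_tag: int = 1):
    """Build the requests that resolve `path` from root_cid into new_cid.

    Returns a list of encoded messages: one Tclone, then one Twalk per
    path component. Tags are assigned consecutively from first_tag.
    """
    tag = first_tag
    # Tclone tag[2] oldcid[2] newcid[2]
    msgs = [struct.pack("<HHHH", TCLONE, tag, root_cid, new_cid)]
    for component in (c for c in path.split("/") if c):
        name = component.encode("utf-8")
        if len(name) > 255:
            raise ValueError("path component longer than 255 bytes")
        tag += 1
        # Twalk tag[2] cid[2] name[*255], the name preceded by one length byte
        msgs.append(struct.pack("<HHH", TWALK, tag, new_cid)
                    + bytes([len(name)]) + name)
    return msgs
```

The original cid is left untouched, so a failed walk costs the client only the clone, which it can then clunk.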

(7) Topen tag[2] cid[2] mode[1]

Opens the named cid for reading and/or writing, according to mode. Legal values for mode are:

    0 Read access
    1 Write access
    2 Read and write access
    3 Execute access

In addition, the following values can be or'd with the supplied mode, causing the indicated actions:

    0x10 Truncate. If the file is not append only, and the sender of the message has write permissions (in version 0 this is always true), the file is truncated. If the file is append-only, the open will succeed but the file will not be truncated.
    0x40 Remove on close. File should be removed when the cid is clunked, which requires write permission on the directory.
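
As an illustration, the mode byte can be assembled as follows. The constant names are borrowed from Plan 9 usage and are not defined by this note, and the helpers open_mode and topen are hypothetical.

```python
import struct

# Access values and flag bits from the lists above (names are illustrative).
OREAD, OWRITE, ORDWR, OEXEC = 0, 1, 2, 3
OTRUNC = 0x10
ORCLOSE = 0x40
TOPEN = 7

def open_mode(access: int, truncate: bool = False,
              remove_on_close: bool = False) -> int:
    """Combine one access value with the optional flag bits."""
    if access not in (OREAD, OWRITE, ORDWR, OEXEC):
        raise ValueError("access must be 0, 1, 2 or 3")
    mode = access
    if truncate:
        mode |= OTRUNC
    if remove_on_close:
        mode |= ORCLOSE
    return mode

def topen(tag: int, cid: int, mode: int) -> bytes:
    """Topen tag[2] cid[2] mode[1], all little endian."""
    return struct.pack("<HHHB", TOPEN, tag, cid, mode)
```

Opening for read and write with truncation, for instance, gives a mode of 0x12.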

(8) Ropen tag[2] cid[2] uoid[8]

Response to the Topen message. The returned cid should be the same as the cid named by the client. The uoid is an 8-byte field that is unique among all objects in this hierarchy.

(9) Tcreate tag[2] cid[2] perm[4] mode[1] name[*255]

Creates a new object or directory in the directory named by cid, and opens it for reading and/or writing, according to mode. Valid modes are as described in the open message.

The perm word indicates whether an object or directory is to be created and the permissions to be associated with the newly created object. It is composed of the bitwise OR of the following values:

    0x1 Others (not user or group) can execute this file.
    0x2 Others (not user or group) can write this file.
    0x4 Others (not user or group) can read this file.
    0x8 Members of the file's group (including the owner) can execute this file.
    0x10 Members of the file's group (including the owner) can write this file.
    0x20 Members of the file's group (including the owner) can read this file.
    0x40 The file's owner can execute this file.
    0x80 The file's owner can write this file.
    0x100 The file's owner can read this file.
    0x2000000 Object is "sticky".
    0x4000000 Object is setgid.
    0x8000000 Object is setuid.
    0x20000000 Created object requires exclusive access -- only one user can open the object at a time.
    0x40000000 Created object is append only.
    0x80000000 Created object is a directory.

Objects are initially created with the same group as their containing directory, and the user implied by the cid. If desired, the group can subsequently be changed via the wstat message.
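
The low nine permission bits follow the familiar UNIX rwxrwxrwx layout, so a perm word can be written either from the individual bits or from an octal constant. The Python sketch below composes two perm words and encodes a Tcreate request. The constant names and the tcreate helper are hypothetical; only the bit values and the field layout come from this note.

```python
import struct

# Permission bits from the list above (constant names are illustrative).
OTHER_X, OTHER_W, OTHER_R = 0x1, 0x2, 0x4
GROUP_X, GROUP_W, GROUP_R = 0x8, 0x10, 0x20
OWNER_X, OWNER_W, OWNER_R = 0x40, 0x80, 0x100
P_STICKY, P_SETGID, P_SETUID = 0x2000000, 0x4000000, 0x8000000
P_EXCL, P_APPEND, P_DIR = 0x20000000, 0x40000000, 0x80000000

TCREATE = 9

def tcreate(tag: int, cid: int, perm: int, mode: int, name: bytes) -> bytes:
    """Tcreate tag[2] cid[2] perm[4] mode[1] name[*255], all little endian."""
    if len(name) > 255:
        raise ValueError("name longer than 255 bytes")
    return (struct.pack("<HHHIB", TCREATE, tag, cid, perm, mode)
            + bytes([len(name)]) + name)

# An ordinary file readable by all and writable by its owner (rw-r--r--).
file_perm = OWNER_R | OWNER_W | GROUP_R | OTHER_R
# A directory with rwxr-xr-x permissions.
dir_perm = P_DIR | 0o755
```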

(10) Rcreate tag[2] cid[2] uoid[8]

Response to the Tcreate message. The returned cid should be the same as the cid named by the client. The uoid is an 8-byte field that is unique among all objects in this hierarchy.

(11) Tread tag[2] cid[2] offset[8] count[2]

Perform a read operation. The count must be <= 8192. The number of bytes returned may be less than the number requested if end of file is encountered during the read or the read is aborted.

(12) Rread tag[2] cid[2] eof[2] pad[1] data[*8192]

Response to the Tread request. The actual message will be padded to a two-byte boundary. If end of file was encountered, the eof field will be nonzero.

(13) Twrite tag[2] cid[2] offset[8] data[*8192]

Perform a write operation. The count must be <= 8192. The number of bytes written may be less than the number requested if the write is aborted by an abort message or the associated connection is lost (e.g. user hangs up the phone).

(14) Rwrite tag[2] cid[2] eof[2] count[2]

Response to the Twrite request. The actual message will be padded to a two-byte boundary. If end of file was encountered, the eof field will be nonzero.

(15) Tclunk tag[2] cid[2]

Tells the server that the indicated cid will no longer be used.

(16) Rclunk tag[2] cid[2]

Reassures the client that the server has in fact forgotten about cid.

(17) Tremove tag[2] cid[2]

Tells the server that the object named by cid should be removed. This requires write permission on the directory containing cid. Whether or not the object is removed, the remove operation implies a clunk of cid.

(18) Rremove tag[2] cid[2]

Informs the client that cid has indeed been clunked (and possibly removed).

(19) Tstat tag[2] cid[2]

Requests per-object information from the server.

(20) Rstat tag[2] cid[2] stat[44]

Returns the state structure of the object named by cid. The state structure contains the following information:

    uoid[8] The server's unique identifier for the object named by cid.
    size[8] The size in bytes of the object named by cid.
    mode[4] The mode field of the object named by cid.
    uid[4] The user identifier of the owner of the object identified by cid.
    gid[4] The group identifier of the group of the object identified by cid.
    atime[4] The last access time of the object named by cid.
    mtime[4] The last modification time of the object named by cid.
    ctime[4] Time of last change of the object named by cid.
    type[2] The type of the object named by cid. Not interpreted by the server.
    minor[2] The "minor number" of the object named by cid. Not interpreted by the server.
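
The field widths above add up to the 44 bytes given for stat[44], provided the fields are packed in the listed order with no padding. Under that assumption (which this note does not state outright), the structure can be packed and unpacked as in the following Python sketch. The names Stat, pack_stat and unpack_stat are hypothetical.

```python
import struct
from collections import namedtuple

# Field order follows the list above; packing without padding is assumed.
Stat = namedtuple("Stat", "uoid size mode uid gid atime mtime ctime type minor")

# Little endian: two 8-byte fields, six 4-byte fields, two 2-byte fields.
STAT_FORMAT = struct.Struct("<QQIIIIIIHH")

def pack_stat(st: Stat) -> bytes:
    """Encode a Stat tuple as the 44-byte stat argument."""
    return STAT_FORMAT.pack(*st)

def unpack_stat(data: bytes) -> Stat:
    """Decode a 44-byte stat argument into a Stat tuple."""
    return Stat(*STAT_FORMAT.unpack(data))
```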

(21) Twstat tag[2] cid[2] stat[44]

Rewrites some of the fields of the object's state structure. The atime, mtime, and directory bit of the mode field cannot be revised.

(22) Rwstat tag[2] cid[2]

Indicates that the object's state structure has been updated.

(23) Tflush tag[2] oldtag[2]

Causes the previous request named by oldtag to be canceled if it has not already completed. Indicates that the client does not wish to receive a reply for the associated operation, and will ignore any replies that may already have been generated. Flush operations are always processed promptly.

(24) Rflush tag[2]

Indicates that the flush operation has completed.

(25) Tabort tag[2] oldtag[2]

Causes the previous read or write request named by oldtag to be aborted prematurely if it has not already completed. If the operation is still in progress, it will be terminated, and an Rwrite or Rread indicating a short write or read will be generated. The effect of an abort operation is always achieved promptly.

(26) Rabort tag[2]

Indicates that the abort operation has completed.

(27) Tcredclone tag[2] oldcid[2] credentials[8] newcid[2]

A clone of the object identified by oldcid is created, having the associated credentials, and named newcid. The new cid is not open, and does not necessarily convey authority sufficient to let the new user do anything with it.

(28) Rcredclone tag[2] cid[2]

Response to the Tcredclone message. The returned cid should be the same as the newcid named by the client.

5. Usage Notes and Issues

One issue that is not overtly addressed by this protocol is the implementation of multiple groups. The dp open message checks only one user and group at a time.

The effect of multiple groups can be achieved by issuing a series of clone and open messages, passing different credentials for each open message. The file may end up multiply open. The server can then clunk the cids that it doesn't use.
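
One possible reading of this procedure is sketched below in Python. It assumes that each set of credentials is attached with a Tcredclone (since Topen itself carries no credentials field) and that the resulting cid is then opened. The function names and the consecutive cid and tag numbering are hypothetical.

```python
import struct

TOPEN, TCLUNK, TCREDCLONE = 7, 15, 27

def credentialed_opens(oldcid, credentials, first_cid, mode, first_tag=1):
    """Build one Tcredclone/Topen pair per 8-byte credential.

    Returns (messages, new_cids). The caller keeps whichever cid opens
    successfully and clunks the rest.
    """
    msgs, cids = [], []
    tag = first_tag
    for i, cred in enumerate(credentials):
        if len(cred) != 8:
            raise ValueError("credentials must be exactly 8 bytes")
        newcid = first_cid + i
        # Tcredclone tag[2] oldcid[2] credentials[8] newcid[2]
        msgs.append(struct.pack("<HHH", TCREDCLONE, tag, oldcid)
                    + cred + struct.pack("<H", newcid))
        # Topen tag[2] cid[2] mode[1]
        msgs.append(struct.pack("<HHHB", TOPEN, tag + 1, newcid, mode))
        tag += 2
        cids.append(newcid)
    return msgs, cids

def tclunk(tag: int, cid: int) -> bytes:
    """Tclunk tag[2] cid[2], used to release a cid that is not needed."""
    return struct.pack("<HHH", TCLUNK, tag, cid)
```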

The current protocol does not support hard links. Hard links are a bad idea, inducing a rich variety of puzzling behaviors for file system consistency checkers, and I don't really want to support them if I can avoid it. It may be unavoidable.

5.1 Pipes

Another issue that is not obviously addressed is how to deal with pipes and other objects that are really one object with multiple connection points. The general idea is that the pipe server simply serves up an infinite number of pipe hierarchies that are all named clone. Opening a clone file has the effect of implicitly creating a new hierarchy rather than opening the existing one. Each hierarchy is a directory supporting two names (e.g. 0 and 1) corresponding to the end points of the pipe.

Fill me in!

6. Background

The design of the Dionysix name space mechanism has proven challenging. A number of options were debated along the way:

  • One domain per disk file.
  • One domain per open disk file.
  • One domain per file system.

The "one domain per file" approach was used by KeyNIX (see: The KeyKOS Nanokernel® Architecture). While this approach is probably simplest, each file domain requires at least one page for a stack. On my Linux system at home, the breakdown of files is as follows:

    Count     Description            Percentage
    31,071    All files              100.0%
    18,789    Files <= 4096 bytes     60.5%
    13,134    Files <= 2048 bytes     42.3%
     5,301    Files <= 1024 bytes     17.0%

Given this, the one domain per file design seems prohibitive: taking a page to be 4096 bytes, the stack alone would be at least as large as the file itself for the majority of files.

The "one domain per open file" approach is more promising. The principal problem with this approach lies in the need for reference counting. If a UNIX server manages to die without closing its currently open files, the open file objects could last forever. In the end, these dangling open files represent a serious challenge for robustness.

The final design, one domain per file system, is the one we have adopted. File systems should not go away when the referencing UNIX server dies, so the dangling reference problem does not arise. In addition, systems like Plan 9 and NFS have adopted this approach successfully, so there is some reason to be optimistic that it will work correctly.


Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License