Object Servers in Dionysix
Revision 0.2
1. Introduction
One of the more challenging issues in the design of Dionysix has
been deciding what the mechanism should be for handling files, pipes,
devices and the like. The principal difficulty lies in the need to
support asynchronous signalling, which EROS really isn't designed to
support.
For several months, we adopted the view that since I/O
interruption is rare, the basic design should be synchronous. This
was a dead end. Eventually, we settled on the asynchronous interface
described here. The interface is designed to allow non-blocking I/O
to complete with a single CALL/RETURN in the usual case, but supports
both early abort of I/Os and callbacks for blocking I/O. The design
grew out of the 9P protocol of Plan 9 (see: The Use of
Name Spaces in Plan 9 and the Plan 9 Reference
Manual).
9P lacks overt support for three features of UNIX:
- There is no support for any equivalent to a
superuser who has the authority to act as any
other user.
- There is no support for the setuid mechanism,
which allows a program to operate with authorities different
from the person running it.
- There is no direct support for a client program that operates
with the simultaneous authority of two (or more) users (as in
the setuid/setgid system calls).
The essential reason for these omissions is that the 9p server
does not trust the client.
The omission of the superuser is probably a good thing. The UNIX
superuser carries far more authority than is good for it, and as a
result is often abused.
The omission of the setuid bit and the setuid/setgid mechanism is
more problematic. The setuid bit is frequently used to ensure that
generally accessible objects are accessed only through programs that
maintain their consistency. In Plan 9, such enforcement policies can
only be implemented by file servers, while we would like them to be
implementable by users in general. For example, it is unclear to us
how one would implement a revision control mechanism such as RCS as a
general user.
Similarly, the omission of any mechanism for union authorities is
significant. In spite of the issues raised by The
Confused Deputy, it is sometimes useful to act with multiple
authorities. A clean mechanism for doing so seems too useful to give
up.
Given these problems, we have adopted a subset of the
9p protocol, dp (for
Dionysix Protocol), which retains
the facilities of UNIX within a protocol similar to
9p. This is the protocol used in the EROS POSIX
environment. dp retains some important features of
9p:
- dp can be implemented with a fixed amount of
server memory per connection.
- dp is asynchronous. While most I/O
transactions can be completed in a single EROS
call and return, the protocol
does not require synchronous RPC.
Differences between dp and 9p
include:
- A single dp session can act on behalf of
multiple authorities.
- The dp server (at least in Version 0) trusts
the client rather more than it does in 9p.
- dp provides a mechanism for early abort of an
I/O operation, the Tabort request. This is
necessary in support of POSIX signals and other asynchronous
signalling mechanisms.
These differences enable support of the full POSIX semantics on the
client, provided that the client is suitably authorized to support
setuid.
The version of the protocol described here is preliminary, and we
expect it will soon be discarded. In particular, we are dissatisfied
with the need to trust the client, and believe that this is basically
a bad idea. Still, this version of the protocol is sufficient to get
us started.
2. The Protocol
A dp client holds a capability to the
dp server. The capability encapsulates a session.
Given this, there is no real need for session level authentication as
in 9p. Possession of a key to the server is a
necessary and sufficient condition to speak to that server. The
design assumes that distinct clients will hold distinct keys to the
server.
Clients and Servers communicate using a byte-stream protocol
rather than the EROS call/return model. This allows the server to
operate asynchronously. Requests and responses are encoded into a
sequence of bytes, and sent to the server using the call/return
mechanism as a transport.
Each time the client invokes the object server, it sends an EROS
message of the form:
| W | Size of response buffer (must be greater than or equal to 4) |
| B* | Server requests, expressed as a stream of bytes. |
The server, in its turn, responds with a message of the form:
| W | Number of request bytes accepted |
| B* | Server responses, expressed as a byte stream. |
Requests are byte-formatted sequences having the following components:
| request | A 16 bit field indicating the request. |
| tag | A 16 bit field whose content is specified by the client, and will be contained in the reply message from the server. |
| arguments | Arguments specific to a particular request. |
Each transmitted data buffer should be thought of as appended to
the stream of messages already sent. Depending on how much space the
server has available, it may elect to receive less than the full
transmission. In this event, it is the client's responsibility to
drain the response buffer and try again.
Most arguments have a fixed length, which is indicated next to the
name of the argument in the specification. Some arguments are shown
with the notation [*N],
indicating a variable-length argument of length less than or equal to
N. In the byte stream, such arguments are preceded by their
byte count. If the maximum will fit within a single byte, the length
is encoded in a single byte. Where the maximum will not fit within a
single byte, it is encoded as a two byte size.
All numbers are encoded in the byte stream in little
endian format, regardless of the encoding of the native hardware.
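The encoding rules above can be illustrated with a minimal Python sketch. The helper names `encode_var` and `encode_request` are hypothetical and not part of dp; the sketch only assumes what the text states: a 16-bit request field, a 16-bit tag, little-endian numbers, and a one- or two-byte count ahead of each variable-length argument.

```python
import struct

def encode_var(data: bytes, max_len: int) -> bytes:
    """Encode a [*N] argument: a byte count followed by the bytes."""
    if len(data) > max_len:
        raise ValueError("argument longer than its declared maximum")
    # One-byte count when the maximum fits in a byte, else a two-byte count.
    fmt = "<B" if max_len <= 0xFF else "<H"
    return struct.pack(fmt, len(data)) + data

def encode_request(request: int, tag: int, args: bytes = b"") -> bytes:
    """Prefix the argument bytes with the 16-bit request and tag fields."""
    return struct.pack("<HH", request, tag) + args

# Twalk (request 5) carries cid[2] and name[*255].
TWALK = 5
msg = encode_request(TWALK, tag=7,
                     args=struct.pack("<H", 3) + encode_var(b"etc", 255))
print(msg.hex())
```

Because the maximum of `name[*255]` fits in one byte, its count occupies a single byte; an argument such as `data[*8192]` would carry a two-byte count instead.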
Most object servers respond to requests promptly. Some object
types do not guarantee that reads and writes will complete without
blocking. Read and write requests on such objects will eventually
complete, but no timing guarantees are made by the server. A
mechanism is provided to allow the server to signal completion of
blocking I/O.
3 The Transport
The remainder of this note describes the protocol between the
object server and the client. It describes how to set up and tear
down a connection, the transmit and receive mechanisms, and the
formats of each individual message.
3.1 Connection Setup and Teardown
The first step in establishing communication between client and
server is for the client to invoke the server with a
connect invocation. The connection is terminated by
the disconnect invocation.
- Connect (OC = 1)
- Initializes a connection between client and object server.
| Reply | W | If nonzero, indicates that the server implements asynchronous I/O. The client should be prepared to perform the IoWait invocation to learn when such I/O is completed, or arrange to do periodic transfers. |
| Result | RC_OK | Connection is established |
- Disconnect (OC = 2)
- Causes the connection to be torn down. The client's key to the server is rescinded.
| Result | RC_OK | Connection is torn down |
3.2 The Byte Stream
Requests and responses between client and server are transmitted
as a sequence of byte-formatted messages. The request and response
byte streams are transmitted between client and server using the call
and return mechanisms as a transport.
The basic invocation of this protocol is the
transfer invocation, which moves bytes back and forth
between client and server.
- Transfer (OC = 3)
- Transmits requests to the server and collects responses. The server may not accept all of the transmitted byte string, but is guaranteed to accept an integral number (possibly 0) of requests.
| Request | W | The space available in the response buffer. It is preferable for this to be as large as possible. If the receive buffer size provided is less than 4, no requests will be accepted by the server. |
| | B* | Server requests, formatted as a byte string. |
| Response | W | The number of bytes accepted from the sent string. The server will accept requests only if there is available space in the output buffer to place the results, so a failure to accept requests generally means that the output buffer should be drained and the request should be reattempted. |
| | B* | Server replies, formatted as a byte string. |
| Result | RC_OK | Transmission succeeded, reply has emptied the response buffer. |
| | RC_OK+1 | Transmission succeeded, response buffer contained more responses than would fit in the response buffer provided by the client. Client should transmit again (typically with a null request string) to collect remaining responses after processing the responses received. |
| | KT+2 | Request format error. Either the caller did not provide for a returned acceptance count, or the transmitted message contained a malformed message sequence. |
- IoWait (OC = 4)
- Blocks until an asynchronous I/O completes and the client is not expecting a response. Used to indicate that the client should perform a transfer operation to obtain the result of the asynchronous I/O.
The usual mechanism for handling asynchronous I/O is for the client to establish a helper domain. The helper domain loops performing an IoWait invocation and then calling the client to inform it that a transfer operation should be performed.
| Result | RC_OK | An asynchronous I/O has completed. |
| | KT+1 | Connection has been torn down. |
The use of the IoWait invocation is optional.
Clients may use periodic wakeup to accomplish the same result without
an additional domain.
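The client side of the transfer invocation can be sketched as a loop that resends unaccepted request bytes and keeps draining responses until the server reports an empty response buffer. This is a minimal Python sketch under stated assumptions: `run_transfers`, `ToyServer`, and the numeric value chosen for `RC_OK` are hypothetical, and the toy server merely echoes fixed 4-byte requests so that the loop has something to talk to.

```python
RC_OK = 0            # assumed value; this note gives no numeric result codes
RC_MORE = RC_OK + 1  # more responses remain on the server

def run_transfers(transfer, requests: bytes, bufsize: int = 4096) -> bytes:
    """Send all request bytes and collect every response byte.

    transfer(bufsize, data) stands in for the Transfer invocation and
    returns (result, accepted, responses).
    """
    pending = requests
    collected = bytearray()
    while True:
        result, accepted, responses = transfer(bufsize, pending)
        if result not in (RC_OK, RC_MORE):
            raise RuntimeError(f"transfer failed with result {result}")
        if pending and accepted == 0 and not responses:
            raise RuntimeError("no progress: nothing accepted, nothing to drain")
        pending = pending[accepted:]   # unaccepted bytes are sent again
        collected += responses         # draining frees server output space
        if not pending and result == RC_OK:
            return bytes(collected)

class ToyServer:
    """Stand-in server: every 4-byte request yields a 4-byte reply."""
    def __init__(self, out_capacity: int = 8):
        self.out = bytearray()
        self.out_capacity = out_capacity

    def transfer(self, bufsize, data):
        accepted = 0
        if bufsize >= 4:
            # Accept whole requests only while reply space remains.
            while (accepted + 4 <= len(data)
                   and len(self.out) + 4 <= self.out_capacity):
                self.out += data[accepted:accepted + 4]
                accepted += 4
        n = min(bufsize, len(self.out)) if bufsize >= 4 else 0
        n -= n % 4                     # hand back whole replies only
        responses, self.out = bytes(self.out[:n]), self.out[n:]
        return (RC_MORE if self.out else RC_OK), accepted, responses

server = ToyServer()
replies = run_transfers(server.transfer, bytes(range(12)), bufsize=4)
print(len(replies))
```

With a 4-byte response buffer the loop needs three invocations, the last one carrying a null request string purely to collect the remaining reply.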
4 The Protocol
Each dp message consists of a sequence of bytes
containing the request number, a client-supplied tag, and the
arguments to the request. Each message is described in detail below.
One area that requires further specification is the possible error
values that can be returned by reply messages.
Every message has a corresponding reply message. The reply
message has a message number that is one higher than the request
number, and contains the same tag as was specified by the client in
the request. The tag, in effect, acts as a sequence number for
outstanding requests. A message will either receive a corresponding
reply with the identical tag or an Rerror response indicating that the
request failed.
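The pairing of requests and replies by tag can be sketched as a small table of outstanding requests. This Python sketch is illustrative only: the `Outstanding` class is hypothetical, and it assumes that a reply begins with the same 16-bit message number and 16-bit tag layout that the note gives for requests.

```python
import struct

RERROR = 0  # message number of the Rerror response

class Outstanding:
    """Tracks requests awaiting a reply, keyed by client-chosen tag."""

    def __init__(self):
        self.pending = {}  # tag -> request number

    def sent(self, request: int, tag: int) -> None:
        if tag in self.pending:
            raise ValueError(f"tag {tag} is already in use")
        self.pending[tag] = request

    def received(self, reply: bytes):
        """Classify a reply from its first four bytes (number, tag)."""
        rtype, tag = struct.unpack("<HH", reply[:4])
        request = self.pending.pop(tag)  # KeyError for an unknown tag
        if rtype == RERROR:
            return ("error", request, tag)
        if rtype != request + 1:
            raise ValueError(f"reply {rtype} does not answer request {request}")
        return ("ok", request, tag)

table = Outstanding()
table.sent(11, tag=1)   # Tread
table.sent(13, tag=2)   # Twrite
print(table.received(struct.pack("<HH", 14, 2)))     # Rwrite for tag 2
print(table.received(struct.pack("<HHH", 0, 1, 5)))  # Rerror for tag 1
```

Replies may arrive in any order, which is why the lookup is by tag rather than by position in the stream.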
- (0) Rerror tag[2] err[2]
- This response is issued by the server to indicate that an error
occurred on the request with the corresponding tag. This
message replaces the normal response message. err is
an integer (values to be defined) indicating the error that
occurred. ename is a text representation of the
corresponding error number
- (1) Tattach tag[2] cid[2]
credentials[8] aname[*255]
-
Requests that the object server bind the root of the hierarchy
named by aname to the given cid.
- (2) Rattach tag[2] cid[2]
uoid[8]
-
Response to the Tattach message. The returned cid
should be the same as the cid named by the client.
The uoid is an 8-byte field that is unique among all
objects in this hierarchy.
- (3) Tclone tag[2] oldcid[2]
newcid[2]
-
Requests that the object server make a duplicate of
oldcid in newcid. This message is usually
issued prior to performing a walk operation.
- (4) Rclone tag[2] cid[2]
-
Response to the Tclone message. The returned cid
should be the same as the newcid named by the client.
- (5) Twalk tag[2] cid[2]
name[*255]
-
Performs a directory traversal. The cid supplied
by the client must identify a directory. On completion, it
identifies the component object named by name (if
any).
- (6) Rwalk tag[2] cid[2]
uoid[8]
-
Response to the Twalk message. The returned cid
should be the same as the cid named by the client.
The uoid is an 8-byte field that is unique among all
objects in this hierarchy.
- (7) Topen tag[2] cid[2]
mode[1]
-
Opens the named cid for reading and/or writing,
according to mode. Legal values for mode
are:
| 0 | Read access |
| 1 | Write access |
| 2 | Read and write access |
| 3 | Execute access |
In addition, the following values can be or'd with the
supplied mode, causing the indicated actions:
| 0x10 | Truncate. If the file is not append only, and the sender of the message has write permissions (in version 0 this is always true), the file is truncated. If the file is append-only, the open will succeed but the file will not be truncated. |
| 0x40 | Remove on close. File should be removed when the cid is clunked, which requires write permission on the directory. |
- (8) Ropen tag[2] cid[2]
uoid[8]
-
Response to the Topen message. The returned cid
should be the same as the cid named by the client.
The uoid is an 8-byte field that is unique among all
objects in this hierarchy.
- (9) Tcreate tag[2] cid[2]
perm[4] mode[1] name[*255]
-
Creates a new object or directory in the directory named by
cid, opening it for reading and/or writing according to
mode. Valid modes are as described in the
open message.
The perm word indicates whether an object or directory is
to be created and the permissions to be associated with the
newly created object. It is composed of the bitwise
or of the following values:
| 0x1 | Others (not user or group) can execute this file. |
| 0x2 | Others (not user or group) can write this file. |
| 0x4 | Others (not user or group) can read this file. |
| 0x8 | Members of the file's group (including the owner) can execute this file. |
| 0x10 | Members of the file's group (including the owner) can write this file. |
| 0x20 | Members of the file's group (including the owner) can read this file. |
| 0x40 | The file's owner can execute this file. |
| 0x80 | The file's owner can write this file. |
| 0x100 | The file's owner can read this file. |
| 0x2000000 | Object is "sticky". |
| 0x4000000 | Object is setgid. |
| 0x8000000 | Object is setuid. |
| 0x20000000 | Created object requires exclusive access -- only one user can open the object at a time. |
| 0x40000000 | Created object is append only. |
| 0x80000000 | Created object is a directory. |
Objects are initially created with the same group as their
containing directory, and the user implied by the cid.
If desired, the group can subsequently be changed via the
wstat message.
- (10) Rcreate tag[2] cid[2]
uoid[8]
-
Response to the Tcreate message. The returned cid
should be the same as the cid named by the client.
The uoid is an 8-byte field that is unique among all
objects in this hierarchy.
- (11) Tread tag[2] cid[2]
offset[8] count[2]
-
Perform a read operation. The count must be <= 8192.
The number of bytes returned may be less than the number
requested if end of file is encountered during the read or the
read is aborted.
- (12) Rread tag[2] cid[2]
eof[2] pad[1]
data[*8192]
-
Response to the Tread request. The actual message will be
padded to a two-byte boundary. If end of file was encountered,
the eof field will be nonzero.
- (13) Twrite tag[2] cid[2]
offset[8] data[*8192]
-
Perform a write operation. The count must be <= 8192.
The number of bytes written may be less than the number
requested if the write is aborted by an abort message or the
associated connection is lost (e.g. user hangs up the phone).
- (14) Rwrite tag[2] cid[2]
eof[2] count[2]
-
Response to the Twrite request. The actual message will be
padded to a two-byte boundary. If end of file was encountered,
the eof field will be nonzero.
- (15) Tclunk tag[2] cid[2]
-
Tells the server that the indicated cid will no
longer be used.
- (16) Rclunk tag[2] cid[2]
-
Reassures the client that the server has in fact forgotten
about cid.
- (17) Tremove tag[2] cid[2]
-
Tells the server that the object named by cid
should be removed. This requires write permission on the
directory containing cid. Whether or not the object
is removed, the remove operation implies a
clunk of cid.
- (18) Rremove tag[2] cid[2]
-
Informs the client that cid has indeed been
clunked (and possibly removed).
- (19) Tstat tag[2] cid[2]
-
Requests per-object information from the server.
- (20) Rstat tag[2] cid[2]
stat[44]
-
Returns the state structure of the object named by cid.
The state structure contains the following information:
| uoid[8] | The server's unique identifier for the object named by cid. |
| size[8] | The size in bytes of the object named by cid. |
| mode[4] | The mode field of the object named by cid. |
| uid[4] | The user identifier of the owner of the object identified by cid. |
| gid[4] | The group identifier of the group of the object identified by cid. |
| atime[4] | The last access time of the object named by cid. |
| mtime[4] | The last modification time of the object named by cid. |
| ctime[4] | Time of last change of the object named by cid. |
| type[2] | The type of the object named by cid. Not interpreted by the server. |
| minor[2] | The "minor number" of the object named by cid. Not interpreted by the server. |
- (21) Twstat tag[2] cid[2]
stat[44]
-
Rewrites some of the fields of the object's state
structure. The atime, mtime, and directory
bit of the mode field cannot be revised.
- (22) Rwstat tag[2] cid[2]
-
Indicates that the object's state structure has been
updated.
- (23) Tflush tag[2] oldtag[2]
-
Causes the previous request named by oldtag to be
canceled if it has not already completed. Indicates that the
client does not wish to receive a reply for the associated
operation, and will ignore any replies that may already have
been generated. Flush operations are always processed
promptly.
- (24) Rflush tag[2]
-
Indicates that the flush operation has completed.
- (25) Tabort tag[2] oldtag[2]
-
Causes the previous read or write request named by
oldtag to be aborted prematurely if it has not already
completed. If the operation is still in progress, it will be
terminated, and an Rwrite or
Rread indicating a short read will be
generated. The effect of an abort operation is always achieved
promptly.
- (26) Rabort tag[2]
-
Indicates that the abort operation has completed.
- (27) Tcredclone tag[2]
oldcid[2] credentials[8] newcid[2]
-
A clone of the object identified by oldcid is
created, having the associated credentials, and named
newcid. The new cid is not open, and does
not necessarily convey authority sufficient to let the new user
do anything with it.
- (28) Rcredclone tag[2]
cid[2]
-
Response to the Tcredclone message. The returned
cid should be the same as the newcid named by
the client.
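As an illustration of the fixed-size fields used by these messages, the 44-byte state structure carried by Rstat and Twstat can be packed and unpacked as sketched below in Python. The names `Stat`, `pack_stat`, and `unpack_stat` are hypothetical. The sketch assumes that all fields are unsigned and appear in the order listed under Rstat, and that the mode field of the state structure uses the same bit values as the permission word of Tcreate.

```python
import struct
from collections import namedtuple

# Field order and widths follow the Rstat listing: 8+8+6*4+2+2 = 44 bytes,
# little endian, with no padding between fields.
STAT_FORMAT = "<QQIIIIIIHH"
Stat = namedtuple("Stat",
                  "uoid size mode uid gid atime mtime ctime type minor")

# A few of the permission bits listed under Tcreate.
OWNER_READ, OWNER_WRITE = 0x100, 0x80
GROUP_READ, OTHER_READ = 0x20, 0x4
DIRECTORY = 0x80000000

def pack_stat(stat: Stat) -> bytes:
    """Serialize a Stat tuple into the 44-byte wire form."""
    return struct.pack(STAT_FORMAT, *stat)

def unpack_stat(raw: bytes) -> Stat:
    """Parse the 44-byte wire form back into a Stat tuple."""
    if len(raw) != struct.calcsize(STAT_FORMAT):
        raise ValueError("stat structure must be exactly 44 bytes")
    return Stat(*struct.unpack(STAT_FORMAT, raw))

example = Stat(uoid=1, size=512,
               mode=OWNER_READ | OWNER_WRITE | GROUP_READ | OTHER_READ,
               uid=100, gid=100, atime=0, mtime=0, ctime=0, type=0, minor=0)
raw = pack_stat(example)
print(len(raw), oct(unpack_stat(raw).mode))
```

The mode built here corresponds to the familiar UNIX permission 0644; setting `DIRECTORY` in the permission word of a Tcreate would request a directory instead of a plain object.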
5 Usage Notes and Issues
One issue that is not overtly addressed by this protocol is the
implementation of multiple groups. The dp open
message checks only one user and group at a time.
The effect of multiple groups can be achieved by issuing a series
of clone and open messages, passing
different credentials for each open message. The
file may end up multiply open. The server can then clunk the
cids that it doesn't use.
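One way to drive this sequence is sketched below in Python. It is a sequential variant, not the scheme described above: it stops at the first credential that opens successfully and clunks each failed clone immediately, so the file does not end up multiply open. The function `open_with_any` and the `send` callback are hypothetical, and credentials are attached with the Tcredclone message because Topen itself carries none.

```python
import struct

TOPEN, TCLUNK, TCREDCLONE = 7, 15, 27

def open_with_any(send, cid, credentials, mode, first_free_cid, first_tag=1):
    """Try each 8-byte credential in turn; return the opened cid or None.

    send(message) is a hypothetical helper that delivers one request and
    returns True for a normal reply and False for an Rerror.
    """
    tag = first_tag
    for newcid, cred in enumerate(credentials, start=first_free_cid):
        if len(cred) != 8:
            raise ValueError("credentials are 8 bytes")
        # Tcredclone tag[2] oldcid[2] credentials[8] newcid[2]
        cloned = send(struct.pack("<HHH8sH", TCREDCLONE, tag, cid, cred, newcid))
        tag += 1
        if not cloned:
            continue  # assumes a failed clone leaves newcid unbound
        # Topen tag[2] cid[2] mode[1]
        opened = send(struct.pack("<HHHB", TOPEN, tag, newcid, mode))
        tag += 1
        if opened:
            return newcid
        # Tclunk tag[2] cid[2]: discard the clone that could not be opened.
        send(struct.pack("<HHH", TCLUNK, tag, newcid))
        tag += 1
    return None
```

A caller would pass one credential per group of interest and treat a `None` result as a permission failure.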
The current protocol does not support hard links. Hard links are
a bad idea, inducing a rich variety of puzzling behaviors for file
system consistency checkers, and I don't really want to support them
if I can avoid it. It may be unavoidable.
5.1 Pipes
Another issue that is not obviously addressed is how to deal with
pipes and other objects that are really one object with multiple
connection points. The general idea is that the pipe server simply
serves up an infinite number of pipe hierarchies that are all named
clone. Opening a clone file has the
effect of implicitly creating a new hierarchy rather than opening the
existing one. Each hierarchy is a directory supporting two names
(e.g. 0 and 1) corresponding to the
end points of the pipe.
Fill me in!
6 Background
The design of the Dionysix name space mechanism has proven
challenging. A number of options were debated along the way:
- One domain per disk file.
- One domain per open disk file.
- One domain per file system.
The "one domain per file" approach was used by KeyNIX (see: The
KeyKOS Nanokernel® Architecture). While this approach is
probably simplest, each file domain requires at least one page for a
stack. On my Linux system at home, the breakdown of files is as
follows:
| Count | Description | Percentage |
| 31,071 | All files | 100.0% |
| 18,789 | Files <= 4096 bytes | 60.5% |
| 13,134 | Files <= 2048 bytes | 42.3% |
| 5,301 | Files <= 1024 bytes | 17.0% |
Given this, the one domain per file design seems prohibitive: for
most files, the stack page of the domain would cost about as much
storage as the file itself.
The "one domain per open file" approach is more
promising. The principal problem with this approach lies in the need
for reference counting. If a UNIX server manages to die without
closing its currently open files, the open file objects could last
forever. In the end, these dangling open files represent a serious
challenge for robustness.
The final design, one domain per file system, is the one we have
adopted. File systems should
not go away when the referencing UNIX server dies, so the dangling
reference problem does not arise. In addition, systems like Plan 9
and NFS have adopted this approach successfully, so there is some
reason to be optimistic that it will work correctly.
Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of
redistribution, see the
GNU General Public License