Some very thought-provoking ideas about OS architecture.
Tue, 22 Jun 1999 15:11:05 -0400

Here's the response to Linus and others on the messaging/syscall discussion.

Linus wrote:

>That's why the OS boundary HAS to be equivalent to
>    read(handle, buffer, size)
>and NOT be equivalent to
>    handle->op(READ, buffer, size);
>because by definition, if you can do the "handle->op" lookup, then it's
>not a OS boundary any more - or at least it is a very BAD one.  See?

Linus and several other people seem to share the same misunderstanding about
what I wrote, so I obviously said it badly.

The confusion, I think, is over whether the descriptor dereference (I did not
*say* handle, and I did not *mean* handle) is done by the application or the OS.
Several people assumed that I meant it should be done by the application.
Indeed, if the application is able to dereference the name (in which case it
would indeed be a handle), then there is no real protection boundary.  This
would indeed be a poor design.

The issue I was trying to get at is both more simple and more complicated.  The
real question is: "What is an operating system?"

Never mind whether you do it all in one executable (monolithic style) or as a
microkernel.  There is a core nucleus to any operating system that demultiplexes
and dispatches events.  In typical designs, these events include interrupts and
system calls.  I am proposing that a better answer is for the nucleus to
demultiplex and dispatch interrupts and *descriptors*, and then let the
dispatched-to code figure out what call was made.

If you are building a monolithic system, the change in demultiplex order
(operation first vs. descriptor first) has no impact on performance.

That said, there are a number of reasons to prefer it:

1. You might want to distribute your system later.  In a distributed system you
want to know the target (i.e. the object) of the syscall, ship the work off, and
let it be done by that object.
2. You might want to restrict what syscalls an application can do.  In practice
this is invariably about limiting the objects that the application can
manipulate.  Controlling the system calls is a rather bad and indirect way to do
this.  It would be better to control the object access more directly.
3. It obviates the need for all objects to export the same system call

Before you jump on this, consider that the all of the file I/O calls in UNIX
area actually capability invocations.  So is the open() call: open() implicitly
references either the current root descriptor or the CWD descriptor, both of
which are part of the process state.  If you want to have a discussion in
UNIX-centric terms, I'm arguing that *all* system calls should rely on some
descriptor in this way, and that it should be well-defined what happens when the
descriptor is void.

For example, the whole notion of controlling signal delivery on the basis of
parent/child trees is a bad idea.  A better design would have explicit
per-process descriptors, and give authority to signal based on whether the
signaller held the appropriate descriptor.  Such a design *can* support the
current semantics, but is not limited to the current semantics.

Ultimately, I'ld note that when you strip away the syntax transformations you
discover that above the vnode boundary UNIX is a capability system, with the
notable exception of the signalling and process control mechanisms, which are
rather crufty.  With the introduction of /proc, processes have been given
descriptors as well.  What we now need to do is find a way to eliminate the
legacy interfaces that render principled security in UNIX so difficult.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Phone: +1 914 784 7085  (Tieline: 863)
Fax: +1 914 784 7595

Bill Huey <> on 06/21/99 07:07:13 PM

To: (Linus Torvalds)
cc: (bcc: Jonathan S Shapiro/Watson/IBM)
Subject:  Re: Some very thought-provoking ideas about OS architecture.

> because by definition, if you can do the "handle->op" lookup, then it's
> not a OS boundary any more - or at least it is a very BAD one.  See?
>         Linus

I'm going to wait before I comment on this one.

Some of what's going on is a misrepresentation of
current OS theory within acedemic circles, which changes
the context surrounding the message passing thread.

Syscall overhead is an issue, but things in the late 90's
are different now, so folks arguments are rather dated.

I'm going to wait for the IBM guy, to reply first.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
Please read the FAQ at