Re: i386 compatibility. shapj@us.ibm.com
Fri, 27 Aug 1999 18:02:50 -0400

Apologies for the delay in replying -- it's been a busy week.

You're pushing a good idea, but just for a moment let me explore the contrary view.

> On an i386, it is possible to manage the whole system at
> user-level (with Privilege Level 3 code), except for running the
> privileged instructions. That includes managing page tables, OS data,
> descriptor tables, etc.
> Is this possible on non-i386 architectures?

In general the answer is yes. Most processors have a few instructions that can only be executed in supervisor mode -- the "load address space pointer" is probably the most obvious one. If the operating system wishes to manage address spaces from user mode, it can establish an address space that includes the system mapping table structures. The mapping structures can then be manipulated by this code provided that care is taken to ensure that translation buffers (TLBs) get flushed when appropriate. On some processors, TLB invalidate instructions are privileged.
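
As a minimal sketch of the point about flushing, assume the kernel has mapped one i386-style page table into the manager's address space. The names upt_map, upt_unmap, and counting_flush are hypothetical, and the flush hook stands in for whatever small privileged stub actually executes the TLB invalidate instruction:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u
#define PTE_WRITE   0x2u
#define NPTES       1024u   /* entries in one i386 page table */

/* Hypothetical hook: on real hardware this would trap into the small
 * privileged stub that executes the TLB invalidate instruction. */
typedef void (*tlb_flush_fn)(uint32_t vaddr);

/* A page table that the kernel has mapped into the manager's address space. */
struct user_page_table {
    uint32_t     pte[NPTES];
    tlb_flush_fn flush;
};

/* Stand-in flush for experimentation: it only counts invalidations. */
unsigned flush_count;
void counting_flush(uint32_t vaddr) { (void)vaddr; flush_count++; }

uint32_t pte_index(uint32_t vaddr) { return (vaddr >> 12) & (NPTES - 1u); }

/* Install a mapping. A flush is needed only when a translation that was
 * already present, and so possibly cached, is being replaced. */
void upt_map(struct user_page_table *pt, uint32_t vaddr,
             uint32_t frame, uint32_t flags)
{
    assert(pt->flush != NULL);
    uint32_t i = pte_index(vaddr);
    uint32_t old = pt->pte[i];
    pt->pte[i] = (frame & ~0xFFFu) | (flags & 0xFFFu) | PTE_PRESENT;
    if (old & PTE_PRESENT)
        pt->flush(vaddr);
}

/* Remove a mapping. The stale translation must always be flushed. */
void upt_unmap(struct user_page_table *pt, uint32_t vaddr)
{
    assert(pt->flush != NULL);
    uint32_t i = pte_index(vaddr);
    if (pt->pte[i] & PTE_PRESENT) {
        pt->pte[i] = 0;
        pt->flush(vaddr);
    }
}
```

Only the flush hook needs privilege here; everything else is ordinary stores to memory that happens to hold a page table.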

This statement, however, oversimplifies things a bit. First, coordinating mapping changes on a multiprocessor requires interprocessor signalling (because most multiprocessors do not broadcast TLB invalidates). This tends to require that the invalidation operation be implemented in supervisor mode. It almost certainly *should* require special authority on a multiprocessor, lest an application implement a denial of service attack by rapidly broadcasting random TLB invalidations.

Also, the events that *create* new mappings (page faults) are easier to handle outside the kernel than the events that *remove* mappings, many of which result from multiplexing of the machine. Ageing, for example, causes a page to be swapped out, requiring that all mappings to that page be invalidated. The address validator must be kept informed, and if it is not in the kernel, care must be taken to ensure that the address space manager does not itself get paged out.

Finally, note that there really isn't any benefit, or even marginal added security, in managing these things outside the kernel. A program (user mode or not) that holds mappings for the entire physical memory is effectively in the same protection domain as the kernel anyway. Execution of privileged instructions is not an issue we are really concerned about for such a program, as it stands in a position to rewrite the kernel code in any case. Perhaps more important, other code needs to do things like pin pages temporarily, and there is a real need for low overhead on that sort of operation.

The cost of removing this sort of code from the kernel *is* pretty clear. EROS strikes a compromise by defining *logical* structures that control mapping. These are manipulated by user-level code. The associated physical structures are built by the kernel.
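
The split between logical and physical structures can be illustrated with a small sketch. This is not the actual EROS data structure; logical_map_entry, residency, and kernel_build_pte are hypothetical names chosen only to show the division of labor: user-level code edits an entry that names an object, and only the kernel turns that into an entry naming a physical frame.

```c
#include <stddef.h>
#include <stdint.h>

#define LM_WRITE    0x1u   /* permission bit in the logical entry */
#define PTE_PRESENT 0x1u   /* bits in the hardware entry */
#define PTE_WRITE   0x2u

/* Logical entry, edited by user-level code. It names an object page,
 * never a physical frame. */
struct logical_map_entry {
    uint32_t object_id;   /* hypothetical object name */
    uint32_t perms;       /* LM_* bits */
    int      valid;
};

/* Kernel-private: which frame currently holds which object page. */
struct residency {
    uint32_t object_id;
    uint32_t frame;       /* page-aligned physical address */
};

/* Kernel side: build the hardware entry from the logical one.
 * Returns 0 (not present) if the entry is invalid or the object is not
 * resident, in which case the kernel would have to fetch the object. */
uint32_t kernel_build_pte(const struct logical_map_entry *e,
                          const struct residency *res, size_t nres)
{
    if (!e->valid)
        return 0;
    for (size_t i = 0; i < nres; i++) {
        if (res[i].object_id == e->object_id) {
            uint32_t pte = (res[i].frame & ~0xFFFu) | PTE_PRESENT;
            if (e->perms & LM_WRITE)
                pte |= PTE_WRITE;
            return pte;
        }
    }
    return 0;
}
```

Because the user-level side never sees a frame number, the kernel remains free to move or evict pages without consulting it.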

> I still think, that limiting drivers to not
> have access to privileged instructions they don't need, or to all
> memory, could be closer to the principle of least privilege

It would. The same goes for limiting access to the device registers of other devices.

In practice, device drivers need access to all of memory, because they need to put the data where the user says, and that could ultimately be anywhere. The alternative is to take the data first into device driver memory and then either copy it or do some kind of page exchange with the client process. Either of these latter operations is quite expensive. In the case of networking, for example, it would be difficult to keep up at 100 Mbit/s or higher speeds with such a multi-copying design.

> The issue of crossings of ring-levels is resolved by the simple idea of
> code always running at privilege level of 3, except for VERY few pieces
> of privileged-instructions code.

Well, not really. The catch is that to switch from one ring-three process to another you need to go by way of a ring 0 process (that's slightly oversimplified, but roughly accurate). Therefore, the design you propose actually leads to *more* ring crossings. You end up taking a transition into ring zero to set up the data transfer, and then two additional ring 3 to ring 0 to ring 3 transfers: one from client app to driver and the other to return. The cost of this is high. Even if the driver is manipulating a relatively slow device such as a disk, the time you spend doing ring crossings is time that you do *not* spend doing useful computations.

In EROS, the interprocess ring crossing carries added semantics -- an implicit mutual exclusion -- that often proves to *simplify* the program that you switch to. In this situation the cost of the ring transfers is typically made up for by the simplification.

> A malicious driver could always overwrite the
> memory using a controlled Bus Master, but a buggy one wouldn't

That depends greatly on the bug. Even bugs like wild pointers have other consequences. Containerizing the driver will cause the driver to halt, but will often leave the hardware device in a state that eventually locks up the machine.

Having covered some disadvantages, here are some ways to get the best of both worlds:

Drivers can usually be divided into two distinct parts: the part that manipulates the device hardware (registers, DMA channels, and the like), and the part that does policy (flow control, request sorting, and the like). The former does not in practice benefit from being outside the kernel. The latter can often be removed with perfectly acceptable overhead. Also, in a system that admits physical memory as a resource it is possible to imagine passing around "descriptors" to physical memory, and thereby authorizing drivers to modify only what they ought to in a given invocation. This is much trickier in EROS, for example, where there is a single-level store to be considered.
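
To make the "descriptor" idea concrete, here is a minimal sketch under the assumption that a descriptor is simply a physical range plus a write permission, handed to the driver for one invocation. The names phys_desc and desc_allows are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor: authority over one physical range. */
struct phys_desc {
    uint64_t base;      /* first physical address covered */
    uint64_t len;       /* number of bytes covered */
    bool     writable;  /* may the holder modify the range? */
};

/* Would a transfer of n bytes at addr stay within the descriptor?
 * Written so that addr + n is never computed and cannot overflow. */
bool desc_allows(const struct phys_desc *d, uint64_t addr,
                 uint64_t n, bool write)
{
    if (write && !d->writable)
        return false;
    if (addr < d->base)
        return false;
    uint64_t off = addr - d->base;
    if (off > d->len)
        return false;
    return n <= d->len - off;
}
```

The hardware-facing part of the driver would make this check before programming a DMA transfer, so a wild pointer in the policy part is caught rather than written through.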

I have considered removing drivers from the EROS kernel. This is not so much for protection as for the advantages of demand loading only those drivers that are needed. The tricky part is dealing with the fact that the EROS process model assumes persistence, and persistence isn't really appropriate for bottom-layer device drivers. Also, there is a slight challenge in getting the disk drivers loaded, as without the disk drivers we aren't going to make much progress loading anything else.

Persistence being the tricky part, I eventually arrived at the following approach: imagine that there is a reserved range of objects having the following properties:

  1. They are exempted from persistence.
  2. Once loaded, they are pinned until explicitly released.

One might then arrange for an external program to manage a "driver file system" in this range. This file system would be so designed as to place disk drivers at the front of the reserved range. On startup, the bootstrap loader would preload these objects into memory, from which point the kernel would be able to successfully invoke the disk driver processes. There is a bit more to it than this, but you get the general idea.
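
The two properties and the boot-time preload can be sketched as three predicates. All constants and names here (RESERVED_BASE, RESERVED_COUNT, DISK_DRIVER_OBJS, struct object) are assumptions made for illustration, not values from EROS:

```c
#include <stdbool.h>
#include <stdint.h>

#define RESERVED_BASE    0x1000u  /* assumed start of the reserved range */
#define RESERVED_COUNT   64u      /* assumed size of the reserved range */
#define DISK_DRIVER_OBJS 4u       /* assumed: disk drivers sit at the front */

struct object {
    uint32_t id;
    bool     loaded;
    bool     released;  /* set when the pin is explicitly dropped */
};

bool in_reserved_range(uint32_t id)
{
    return id >= RESERVED_BASE && id - RESERVED_BASE < RESERVED_COUNT;
}

/* Property 1: reserved objects are exempted from persistence. */
bool should_checkpoint(const struct object *o)
{
    return !in_reserved_range(o->id);
}

/* Property 2: once loaded, reserved objects stay pinned until released. */
bool may_evict(const struct object *o)
{
    if (!o->loaded)
        return false;            /* nothing in memory to evict */
    if (!in_reserved_range(o->id))
        return true;
    return o->released;
}

/* The bootstrap loader preloads only the front of the range. */
bool preload_at_boot(uint32_t id)
{
    return id >= RESERVED_BASE && id - RESERVED_BASE < DISK_DRIVER_OBJS;
}
```

Placing the disk drivers at the front means the bootstrap loader needs to know only a base and a count, not the layout of the driver file system.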

The main difficulty is that suddenly we are spreading sensitive code all over the place, and that deserves some care and skepticism. The "principle of least privilege" is important, but so is being sure that it all works correctly. The principle of least privilege should not be carried to such an extreme that it results in loss of clarity of function. The latter is every bit as important.

Anyway, those are some rambling thoughts on your question.

Jonathan S. Shapiro, Ph. D.
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 7595