Kernel Stacks in EROS 2.0

    This note describes an unimplemented design for long-blocking kernel stacks in the EROS kernel. This requires a significant reconceptualization of certain parts of the architecture.

    Implementation of this design has been deferred in favor of user mode drivers.

Introduction

Over the past few years, EROS has started to deviate significantly from KeyKOS (its predecessor). Some examples of the differences include:

  • EROS includes provision for a processes that are scheduled by the general-purpose scheduler but execute entirely within the kernel. These were introduced in order to simplify the driver model.

  • The installation and startup mechanism for EROS is considerably different. For example, EROS installable system images are precompiled by a program linker that prebuilds a disk image by hand. EROS has no equivalent to the KeyKOS ``big bang''.

  • Disk ranges are untyped. EROS can place pages and nodes in the same range, which had surprisingly positive consequences for storage allocation and management.

  • EROS is moving toward a revised IPC specification (IPC version 4) allowing multiple unbounded strings without violating the idea that there should be atomic units of operation in the kernel.

This design note describes yet another significant change: retiring of the interrupt-style kernel design. I expect that this abandonment will generate considerable controversy.

Interrupt-Style Kernels

EROS is currently an interrupt-style kernel. For those who may not know, here is a brief summary of how an interrupt-style kernel works.

Application processes perform calls that enter the kernel to perform some operation. While in the kernel, they run on a kernel stack. In a uniprocessor there is one kernel stack. In a multiprocessor there is one kernel stack per CPU. If the kernel operation runs to completion without blocking, things proceed pretty much as you would expect. The kernel flow of control proceeds down the stack, does some work, and returns up the stack to some process.

If the kernel operation blocks, then what happens is as follows:

  • The process is queued on some queue in the kernel.

  • If necessary, the process instruction pointer is adjusted so that the process will reissue the system call when it wakes up.

  • The process is pushed out of the kernel and back to user level. It therefore does not occupy a kernel stack or reserve any kernel resources while it is asleep. In fact, it is not permitted to do so.

One way to think about this is that when a process goes to sleep it throws an exception that is caught at the user/supervisor boundary.

This style of kernel design imposes three design constraints:

  1. All necessary resources must be allocated before a kernel call is permitted to modify anything that is visible outside the kernel. This requires that each possible flow of control through the kernel has a well-defined ``commit point.'' Prior to the commit point no modifications to externally visible state are permitted. After the commit point the call is guaranteed to complete because it already posesses all of the resources that it needs. In fact, after the commit point the code is not permitted to perform further resource allocation.

  2. Kernel calls are not permitted to block in the kernel while holding resources.

  3. Because of rules (1) and (2), it is difficult to implement operations that block more than once (i.e. sequentially), because the first block causes the entire call to be restarted from the beginning. The consequence of this is that nested transactions must be converted into explicit state machines. Sometimes this is a natural programming style. Often it is not.

Unfortunately, if a process has been bounced out of the kernel, there is no guarantee that it will retry the system call. As a result, there is no guarantee that a state machine driven by application calls will ever proceed to completion. This means that all intermediate states must be ``safe.'' Sometimes this causes difficulty.

One example of such a difficulty is bus configuration. When probing a SCSI bus to discover connected devices, the code must examine each device in turn to see where devices are connected, and must then ask the device to identify itself. Each of these steps involves blocking to wait for a response. Note that this activity wants to be ``all or nothing.'' Unfortunate things can happen if the entire probe sequence does not run to completion.

In the current EROS (and KeyKOS) systems, this means that things like SCSI probe cannot be driven by application code, because there is no guarantee that the application will run the state machine to completion. Instead, activities like this are converted into kernel state machines that issue the I/O requests on behalf of the kernel rather than on behalf of some process. This in turn complicates the main part of the driver, because it must now be prepared to distinguish process-originated requests from kernel-originated requests (one can block, the other cannot).

Stepping back a level, what is happening here is that a policy decision (the decision to use an interrupt-style kernel) is interfering with a reasonable and straightforward implementation. All of the existing hard data says that there should be no performance difference between an interrupt-style kernel and a stack-based kernel. Stack-based kernels are by far the more common kernel style, and programmers find this style much more familiar and therefore much more maintainable.

Given all of this, it is worth stepping back to describe why EROS is an interrupt-style kernel and to re-examine whether perhaps this decision can and should be changed.

Motivation for the Interrupt Kernel Design

In digging through the early GNOSIS and KeyKOS documentation, I did not find much rationale documentation for the decision to implement an interrupt-style kernel. The following is my conjecture based on various early conversations I had with Norm Hardy, Charles Landau, Bill Frantz, and others. I hope that they will correct any errors and/or add to the rationale for archival purposes.

  • Stack-based kernels make it easy to allocate multiple resources in sequence. This can lend itself to deadlock in careless implementations (KeyKOS certainly was not a careless implementation).

  • Kernel stacks tend to point to kernel resources. If the system is restarted, these resources may end up at different addresses or may not exist at all. This means that all system calls must be prepared to restart from scratch regardless.

  • The original KeyKOS kernel was implemented in assembly langauge for the S/360. In assembly language, a ``straight through'' style certainly led to greater efficiency, and was probably more natural than it is in a higher level language.

  • KeyKOS was originally implemented for the S/360, and the kernel ran in unmapped mode. It may have been felt that a process having a stack in such a mode was prone to errors.

  • In most stack-based operating systems, control is returned to the calling process with very few exceptions (in UNIX, the only exception is the fork() system call). In KeyKOS and EROS any kernel invocation can return to an arbitrary process specified by the invoker. One result of this is that there comes a point in the code when the per-process kernel stack is running on behalf of the ``wrong'' process. It is therefore a bit awkward to think of this stack as being associated with a process. Because of this, the conventional interpretation of the stack-based kernel model (which is that every procss has a per-process stack) is conceptually a bad fit for EROS.

  • Finally, I'm sure there was some question about whether kernel stacks might need to be checkpointed, and how this might create ``interesting'' new complications for the kernel implementation.

Alternatives to Change

In the most immediate sense, the introduction of kernel stacks is motivated by the need for device drivers. There is a large body of device driver code in the UNIX world that we would like to use rather than rewrite, and for better or worse this code assumes that things can block in the kernel without being forced to restart the kernel invocation from scratch. I've been looking at this issue for a long time, and I've finally decided that either the drivers need to be rewritten or EROS needs to change.

An alternative to the design described here is to switch completely to user-mode drivers. A design for user mode drivers is here. In principle, all of the drivers could be moved to user level, and this is what we are currently planning to do. However, user level drivers introduce two complications:

  • There is a performance penalty due to extra data copies.

  • The user-mode drivers must be non-persistent, and there is an impedance matching issue in crossing the persistence boundary.

We herein describe an alternative, equally feasible design.

Deep Stacks

A ``deep stack'' is similar to a per-process kernel stack, except that it isn't per-process. Instead, the kernel has a fixed number of kernel stacks that is something on the order of nCPU+32 or nCPU+64. At all times, a CPU has assigned to it one of these stacks.

In an ordinary capability invocation, the process enters the kernel on the current CPU's kernel stack. It either runs the invocation to completion or stalls and restarts the execution from the beginning of the system call.

With deep stacks, we now introduce the possibility of an invocation that will remain ``live'' in the kernel. Such an invocation is known as a ``deep invocation.'' These are used primarily by driver calls, and these are the calls that make use of deep stacks. Here is how:

At some point in it's execution, the kernel code realizes that it must make a deep invocation. At this threshold, it must allocate a deep stack. The purpose of allocating this deep stack is to ensure that every CPU will always have a stack. If the current process ends up asleep in such a way that it holds onto its current kernel stack, it will supply the alternate stack to the CPU so that the CPU can use it to continue execution on behalf of other processes. Conversely, if the current process returns to user land successfully, it simply returns the ``spare'' stack to the available pool. Note that the stack state of the active thread of control is never copied. This is merely an exchange protocol for the stack.

Activity vs. Process Kernel Stacks

Because an EROS process can be swapped out, we must instead associate the deep stack with the Activity data structure, which remains in memory as long as the process is either running or sleeping in the kernel (this used to be known as the Thread structure). Actually, this association makes a lot of sense, because while a kernel stack is always devoted to some Activity, a given Activity does not always return to the same process. Conceptually, then, we can think of the original process as being blocked at the user/kernel boundary waiting for some Activity to come along and activate it.

Checkpointing

There is a further constraint on deep stacks: they are not checkpointed. This limitation is imposed for several reasons:

  • Any resource pointers in the stack point to resources that may have moved or may no longer exist.

  • If a recovery occurs, any setup that the thread of control on the deep stack has performed has been lost. Therefore, it makes no sense to checkpoint it in any case.

This means that a deep stack activity must correctly handle the possibility that it will be restarted entirely from the beginning. There are three ways that this can be done:

  1. Arrange that the whole process is restartable. For example, the SCSI probe example discussed above can safely be restarted from the beginning if the machine has crashed in the middle.

  2. Arrange that the whole activity is idempotent. For example, writes to a disk drive can often be repeated if it is going to the same sector as before. Note that in general it cannot be assumed that the previous write completed, but it also cannot be assumed that the write did not begin. The value on the disk may actually reflect a partial, incomplete, and therefore invalid block (yes, this actually happens).

  3. Arrange that the capability by which the invocation was performed does not survive checkpoint. This is the most common solution and the least obvious.

    The common use of deep invocations is device invocations. If these originate directly from an application process, then they originated from a device capability. When the system restarts, all outstanding device capabilities are invalidated. When the calling process retries the operation, it will no longer be calling the driver at all. Instead it will call a zero number capability, whereupon it will be obliged to make arrangements to reacquire access to the device.

    Alternatively, the device invocation may have originated as the consequence of an object fault (either a page/node key invocation or an address reference fault). Note that page and node key invocations are designed to be shallow stack invocations, and are therefore restartable. Similarly, page faults are idempotent.

Comments

Given how long we (Norm, Charlie, Bill, and I) collectively pondered this issue, it is surprising in retrospect how simple the solution can be. We became trapped an the ``either-or'' assumption: that a kernel was either an interrupt kernel or a stack based kernel, and neglected the possibility of a hybrid design.


Copyright 2001 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License