Kernel Stacks in EROS 2.0This note describes an unimplemented design for long-blocking kernel stacks in the EROS kernel. This requires a significant reconceptualization of certain parts of the architecture. Implementation of this design has been deferred in favor of user mode drivers. IntroductionOver the past few years, EROS has started to deviate significantly from KeyKOS (its predecessor). Some examples of the differences include:
This design note describes yet another significant change: retiring of the interrupt-style kernel design. I expect that this abandonment will generate considerable controversy. Interrupt-Style KernelsEROS is currently an interrupt-style kernel. For those who may not know, here is a brief summary of how an interrupt-style kernel works. Application processes perform calls that enter the kernel to perform some operation. While in the kernel, they run on a kernel stack. In a uniprocessor there is one kernel stack. In a multiprocessor there is one kernel stack per CPU. If the kernel operation runs to completion without blocking, things proceed pretty much as you would expect. The kernel flow of control proceeds down the stack, does some work, and returns up the stack to some process. If the kernel operation blocks, then what happens is as follows:
One way to think about this is that when a process goes to sleep it throws an exception that is caught at the user/supervisor boundary. This style of kernel design imposes three design constraints:
Unfortunately, if a process has been bounced out of the kernel, there is no guarantee that it will retry the system call. As a result, there is no guarantee that a state machine driven by application calls will ever proceed to completion. This means that all intermediate states must be ``safe.'' Sometimes this causes difficulty. One example of such a difficulty is bus configuration. When probing a SCSI bus to discover connected devices, the code must examine each device in turn to see where devices are connected, and must then ask the device to identify itself. Each of these steps involves blocking to wait for a response. Note that this activity wants to be ``all or nothing.'' Unfortunate things can happen if the entire probe sequence does not run to completion. In the current EROS (and KeyKOS) systems, this means that things like SCSI probe cannot be driven by application code, because there is no guarantee that the application will run the state machine to completion. Instead, activities like this are converted into kernel state machines that issue the I/O requests on behalf of the kernel rather than on behalf of some process. This in turn complicates the main part of the driver, because it must now be prepared to distinguish process-originated requests from kernel-originated requests (one can block, the other cannot). Stepping back a level, what is happening here is that a policy decision (the decision to use an interrupt-style kernel) is interfering with a reasonable and straightforward implementation. All of the existing hard data says that there should be no performance difference between an interrupt-style kernel and a stack-based kernel. Stack-based kernels are by far the more common kernel style, and programmers find this style much more familiar and therefore much more maintainable. Given all of this, it is worth stepping back to describe why EROS is an interrupt-style kernel and to re-examine whether perhaps this decision can and should be changed. Motivation for the Interrupt Kernel DesignIn digging through the early GNOSIS and KeyKOS documentation, I did not find much rationale documentation for the decision to implement an interrupt-style kernel. The following is my conjecture based on various early conversations I had with Norm Hardy, Charles Landau, Bill Frantz, and others. I hope that they will correct any errors and/or add to the rationale for archival purposes.
Alternatives to ChangeIn the most immediate sense, the introduction of kernel stacks is motivated by the need for device drivers. There is a large body of device driver code in the UNIX world that we would like to use rather than rewrite, and for better or worse this code assumes that things can block in the kernel without being forced to restart the kernel invocation from scratch. I've been looking at this issue for a long time, and I've finally decided that either the drivers need to be rewritten or EROS needs to change. An alternative to the design described here is to switch completely to user-mode drivers. A design for user mode drivers is here. In principle, all of the drivers could be moved to user level, and this is what we are currently planning to do. However, user level drivers introduce two complications:
We herein describe an alternative, equally feasible design. Deep StacksA ``deep stack'' is similar to a per-process kernel stack, except that it isn't per-process. Instead, the kernel has a fixed number of kernel stacks that is something on the order of nCPU+32 or nCPU+64. At all times, a CPU has assigned to it one of these stacks. In an ordinary capability invocation, the process enters the kernel on the current CPU's kernel stack. It either runs the invocation to completion or stalls and restarts the execution from the beginning of the system call. With deep stacks, we now introduce the possibility of an invocation that will remain ``live'' in the kernel. Such an invocation is known as a ``deep invocation.'' These are used primarily by driver calls, and these are the calls that make use of deep stacks. Here is how: At some point in it's execution, the kernel code realizes that it must make a deep invocation. At this threshold, it must allocate a deep stack. The purpose of allocating this deep stack is to ensure that every CPU will always have a stack. If the current process ends up asleep in such a way that it holds onto its current kernel stack, it will supply the alternate stack to the CPU so that the CPU can use it to continue execution on behalf of other processes. Conversely, if the current process returns to user land successfully, it simply returns the ``spare'' stack to the available pool. Note that the stack state of the active thread of control is never copied. This is merely an exchange protocol for the stack. Activity vs. Process Kernel StacksBecause an EROS process can be swapped out, we must instead associate the deep stack with the Activity data structure, which remains in memory as long as the process is either running or sleeping in the kernel (this used to be known as the Thread structure). Actually, this association makes a lot of sense, because while a kernel stack is always devoted to some Activity, a given Activity does not always return to the same process. Conceptually, then, we can think of the original process as being blocked at the user/kernel boundary waiting for some Activity to come along and activate it. CheckpointingThere is a further constraint on deep stacks: they are not checkpointed. This limitation is imposed for several reasons:
This means that a deep stack activity must correctly handle the possibility that it will be restarted entirely from the beginning. There are three ways that this can be done:
CommentsGiven how long we (Norm, Charlie, Bill, and I) collectively pondered this issue, it is surprising in retrospect how simple the solution can be. We became trapped an the ``either-or'' assumption: that a kernel was either an interrupt kernel or a stack based kernel, and neglected the possibility of a hybrid design. Copyright 2001 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License |