A New Process State Jonathan Shapiro (shap@cobra.cis.upenn.edu)
Sun, 10 Sep 1995 13:25:29 -0400

What follows is a description of an idea in progress - one that may well get abandoned. It is motivated by two concerns:

o Piss Poor Pentium Privilege Performance - the transition from user to kernel mode takes WAY too long.

o Real-time guarantees.

In the current KeyKOS design, a process that blocks in the kernel gets bounced all the way back to user mode, with its PC pointed at the trap instruction that started the whole mess. This has the following consequences:

1 There is no need to distinguish between processes in the kernel and processes at user level.

2 Some other process can alter the instruction stream of the stalled process, thereby causing undetectable breakage of the kernel queue that the process is stalled on. This leads to the worry queue.

3 The process, upon resumption, may discover that some previously available resource is no longer available, and may block on some other queue rather than getting to the resource it was previously waiting for. This is another reason for the worry queue.

In EROS, threads queued on a resource are queued in the order that they should be woken up. Because the worry queue as I currently understand it violates queue ordering, I'm trying to get rid of it.

Idea 1: No more worry queues

[Not convinced I have all the details right, but I'm pretty sure
something along these lines is workable.]

Every queue has an associated "owner" which is tracked by the kernel. The owner is either the most recently awakened domain or the interrupt task that will cause the associated resource to become available.

[ Note that this discussion does not consider I/O completion queues,
which wake up all sleepers. I should probably use a different term for queues that serialize callers vs. queues that wake everyone up. ]

When a domain is woken up from the head of a queue, it becomes the "owner" of that queue. A domain can own at most one queue at a time, and a pointer to the queue currently owned by the domain is kept by the kernel in the DIB. Ownership of a queue implies that the resource associated with that queue is ready for use, and that this domain has been granted the exclusive authority to use it. When the domain has completed its use of the resource, it either wakes up the next stalled domain by transferring ownership (if the resource is ready immediately) or simply gives up ownership (conceptually: transfers ownership to the interrupt task that will kick-start the resource).
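
To make the ownership rule concrete, here is a minimal sketch in C. It is not kernel code: the struct layouts, the fixed-size ring buffer, and every name (queue_stall, queue_wake_head, queue_release, MAX_WAITERS) are assumptions made up for illustration, and a NULL owner stands in for "owned by the interrupt task".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8   /* arbitrary bound, for the sketch only */

struct queue;

/* Hypothetical stand-in for the per-domain DIB entry. */
struct domain {
    int id;
    struct queue *owned;    /* queue this domain currently owns, or NULL */
};

/* A serializing queue: stalled domains in wake-up order, plus an owner. */
struct queue {
    struct domain *owner;                 /* NULL: the interrupt task owns it */
    struct domain *waiters[MAX_WAITERS];  /* ring buffer of stalled domains */
    size_t head, count;
};

/* Stall a domain at the tail of the queue. Returns 0, or -1 if full. */
int queue_stall(struct queue *q, struct domain *d)
{
    if (q->count == MAX_WAITERS)
        return -1;
    q->waiters[(q->head + q->count) % MAX_WAITERS] = d;
    q->count++;
    return 0;
}

/* Wake the head waiter and make it the owner. Returns NULL if nobody waits. */
struct domain *queue_wake_head(struct queue *q)
{
    if (q->count == 0) {
        q->owner = NULL;
        return NULL;
    }
    struct domain *d = q->waiters[q->head];
    q->head = (q->head + 1) % MAX_WAITERS;
    q->count--;
    assert(d->owned == NULL);   /* a domain owns at most one queue */
    q->owner = d;
    d->owned = q;
    return d;
}

/* The owner is done with the resource. If the resource is ready right away,
 * ownership passes to the next stalled domain; otherwise it reverts to the
 * interrupt task (NULL). Returns the newly woken domain, if any. */
struct domain *queue_release(struct domain *d, int resource_ready)
{
    struct queue *q = d->owned;
    if (q == NULL)
        return NULL;
    d->owned = NULL;
    q->owner = NULL;
    return resource_ready ? queue_wake_head(q) : NULL;
}
```

In this sketch the interrupt task would call queue_wake_head() when the resource becomes available, and a domain would call queue_release() when it finishes with the resource.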

When a domain that currently owns a queue enters the kernel, and goes to sleep on some *other* queue, it transfers ownership of the old queue to the next domain in line on the old queue. I think this solves case (3) above completely.

When a domain is altered in a way that causes it to be decached from the DIB, it is assumed to have entered a non-running state. If it owns a queue at the time of decaching, it transfers ownership to the next domain in the queue (thereby waking the next domain up). If a domain leaves the running state for any reason it also relinquishes any queue that it owns.

When a domain enters the kernel because its execution quantum has expired, it is queued on the processor availability queue. This has the effect of transferring ownership of the previously owned queue, if any, and therefore solves case (2), at the cost that the resource may be idle for one full scheduling quantum.
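
The three hand-off rules above (sleeping on another queue, decaching, quantum expiry) all reduce to one operation: give the owned queue to the next domain in line. A rough sketch in C, under the assumption that queues are simple linked lists; all names here (handoff, sleep_on, on_decache, on_quantum_expired) are hypothetical and a NULL owner again means the interrupt task:

```c
#include <assert.h>
#include <stddef.h>

struct queue;

/* Hypothetical stand-in for the per-domain DIB entry. */
struct domain {
    struct queue *owned;    /* queue currently owned, or NULL */
    struct domain *next;    /* link used while stalled on a queue */
};

struct queue {
    struct domain *owner;   /* NULL: owned by the interrupt task */
    struct domain *head, *tail;
};

/* Pass ownership of d's queue (if any) to the next domain in line. */
void handoff(struct domain *d)
{
    struct queue *q = d->owned;
    if (q == NULL)
        return;
    assert(q->owner == d);
    d->owned = NULL;
    struct domain *n = q->head;
    if (n != NULL) {            /* dequeue and wake the next domain */
        q->head = n->next;
        if (q->head == NULL)
            q->tail = NULL;
        n->next = NULL;
        n->owned = q;
    }
    q->owner = n;               /* NULL if nobody was waiting */
}

/* Going to sleep on a queue first gives up any queue already owned. */
void sleep_on(struct domain *d, struct queue *q)
{
    handoff(d);
    d->next = NULL;
    if (q->tail != NULL)
        q->tail->next = d;
    else
        q->head = d;
    q->tail = d;
}

/* Decaching from the DIB means leaving the running state. */
void on_decache(struct domain *d)
{
    handoff(d);
}

/* Quantum expiry is just a sleep on the processor availability queue. */
void on_quantum_expired(struct domain *d, struct queue *cpu_queue)
{
    sleep_on(d, cpu_queue);
}
```

Because sleep_on() always hands off first, a stalled domain never owns a queue, which is what keeps the "at most one queue" rule cheap to maintain in this sketch.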

I believe that this properly preserves queue ordering, and eliminates the need for the worry queue. Given the desire for EROS to support some real-time guarantees, I dislike the possibility of a resource sitting idle for a full scheduling quantum.

Idea 2: New Internally-Visible Process State

The i386/i486/Pentium/P6 family is so DOG slow at going from user to supervisor mode (129+ cycles) that I really would like to avoid doing this more than once per syscall. To avoid it, I have in mind to define a new internally visible process state: KernStall. This state is not visible outside the kernel.

A process in the KernStall state has entered the kernel and saved its registers to the DIB, but has NOT performed any other actions in the kernel. In particular, it has no per-process state on the kernel stack or anywhere in the kernel other than the DIB. The state exists solely to allow us to resume a process without recrossing any privilege boundaries.

If a process in the KernStall state is stopped using the domain service key, it stops with the PC pointing to the currently executing instruction (wherever it would have been if we had sent it all the way back to user mode).

In principle, the PC of a running process is not well-defined. The effects of rewriting the instruction under the PC of a running process (including a stalled process) are undefined, except that doing so is guaranteed NOT to compromise the privilege architecture (e.g. by altering the arguments of a trap instruction out from under the supervisor). If necessary, arguments to the instruction currently executing will be saved to the DIB to enforce this.
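
As an illustration of the last sentence, saving the arguments could look roughly like the C sketch below. The argument count, the field names, and both function names are assumptions for the example only; nothing here is taken from the actual DIB layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TRAP_NARGS 4    /* assumed argument count, purely illustrative */

/* Hypothetical slice of the DIB holding a private copy of the trap args. */
struct dib {
    uint32_t trap_args[TRAP_NARGS];
    int args_saved;
};

/* Copy the arguments out of the user-visible registers at kernel entry. */
void dib_save_trap_args(struct dib *d, const uint32_t *user_regs)
{
    assert(d != NULL && user_regs != NULL);
    memcpy(d->trap_args, user_regs, sizeof d->trap_args);
    d->args_saved = 1;
}

/* The kernel reads arguments only from the saved copy, so later changes
 * to user state cannot alter what the supervisor acts on. */
const uint32_t *dib_trap_args(const struct dib *d)
{
    return d->args_saved ? d->trap_args : NULL;
}
```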

Processes in the KernStall state are conceptually bounced back to user mode for checkpointing.

Once such a process does return from the kernel, it will either restart the instruction (e.g. load that took a page fault) or resume at the next instruction (e.g. trap completion).
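
A small C sketch of how the PC behaves around KernStall, as I read the description above. Only the name KernStall comes from this note; the other state and field names are made up, and the sketch assumes the DIB records the address of the instruction that entered the kernel along with its length.

```c
#include <assert.h>
#include <stdint.h>

enum proc_state { PS_USER, PS_KERNSTALL };  /* PS_USER is a placeholder */
enum entry_kind { ENTRY_FAULT, ENTRY_TRAP };

/* Hypothetical slice of the DIB relevant to resumption. */
struct dib {
    enum proc_state state;
    enum entry_kind entry;
    uint32_t pc;        /* address of the instruction that entered the kernel */
    uint32_t insn_len;  /* length of that instruction in bytes */
};

/* Registers are saved to the DIB and the process stalls inside the kernel. */
void kern_stall(struct dib *d, enum entry_kind kind,
                uint32_t pc, uint32_t insn_len)
{
    d->state = PS_KERNSTALL;
    d->entry = kind;
    d->pc = pc;
    d->insn_len = insn_len;
}

/* PC reported if the process is stopped via the domain service key:
 * the currently executing instruction, as if it were back in user mode. */
uint32_t stopped_pc(const struct dib *d)
{
    return d->pc;
}

/* Leave the kernel once the invocation is finished. A fault restarts the
 * instruction; a completed trap resumes at the next instruction. */
uint32_t kern_exit(struct dib *d)
{
    assert(d->state == PS_KERNSTALL);
    d->state = PS_USER;
    return d->entry == ENTRY_TRAP ? d->pc + d->insn_len : d->pc;
}
```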

The reasons for proposing this change are:

o Allows us to guarantee that when we do run the lead process on a queue that process will really be the next process to access the queue.

o Reduces the duration of multiple sequential stalls by eliminating the privilege boundary crossings.

o Eliminates the instruction stream rewrite stall.

The last point deserves explanation: the domain will not be bounced back to user mode until it has made its way completely through the kernel invocation and is ready to resume running, at which point it is guaranteed to no longer own any queue at all. Therefore, altering the I-stream doesn't jeopardize any queues.

I initially thought that this state needed to be visible to holders of the domain service key, but I no longer think so.

This new state does NOT introduce any new sources of misbehavior when you rewrite the instruction stream under the PC. Altering the instruction stream of a running process was always a race condition. If you rewrite the instruction at the current PC, there is no predicting whether the new stream will get picked up before or after the current instruction. In the presence of HW I-stream prefetch (all modern processors) there's an even bigger window of uncertainty.

In truth, if you didn't stop the process before rewriting the instruction space, there were no guarantees anyway.

Jonathan