Re: Architecture of Backing Store Descriptors
Jonathan Shapiro (shap@viper.cis.upenn.edu)
Tue, 22 Nov 94 21:04:58 -0500

[EROS readers: I've responded to the KeyKOS questions here, though you're welcome to enhance and correct me. I'll respond to the NewSys questions separately. -- shap]

>Jonathan, is there an archive I could get?

The archive is only about five messages, none of which are terribly relevant. What's going (for the moment) under the name of NewSys is a new effort, and you are (regrettably) suffering some of the birth pangs. I'm working up an architectural overview, which should simplify things.

My apologies for the following - the response got to be rather longer than I initially intended. Bryan has asked some good questions, and I doubt that I have answered them fully.

>As a general design principle, the kernel should not depend on a
>non-kernel task for kernel correctness.

This is generally true, but not always. The _usual_ reason for locating part of the system outside the kernel is because it isn't trusted, but that's not necessarily the only reason. Other reasons include design simplicity and debugging convenience.

I think I may have confused an issue or two here by being insufficiently specific. What I said was correct, but incomplete. I was trying to respond to a reasonable question raised by someone who, I suspected, had not had occasion to consider the consequences of user-level drivers. This may have been a disservice.

A service can be placed outside the kernel for many reasons. The issue is failure propagation. In the event that the service fails, the scope of the failure must be limited to the (transitive) clients of the failed service. Ideally, things fail in a way that allows the service to be repaired in situ.
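To make the failure-scope rule concrete, here is a minimal Python sketch. The service names and the dependency graph are hypothetical, not taken from KeyKOS or NewSys. Each service maps to the set of its direct clients, and the affected set is the failed service plus everything reachable from it through client edges. The kernel never appears as a client in the graph, so no user-level failure can reach it.

```python
from collections import deque


def failure_scope(clients_of: dict[str, set[str]], failed: str) -> set[str]:
    """Return everything affected when `failed` dies: the service itself
    plus all of its (transitive) clients, found by breadth-first search."""
    affected = {failed}
    work = deque([failed])
    while work:
        svc = work.popleft()
        for client in clients_of.get(svc, set()):
            if client not in affected:
                affected.add(client)
                work.append(client)
    return affected


# Hypothetical dependency graph: service -> the set of its direct clients.
clients_of = {
    "disk_driver": {"unix_fs", "txn_facility"},
    "unix_fs": {"init", "shell"},
    "init": {"shell"},
}
```

With this graph, `failure_scope(clients_of, "unix_fs")` yields `{"unix_fs", "init", "shell"}`; the transaction facility, which does not depend on the file system, is untouched.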

A corollary is that it is unacceptable for the kernel to block on such a service, though it is acceptable for the kernel to block the clients of the service. It is a design error of major proportions if moving a service outside of the kernel makes it possible for the kernel to arrive at a state where it is unable to make progress due to the failure of a user-level process.
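A minimal sketch of this corollary, assuming a toy model in which a request either reaches a ready server or parks the calling client on that server's wait queue. `Kernel`, `Service`, `invoke`, and `service_ready` are hypothetical names; the only point is that the kernel itself never waits.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    ready: bool = True                              # False while the server is busy, hung, or dead
    waiters: deque = field(default_factory=deque)   # clients blocked on this service


class Kernel:
    def __init__(self):
        self.run_queue = deque()    # clients eligible to run

    def invoke(self, client: str, service: Service) -> str:
        """Deliver a request without ever waiting on the server itself."""
        if service.ready:
            return "delivered"
        # Park the *client* on the service's queue, then return immediately
        # so the kernel can go on scheduling everyone else.
        service.waiters.append(client)
        return "client blocked"

    def service_ready(self, service: Service) -> None:
        """Called when a repaired or restarted server can accept requests again."""
        service.ready = True
        while service.waiters:
            self.run_queue.append(service.waiters.popleft())
```

If the server never comes back, only its waiting clients stay blocked; the kernel's own control flow is unaffected.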

This implies that the kernel must not leave interrupts disabled while running a user task. Actually, I'd go further and suggest that it's almost always the best policy for the kernel to capture asynchronous events in a small buffer of its own rather than wait for a user-level handler to accept them, as these events tend to have strict performance requirements.
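A sketch of such a buffer, assuming a fixed-capacity ring that the interrupt path fills and the user-level driver drains. What should happen on overflow is not specified here; this version counts and drops the event, and coalescing would be an equally plausible choice. The class and method names are illustrative.

```python
class EventRing:
    """Fixed-size kernel-side buffer for asynchronous events (e.g. interrupts).

    The interrupt path only ever calls post(), which never blocks: if the
    user-level driver has not drained the ring, the event is counted as
    lost (an assumed policy) rather than making the kernel wait."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.head = 0        # index of the oldest buffered event
        self.count = 0       # number of buffered events
        self.overflows = 0   # events dropped because the ring was full

    def post(self, event) -> bool:
        if self.count == len(self.slots):
            self.overflows += 1
            return False
        self.slots[(self.head + self.count) % len(self.slots)] = event
        self.count += 1
        return True

    def drain(self) -> list:
        """Hand all buffered events, oldest first, to the user-level driver."""
        out = []
        while self.count:
            out.append(self.slots[self.head])
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        return out
```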

In NewSys, as in KeyKOS, we make a strong distinction between the correctness of the kernel and the correctness of a rehosted operating environment. If init dies, it's pretty clear that UNIX is hosed, but the kernel itself may still be operating correctly. In the KeyKOS context, there are a number of subsystems (e.g. the KeyKOS transaction facility) that run hosted directly on top of the KeyKOS primitives. It would be very unfortunate if a wayward rehosted OS could cause such a service to fail. This would in fact constitute a serious flaw in both the correctness and the security of the system.

One way to think about this is to conceptualize KeyKOS (and NewSys) as providing a virtual machine on top of which a number of "operating environments" may be running simultaneously. In a correct kernel implementation, the failure of one of these environments must not imply the failure of others unless it is due to the failure of some service shared in common by the two environments.

Given this, the situation with init is not quite parallel.

>Another KeyKOS rule, I believe, is that kernel data (stored in "nodes") is fundamentally different from user data (stored in "pages"). Kernel memory can never become user memory or vice versa; kernel memory is replicated in several places on stable storage whereas user memory isn't (or isn't always?); etc.

In most KeyKOS implementations, both types of storage are replicated on disk. Nodes are in some sense more important, because losing certain nodes (e.g. a meter near the top of the tree) can disrupt a large part of the system, while losing a page will in general destroy only a single process. That's a weak argument, but an important consideration if disk space is constrained.

The storage really is qualitatively different, however -- see below.

>It seems to me that the "node" abstraction really just represents storage used by a trusted entity but allocated by an untrusted entity: or, in other words, storage depended upon by a trusted server but used on behalf of an untrusted client.

In KeyKOS, for a variety of reasons, capabilities cannot be stored in user data. Consider, for example, the difficulty of discovering when the last reference to an object goes away. The issue does not arise in systems like Mach, where ports are not persistent.

Another issue is that if capabilities were mapped into user space, user code could forge them, which would make the system entirely untrustworthy.

Actually, nodes aren't allocated in the way that you seem to mean. They are transferred, and access to them can be rescinded.

I suppose that, strictly speaking, the disk formatter should be viewed as manufacturing nodes that will be trusted by the kernel, and as such the formatter is part of the trusted computing base. Once created by the disk formatter, however, a KeyKOS node never goes away; it simply changes hands from one client to the next, being carefully rescinded at each transfer.
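One way to picture "never goes away, only changes hands" is a fixed pool of nodes created once by a trusted formatter, with each transfer bumping a per-node version so that every key issued earlier stops working. The version-counter mechanism and the clearing of slots on transfer are assumptions made for illustration; the message does not say how rescission is implemented, and all names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    slots: list           # capability slots; only the kernel touches these
    version: int = 0      # assumption: bumped on every transfer to rescind old keys


@dataclass
class NodeKey:
    node: Node
    version: int

    def valid(self) -> bool:
        # A key works only if it was issued for the node's current holder.
        return self.version == self.node.version


def format_disk(n_nodes: int, slots_per_node: int = 16) -> list:
    """Trusted formatter: the only place nodes are ever created."""
    return [Node(slots=[None] * slots_per_node) for _ in range(n_nodes)]


def transfer(node: Node) -> NodeKey:
    """Hand an existing node to a new holder, rescinding all prior keys."""
    node.version += 1
    node.slots = [None] * len(node.slots)   # assumption: new holder sees a clean node
    return NodeKey(node, node.version)
```

The pool never grows after formatting: `transfer` recycles a node and invalidates the previous holder's key instead of allocating anything new.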

>This is definitely a necessary abstraction in a system like KeyKOS, but why should it be limited to kernel data, when many servers probably need a similar abstraction? For example, I'd like to be able to write an untrusted user-level server that doesn't trust its clients, but that holds its private data in storage "provided" in some way by those clients... The server can't just accept page keys directly from its clients, because it doesn't know the security properties of those pages.

It isn't limited to kernel data, though the cases are not exactly parallel. Your server can in fact trust the client to supply trustworthy pages, because given a page key, one can rescind all other outstanding authorities to the page. This guarantees that the page is safe, provided that you were actually handed a page key. The problem is thus reduced to verifying that you in fact hold a page key, which can be done using a kernel primitive service (DISCRIM) or by plugging the key into a segment and faulting on it. (KeyKOS doesn't virtualize pages well, which is a hole in the system design. Norm and I spent a while trying to resolve this, but didn't come up with any terribly wonderful ideas.)
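The server-side check can be sketched as follows. `discrim` is a hypothetical stand-in for the DISCRIM primitive (its real interface is not described here), and rescission is again modeled as a per-page version bump, which is an assumption. The order matters: first confirm the key really is a page key, then rescind everyone else's access.

```python
from dataclasses import dataclass


@dataclass
class Page:
    data: bytearray
    version: int = 0      # assumption: bumping this rescinds all older keys


@dataclass
class Key:
    kind: str             # "page", "node", "start", ... (illustrative kinds)
    obj: object
    version: int = 0

    def valid(self) -> bool:
        # Only meaningful for page keys in this sketch.
        return self.obj.version == self.version


def discrim(key: Key) -> str:
    """Hypothetical stand-in for DISCRIM: reports what kind of key this is
    without trusting whoever supplied it."""
    return key.kind


def accept_client_page(key: Key) -> Key:
    """Server-side check on a page offered by an untrusted client."""
    if discrim(key) != "page":
        raise ValueError("client supplied something other than a page key")
    page = key.obj
    page.version += 1                         # rescind every other outstanding key
    return Key("page", page, page.version)    # the server now holds the only live key
```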

Alternatively, the client can hand the server a key to a space bank from which the server can buy pages. The server can verify that the space bank is genuine by checking its brand, which shows whether it was created by the space bank creator. Since the space bank creator is trusted, the problem is resolved.
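A sketch of the brand check, using hypothetical `SpaceBank` and `SpaceBankCreator` classes. In KeyKOS the brand is protected by the capability system itself; Python offers no such protection, so here the brand is simply a random token known only to the creator, and the server verifies a bank by asking the trusted creator rather than by trusting the bank's own claims.

```python
import secrets


class SpaceBank:
    def __init__(self, brand: bytes, pages: int):
        self._brand = brand           # stands in for a kernel-protected brand
        self.pages_left = pages

    def buy_page(self) -> bytearray:
        if self.pages_left == 0:
            raise MemoryError("space bank exhausted")
        self.pages_left -= 1
        return bytearray(4096)        # fresh, zeroed page


class SpaceBankCreator:
    """Trusted: the only source of genuine banks."""

    def __init__(self):
        self._brand = secrets.token_bytes(16)

    def create_bank(self, pages: int) -> SpaceBank:
        return SpaceBank(self._brand, pages)

    def is_genuine(self, bank) -> bool:
        # Ask the trusted creator whether it made this object.
        return isinstance(bank, SpaceBank) and secrets.compare_digest(bank._brand, self._brand)
```

A server handed an unknown bank would call `creator.is_genuine(bank)` before buying any pages from it.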

Actually, the space bank retains sufficient authority to misallocate pages, but it is part of the TCB, and it honors the conceptual contract of having sold you the page.

Nodes are allocated through a space bank as well, but the kernel is the only entity trusted to manipulate the data within the nodes for security reasons. The kernel is simultaneously the creator and the sole client of the actual capability data.

>For example, if kernel data is replicated 3x, you might want 3x or 2x replication for some user-level objects, and none for others. Does KeyKOS support this?

Yes. Storage of nodes and pages is completely symmetrical, and can be configured to whatever degree you wish in both cases.

>The KeyKOS user-level process model (and, in general, the traditional model) is that servers directly "own" all the resources they use, even the resources they use on behalf of their clients... The alternate, less conventional model is that of clients "supplying" the necessary resources for the servers to do their jobs: the server runs on the client's CPU time, charges the client for server memory allocated on its behalf, etc.

Consider a pernicious client process that has created a meter with no keeper and allocated 5 ticks of run time to that meter. The client calls a shared service, which runs on the client's meter. The quantum runs out midway through the service, and the client perniciously refuses to refill the meter. Result: denial of service to every other client of the shared service.

Shared services cannot in general trust client-provided resources unless, as in the case of page keys, they have some mechanism by which to ensure safety.
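The meter scenario can be modeled in a few lines. This is a hypothetical sketch: `Meter`, `SharedService`, and the ten-step locked update are invented for illustration. The service holds a lock across a multi-step update, so a meter that runs dry midway leaves the lock held, and every later client sees "busy".

```python
class Meter:
    def __init__(self, ticks: int, keeper=None):
        self.ticks = ticks
        self.keeper = keeper      # None: nobody will ever refill this meter

    def charge(self) -> bool:
        """Consume one tick; give the keeper a chance to refill when empty."""
        if self.ticks == 0 and self.keeper is not None:
            self.keeper(self)
        if self.ticks == 0:
            return False
        self.ticks -= 1
        return True


class SharedService:
    """A service shared by many clients that holds a lock across a
    multi-step update and runs on the caller's meter (hypothetical)."""
    STEPS = 10

    def __init__(self):
        self.locked_by = None

    def call(self, client: str, meter: Meter) -> str:
        if self.locked_by is not None:
            return "busy"            # every other client is now stuck too
        self.locked_by = client
        for _ in range(self.STEPS):
            if not meter.charge():
                return "stalled"     # lock still held: denial of service
        self.locked_by = None
        return "done"
```

Whether the call completes is entirely up to the client, which controls both the tick count and the keeper; that is why the service cannot trust the meter.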

>The interesting thing about KeyKOS in this regard is that it is schizophrenic. It supports the first model for relationships among user-level programs, but the KeyKOS kernel itself, considered as a "server" whose clients are all the user-level processes in the system, is designed very strongly around the second model.

If you look at the kernel services carefully, you discover that the kernel performs certain well defined manipulations on user-provided resources, but does not make use of these resources itself. The KeyKOS kernel is pretty much stateless.

In one small respect this is not true. The kernel conceptually runs on no meter. In practice, it runs on the client's meter, and takes advantage of its privileged status to "stretch" the quantum enough to complete the requested kernel operation.
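A minimal sketch of the "stretch" behavior, assuming the meter is checked only on entry to a kernel operation and that any overrun is simply forgiven; the message doesn't say whether the excess is forgiven or carried as debt. `Meter`, `MeterExhausted`, and `kernel_invoke` are hypothetical names.

```python
from dataclasses import dataclass


class MeterExhausted(Exception):
    """Raised instead of starting an operation; a real system would invoke the meter's keeper."""


@dataclass
class Meter:
    ticks: int


def kernel_invoke(meter: Meter, op_cost: int, op):
    # Checked once, on entry: an exhausted meter never starts a kernel operation.
    if meter.ticks <= 0:
        raise MeterExhausted()
    # Once started, the operation always runs to completion, even if op_cost
    # exceeds the ticks remaining, so kernel state is never left half-updated.
    result = op()
    # Assumption: the overrun is forgiven (clamped at zero), not carried as debt.
    meter.ticks = max(0, meter.ticks - op_cost)
    return result
```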

Jonathan