Jonathan S. Shapiro
November 23, 2003
The EROS project has now proceeded far enough that we know EROS will ``work.'' We can look ahead and see the end of EROS as an academic research effort. We have, for example, internally validated the performance of a network stack designed around the notion of ``defense in depth,'' and constructed a trusted window system in a very small code base. We have also validated that many of the key design ideas in KeyKOS yield useful idioms in real practice, and that some others are problematic. I think that we have gone considerably further than KeyKOS did in some respects.
So the EROS research group is now confident that the system could be finished to the point where real users could use it. If we conclude that EROS is a fundamentally good design, we should finish it. Otherwise, we should start asking how ideas from EROS can be migrated into more conventional systems. So the question we now need to ask is: should EROS be finished?
EROS started its life as an attempt to reverse engineer the KeyKOS system. The first version (never released) was a fairly faithful reimplementation of KeyKOS. The second (the C++ version) was a rebuild in C++, faithful to the original but beginning to experiment in various areas. The most recent version is based on a microkernel design and departs significantly from the KeyKOS design.
In the discussion below, I'll refer to EROS, but with very few exceptions all of what I will say applies equally to KeyKOS.
Dennis Ritchie (below: DMR) gives an excellent talk in which he asks ``What were the fundamental ideas in UNIX?'' I think it is useful to likewise identify the fundamental ideas of EROS, because the success or failure of these ideas will determine whether EROS, in the end, is a viable operating system architecture for the real world. I also think that it is enlightening to contrast these ideas with their counterparts in UNIX:
EROS: Everything is an Object. Conceptually, all entities in a running EROS system are objects that are implemented by some server process. Kernel-implemented services attempt to preserve this conceptual illusion. From the programmer perspective (and, to a large extent, the kernel perspective) there are no unmediated bits in the system. Fundamentally, the majority of the objects in the system are ``active,'' by which I mean that they execute instructions.
UNIX: Everything is a File or a Process. The most basic underlying concept in UNIX is the file. In general, bits are not mediated, and the basic conceptual role of a program is creating, deleting, and revising files rather than interacting with objects that have behavior. Many of the key objects in the system are ``passive.''
EROS: Objects are General. An object may implement essentially any interface, and is effectively limited in its interactions only by some awkwardness arising from the strong single-threaded execution model. There is a generalized notion that objects have persistent identity (see below) and that objects act as the key services in the system.
UNIX: Everything is a Stream. DMR would say this as: ``everything is a file,'' but I am trying to distinguish here between the interface (Stream) and the storage structure (File). The fact that all objects obey a common interface facilitates connecting objects together in interesting, flexible, and creative ways without extensive prior planning. While CORBA has emerged as local connective tissue in some applications (e.g. Open Office), no generalized notion of ``objects as services'' has emerged in UNIX yet.
EROS: There is an implicitly persistent, unified memory model. Main memory has the same semantics as disk storage, and survives system shutdown/restart. As part of this, there is a global (machine-wide) consistency model that allows objects to share a common understanding of their mutual states without heavyweight transactions.
UNIX: There are two memory models: persistent and ephemeral. Main memory is ephemeral, and programs must make special provisions to get their state preserved across system shutdown. Consistency must be managed explicitly, even within a single node.
EROS: Everything is a Capability. All object references are performed via capabilities. There is no notion of ``principal,'' which makes auditing relatively difficult to do correctly.
UNIX: Persistent Objects are ACL-based, ephemeral objects are capability based. Strictly speaking, neither statement is true. The user/group/other notion isn't strictly an ACL model, and there are namespaces in UNIX, notably sockets and processes, that are not referenced by descriptors. All of these may be viewed as historical artifacts that could readily be repaired without changing the basic UNIX model (as was largely done in Plan 9).
EROS: Metadata is Exposed. The composition of address spaces and processes from nodes is explicit. This is driven by persistence and is usually value-neutral, but introduces the possibility that objects may come to be composed from mixed storage sources, which is a semantic disaster.
UNIX: Metadata is Hidden. The process control block is a primary kernel concept, and the address mapping data structure and file metadata structure representations are private to the operating system. This permits the kernel to leverage some amount of higher-level semantic knowledge in the interests of performance.
EROS: Process rendezvous is pass-through. The key IPC rendezvous mechanism is that an active caller calls an inactive recipient. This is necessitated by the persistence model but makes multithreading quite difficult to manage in certain situations.
UNIX: Process rendezvous is peer to peer. The basic communication mechanisms involve two active threads of control coming together at some queue. The transferred data may be buffered by the kernel, but it fundamentally isn't going anywhere until the recipient asks to receive it.
I'll try to explore the implications of each of these issues below.
The ``everything is an object'' notion of EROS is one of its greatest strengths. It ensures that interactions between programs are based on interfaces rather than representations, and it guarantees that data is always mediated by code. Because they are persistent (see below), objects can hold authorities that their clients do not, and this ability to generically ``wrap'' capabilities with mediating code is exceptionally powerful. It enables the security of the EROS space bank (which wraps a range capability), the constructor (which wraps a brand capability), and the virtual copy space keeper (VCSK, which wraps a sensory node capability). It also is the basis for supporting proprietary content -- an application or component can hold content that its client cannot directly access.
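As a rough illustration of the ``wrapping'' idiom, the sketch below shows a mediating object that holds an authority its clients never see. All names (`RangeCapability`, `SpaceBank`, `alloc_page`, the `limit` parameter) are hypothetical and are not the EROS interfaces. In EROS the separation is enforced by the kernel; Python can only suggest it by convention, so this shows the shape of the idea rather than its protection properties.

```python
class RangeCapability:
    """Hypothetical stand-in for raw authority over a range of storage frames."""

    def __init__(self, start, count):
        self._free = list(range(start, start + count))

    def take_frame(self):
        # Hand out the lowest-numbered free frame.
        return self._free.pop(0)


class SpaceBank:
    """Mediating object: keeps the range capability private and enforces a limit."""

    def __init__(self, range_cap, limit):
        # Name-mangled attributes: a convention only, not real protection.
        self.__range_cap = range_cap
        self.__limit = limit
        self.__allocated = 0

    def alloc_page(self):
        # Clients can allocate only through this mediated entry point.
        if self.__allocated >= self.__limit:
            raise PermissionError("space bank limit reached")
        self.__allocated += 1
        return self.__range_cap.take_frame()
```

A client holding only the `SpaceBank` can obtain pages up to its limit but has no path to the underlying range, which is the property the paragraph above attributes to wrapped capabilities.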
The ``everything is an object'' notion also carries some hidden costs:
Because objects are defined by their behavior, they are inherently non-portable. While it is possible to arrange a proxy that enables remote access to many objects (though not all), the bits that make up the object representation cannot be migrated across processor architectures in general. At the most primitive level, two machines that do not agree about page size must take extreme measures to share data access across a machine boundary.
Another way to look at this is that the EROS design takes a polarized position on the ``function shipping'' versus ``data shipping'' debate: remote access can in general be obtained only by shipping requests to the host containing the object in question.
To be sure, this problem can be solved in application level code much as is done in UNIX. The point is that if you need to solve it this way, you end up losing the ``everything is an object'' discipline that makes the EROS design attractive. If so, then it is appropriate to ask whether this was the right primitive notion.
It is not straightforward to ``copy'' an object unless the object is running, and in general an object may not have sufficient access rights to its own metadata to perform a faithful copy, because it may not have access to the capabilities that comprise the metadata of the object. It is sometimes difficult to understand where copying should stop.
As an example, consider the problem of restoring a ``file'' object from a backup image. Imagine that we resurrect the backup image of last week in a parallel universe, and that we find some means to locate the desired object and ask it to copy itself into storage from the currently running system image. The object we wish to copy has many dependencies: on its address space (therefore keeper, space bank), on its keeper (and that address space and space bank), on its constructor (and therefore process creator), on metaconstructor, and so forth. There is no obvious perimeter that can be drawn around the object dependency graph.
For simple objects, this problem is readily resolved: if the object provides a mechanism by which it can serialize itself into a data representation, we can imagine using serialize and deserialize operations to copy the object into the currently running image. We can probably arrange enough mutual authentication between the old object version and the new object version to avoid inadvertent disclosure of the object's representation bits. However, copying authorities forward across time is tricky at best. The serialization approach can be generalized up to a point -- we could define a recoding discipline to describe the object identities and relationships of the old object -- but at some point this discipline will become unwieldy and impractical.
The consequence is an object model in which some objects are recoverable from backup and others are not. This seems very hard to explain sensibly to a user. A corollary of the object graph dependency non-boundary is that upgrade is very difficult to think about.
An EROS capability conforms to an interface, but security and implementation dependencies are more often on implementation than on interface. When two objects have a trust relationship, this relationship is a relationship between implementations rather than a dependency on an interface. In practice, special handling is needed to manage ``arm's length'' trust relationships. The issue is somewhat complicated by thread-migrating IPC because of issues of ``thread capture'' (see Vulnerabilities in Synchronous IPC Designs).
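The serialize-and-deserialize copy discussed above for simple objects can be sketched as follows. This is a minimal illustration under stated assumptions: `Counter`, its `logger` authority, and the `needs` field are all invented for the example. The point it tries to show is that representation bits travel in the serialized form, while authorities do not; the restoring image has to decide afresh what to grant.

```python
import json


class Counter:
    """Hypothetical simple object: plain data plus one held authority."""

    def __init__(self, count, logger):
        self.count = count    # representation bits
        self.logger = logger  # an authority (capability), not data

    def serialize(self):
        # Only data is written out; the authority is recorded by role name.
        return json.dumps({"count": self.count, "needs": ["logger"]})

    @classmethod
    def deserialize(cls, blob, authorities):
        state = json.loads(blob)
        # The restoring image supplies its own authority for each role.
        granted = {role: authorities[role] for role in state["needs"]}
        return cls(state["count"], granted["logger"])
```

Even in this toy, the data survives the round trip but the old `logger` does not, which is one way to see why the approach becomes unwieldy as the dependency graph grows.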
In order to decide whether the ``everything is an object'' notion is worth the challenges that it creates, we have to talk about persistence, so onwards.
EROS objects can implement essentially arbitrary interfaces. This effectively addresses one of the basic problems (and strengths) in the UNIX design: a UNIX file is just a big bag of bytes (see Eric Raymond's The Art of UNIX Programming for further discussion on this). It is instructive that the UNIX file model has had to stretch significantly over the years, and that there are now a great many types of files in the UNIX system. On the other hand, UNIX has stretched very well. While CORBA has emerged as local connective tissue in some recent applications, no generalized notion of ``objects as services'' has emerged in UNIX yet.
In contrast, UNIX objects work very hard to all implement a common signature: that of streams. The basic interaction between a process and the objects it manipulates is performed via the read and write system calls. This, combined with the strong UNIX convention for using text files, provided UNIX with a great wealth of tool composability. The questions that we need to ask are:
Given that this model change imposes significant mental overhead on existing programmers, is the generality of the EROS object model sufficiently well motivated to give up this general composability?
If there is to be a singleton basic interface (or any small fixed number of them), is stream the right interface, or should we consider something else?
The Plan 9 system has generalized the UNIX notion of common interfaces from one to two: objects can implement either the stream interface or the namespace (9p) interface, or possibly both. This is surprisingly powerful, because it supports much of the differentiated interface ability of capability systems (alternate interface names can be published within per-object directory trees), and can also be used as a basis for interface polymorphism by overlaying RPC operations on top of the basic stream interface once a connection between client process and server has been established.
One thing that Plan 9 (and UNIX) lose in this model is a certain amount of ``objectness'' (for lack of a better word). The need to explicitly perform ``open'' operations to establish sessions to objects tends (in my mind, at least) to interfere with the notion that objects are ``just there to be invoked.'' However, the intrusiveness of this depends greatly on the longevity of the objects in question. If the object relationships are long lived, then the need to open (or clone, or instantiate) an object doesn't intrude that much on the mental programming model. If the object is shorter lived, then the open call begins to invade the mental model.
Another issue with sessions is that they don't naturally lead to a remote procedure call mindset. RPC mechanisms built on top of streams tend to be horribly low performance, and this tends to destroy the object interaction model. I would argue that the pivotal difference between client/server systems and encapsulated object systems lies primarily in the programmer's expectations about RPC latency. This expectation heavily influences the least unit of work that the programmer believes can be reasonably implemented in an out of process service.
Persistence is arguably the most distinctive feature of EROS. A partial list of the arguments for persistence follows:
Claim: Persistence eliminates the need for serialization code in most applications.
This appears to be much less true than I originally believed. While there are certainly EROS components that do not need the ability to specifically serialize their data, applications in general require the ability to serialize their state for purposes of interchange (file export) and for network communication.
Norm Hardy has offered the record collection object as a counterexample, but I don't find this counterexample compelling. The record collection may not know how to serialize to bytes, but it certainly requires an iterator method that allows its client to do so. This moves the functionality around, but it doesn't eliminate it.
Claim: implicit persistence is the key enabler to protected per-instance authority.
This appears to be true. I can imagine alternative implementation strategies for achieving per-instance authority management without implicit persistence, but all of them carry truly horrendous amounts of overhead and would be practically unwieldy. Taken overall, if you decide that you need instance persistence, I believe that the current EROS strategy is the ``right'' way to go about it.
Claim: the persistence model justifies our statement that the state of the system over time results from correctness preserving transformations from an initially correct state.
This argument is essential to the current EROS security model. The model has some key advantages: it is simple, relatively easy to reason about, and supports higher level security policies. It may not be the only model with these properties, but the experiences that people have had while attempting to build other models have been pretty discouraging.
Implicit persistence carries with it a significant number of architectural requirements that selectively favor certain programming idioms. I will discuss this below in the discussion of thread rendezvous. Some of the idioms that are disfavored are very popular, and will impede acceptance of EROS (and portability of code): in particular, the weakness of multithreading and the inability of EROS to support anything comparable to select(). If we wish to bring up any significant number of applications within our collective lifetimes, we need to look at ways of revising the architecture that might address these limitations.
I am increasingly concerned that implicit persistence and network interactions may be at odds. The client/server model works fine in EROS -- it is basically a model whose interactions are between two nodes that have (presumptively) independent failure conditions. This model survives in EROS very nicely.
The other model is the network file system model. As far as I can tell, this model doesn't survive in EROS at all, basically because files are legacy artifacts in the context of the EROS design. Unfortunately, transparent remote object invocation does not adequately replace this model because EROS object invocations, as currently designed, do not address issues of consistency failure.
I confess that I have very mixed feelings about network file systems. On paper, they ought to be a great idea. In practice, the fact that all current network file systems are "server authenticates client" as opposed to "server authenticates client user" is a serious flaw. Pragmatically, I also find that laptops really don't get along with the network file system model -- intermittent connectivity doesn't fit the model, and it seems to be the main operating mode for people on the go. This becomes ever more true with devices like handhelds. In my view, something is going to have to replace file systems, but we're going to need to do that in a way that allows data to migrate across architectures. It's clear that this can be done on EROS, but the solution seems necessarily built on finding ways around the persistence boundary problem. If such methods can be found, they can be used in the local case too.
Everybody in the EROS community is a capabilities fan. We wouldn't be here if we weren't. Capabilities offer some major advantages, and they directly support the object-oriented idioms for protection and designation that I think we all favor.
That said, capabilities present some real problems for audit. For audit purposes, there are situations where we need to be able to answer the question ``On whose behalf was this operation performed?'' This is for purposes of recoverability and traceability, not for purposes of access control. From the standpoint of audit, the answer to the confused deputy problem is to architect such things out of the system, not to provide more effective support for confused deputies.
The key problem with capabilities is the fact that they seem to require persistence for effective use, and persistence in turn creates some idiom issues.
EROS fully exposes metadata. The use of nodes as constructive data structures is fully exposed, as is the page size. I now consider this to have been a fairly fundamental mistake, but I do not yet have an alternative. The problem is that applications are free to use these data structures in arbitrary ways, and the ability of the system to enhance performance by access prediction is badly compromised as a result. The current EROS design is not robust against changes in page size or desire to share state across architectures. Further, the design facilitates the construction of objects whose storage originates in distinct space banks, leading to partially broken objects. This misfeature interacts badly with the persistence architecture: there is no way to notify interested clients that objects are defective. This last point needs expansion.
The notification problem arises from the fact that there is no way to efficiently back-walk capabilities. If a process disappears, there is no way for its clients to discover this until their next invocation of that process. This is in contrast to a session-based design, where the close of the session can be expressed to the clients as an event. This amounts to a failure of death notices, and the failure of death notices leads to difficulty in implementing recoverability on the one hand and orderly, prompt teardown on the other. The system is consistent in the kernel's-eye view, but it is not consistent in the view of the application and there appears to be no good way to recover.
The question is: is there an alternative means of managing metadata that is sufficiently precise about resource consumption but avoids exposing the underlying implementation? There is also a question about communication, but let me deal with that in the next section.
The basic IPC mechanism in EROS consists of a running (active) process invoking an available or waiting (passive) process. While this model is simple, it has emerged as a recurring problem in the design of higher-level idioms:
It makes multithreading exceptionally difficult. While I agree with the KeyKOS team that multithreading is overused and frequently dangerous, it is too strong to implement a de facto policy that multithreading is entirely unsupported.
A passive rendezvous IPC design cannot do a good job on multiprocessors, even with the proposed RETRY mechanism. The problem in a nutshell is that the activation ends up on the wrong CPU, and that the receiving application has no way to demultiplex behavior across CPUs. This is a fundamental weakness of the EROS IPC model.
It is very difficult to manage certain kinds of communication patterns that are readily supported in the active rendezvous model (e.g. with select()). In particular, it is not possible in EROS for a single thread of control to poll multiple event sources and sinks simultaneously. In consequence, it is not possible for a service to effectively load balance its workload.
The notion of Scheduler Activations is impossible to support. Scheduler activations provide a principled way of handling event-based preemption, and have proven essential in the underpinnings of Nemesis. This notion has also been experimentally implemented in NetBSD.
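For readers less familiar with the select() pattern mentioned in the list above, here is a small sketch of what a single thread can do in the active rendezvous model. It uses Python's standard `selectors` and `socket` modules; the helper name `poll_once` and the source labels `"a"` and `"b"` are invented for the example.

```python
import selectors
import socket


def poll_once(sources, timeout=1.0):
    """Wait on several sockets at once; return {name: bytes} for the ready ones."""
    sel = selectors.DefaultSelector()
    for name, sock in sources.items():
        sel.register(sock, selectors.EVENT_READ, data=name)
    try:
        ready = sel.select(timeout)
        return {key.data: key.fileobj.recv(4096) for key, _events in ready}
    finally:
        sel.close()


# Two independent event sources; only one of them has pending work.
a_send, a_recv = socket.socketpair()
b_send, b_recv = socket.socketpair()
b_send.sendall(b"request on b")

result = poll_once({"a": a_recv, "b": b_recv})

for s in (a_send, a_recv, b_send, b_recv):
    s.close()
```

One thread learns which of several sources is ready without committing to any of them in advance. This is the capability that the passive invocation model does not offer.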
The essential reason for passive IPC in EROS is the process table. In a system based on active rendezvous, all processes are ``running'' (queued for an event, but logically running) most of the time. In a system such as EROS where processes run for a very long time, this introduces fundamental scaling pressure on the process table and the system run list.
The effective size of the run list is limited by the size of main memory, because some minimal amount of per-process state must remain in memory in order for a process to remain queued. The KeyKOS decongester design doesn't really help here; if a process is removed from its kernel queue then it isn't around to rendezvous with, and if the kernel cannot remember the association between the queue and the process, it becomes exceptionally difficult to wake the process up. What we need to scale this is pageable queues, as distinct from pageable processes. It is not clear how to build such a thing.
Actually, I think that I probably do know how to build fully pageable queues, but introducing a queue-based rendezvous mechanism into EROS is a fairly fundamental change in the communication model. If service processes were to operate by ``publishing'' request queues, then we suddenly need to ask how these request queues are named. It may be that the start key notion was a mistake.
There are three changes to EROS that I think need to be considered:
Explicit Communication Channels.
After a lot of thought, I think that the current EROS invocation model is broken. A better design would be to replace the ``start key invokes process'' notion with a ``start key names a mailbox'' notion. When a new process is created, it would not return a start key to its creator. Instead, it would create a communication rendezvous object and provide a capability to that. The rendezvous object exists independent of its serving process, and multiple processes (or threads) can serve a mailbox.
In this model, the ``reply and wait'' operation would be replaced by ``reply and poll'', where the poll operation identifies a set of mailbox objects from which the application is prepared to receive. The ``set of'' part might be a mistake -- I still need to think about this.
A corollary to this is the need for reference counting on communication channels. When the last servicing process disappears, the communication channel needs to be visibly shut down.
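The mailbox proposal above can be sketched in a few lines. This is a single-threaded toy under stated assumptions, not a design: `Mailbox`, `MailboxClosed`, `attach`, `detach`, and `poll_any` are hypothetical names, polling is non-blocking, and the reply half of ``reply and poll'' is omitted. It shows a rendezvous object that outlives any one server, a poll over a set of mailboxes, and a visible shutdown when the last server goes away.

```python
from collections import deque


class MailboxClosed(Exception):
    """Raised when a mailbox no longer has any serving process."""


class Mailbox:
    """Rendezvous object that exists independently of the processes serving it."""

    def __init__(self):
        self._queue = deque()
        self._servers = 0
        self._closed = False

    def attach(self):
        # A serving process (or thread) registers itself: reference count up.
        if self._closed:
            raise MailboxClosed("mailbox already shut down")
        self._servers += 1

    def detach(self):
        # When the last server disappears, the channel is visibly shut down.
        self._servers -= 1
        if self._servers == 0:
            self._closed = True
            self._queue.clear()

    def send(self, msg):
        if self._closed:
            raise MailboxClosed("no servers remain")
        self._queue.append(msg)

    def poll(self):
        # Non-blocking receive: a message, or None if nothing is queued.
        return self._queue.popleft() if self._queue else None


def poll_any(mailboxes):
    """Return (mailbox, message) from the first mailbox with work, else None."""
    for mb in mailboxes:
        msg = mb.poll()
        if msg is not None:
            return mb, msg
    return None
```

Here a client that sends after the last `detach` gets an immediate `MailboxClosed`, instead of discovering the loss only on its next invocation of a vanished process.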
The microkernel-based EROS system is non-persistent. It's not far enough along yet to know if this is acceptable, but it is certainly thought provoking. Abandoning persistence would make us more closely aligned with more conventional systems, and would make some network interoperation issues easier.
Norm reacts to this a bit violently. He asked (on the phone) ``How do you rebuild the state?'' to which my reply was ``Why should you rebuild all the state?''
While I really do understand the attractiveness of full recoverability, I'm not sure that I want it if the price is inability to federate machines effectively. Perhaps there is a way to have our cake and eat it too, but at the moment I do not see it. The harsh truth is that most of the state on my desktop at any given time is truly garbage, and that I'm not particularly unhappy to lose most of it in the event of a power loss.
It may be that I am stuck here on the objects vs. streams issue. Objects work very nicely as a model for local-case interaction. Streams work much better for remote interaction. It may be that persistence isn't the problem, and that the introduction of a mailbox model could be extended to address what I'm groping (blindly) for here.
The main issue with abandoning persistence is the problem of per-process authority. I'm attempting to explore the possibility that this may not be crucial. Here is the essential argument:
At a certain level of abstraction, EROS may be seen as an ACL system. While the kernel certainly doesn't do principals or ACLs, the login agent does, and the session that it restores to me may be viewed as a set of accessible objects that are authorized on a per-principal basis.
So here is a radical notion: the problem with ACLs is not that they are principal-based. The problem with ACLs is that they exist in a universally shared namespace. That is, they would be survivable as a system protection primitive if the predicate for access was:
The first is, in effect, the capability constraint. Perhaps we did not go far enough in our earlier discussion about per-process name spaces. In response to Eric Raymond's comments in that note, I observe that turning Linux into a capability-based OS and turning Linux into a persistent OS are two largely orthogonal objectives.
If we have a first-class interprocess communication primitive that permits transmission of unbound capabilities, we can build a significant non-persistent capability system.
In practice, few (if any) of the security idioms that are emerging in EROS appear to rely on per-process authority. More precisely, all of the ones that really do are singleton implementations. For the space bank, it would be sufficient to express this authority as something associated with the singleton instance of the program.
The power box idioms of E appear to be working fine without persistence. Given Plan 9-style namespaces, how far might we be able to get with just that?
Re-introduce something like a file system.
Though more in the style of Plan 9 than UNIX. If we make the name space root descriptor an explicit argument to the exec(2) call, a great deal might suddenly become very manageable. There is no question that a system designed in this way could be used badly, but it is possible to use it correctly, and it is possible to transition with it.
As part of this, I am thinking that files may want to be typed, and that certain file types may want to have ``handlers'', by which I mean that performing an ``open'' on a file implies instantiating a running copy of its mediator application and establishing an RPC connection to that. This is a very raw idea and I am still playing with it.
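To make the two ideas in this item a little more concrete, here is a toy model of both: a name space root passed explicitly to the program being started, and typed files whose ``open'' instantiates a mediator. Everything here is an assumption made for illustration. `Namespace`, `TypedFile`, `TextMediator`, and `spawn` are invented names, `spawn` merely stands in for an exec(2) that takes a root argument, and a direct method call stands in for the RPC connection.

```python
class TypedFile:
    """Hypothetical file: a type tag plus stored content."""

    def __init__(self, ftype, data):
        self.ftype = ftype
        self.data = data


class Namespace:
    """Hypothetical per-process name space: name -> Namespace or TypedFile."""

    def __init__(self, bindings=None, handlers=None):
        self.bindings = dict(bindings or {})
        self.handlers = dict(handlers or {})  # file type -> mediator factory

    def lookup(self, path):
        node = self
        for part in path.strip("/").split("/"):
            if not isinstance(node, Namespace) or part not in node.bindings:
                raise FileNotFoundError(path)
            node = node.bindings[part]
        return node

    def open(self, path):
        target = self.lookup(path)
        if isinstance(target, Namespace):
            raise IsADirectoryError(path)
        # Opening a typed file instantiates its mediator, if one is registered.
        factory = self.handlers.get(target.ftype)
        return factory(target) if factory else target.data


class TextMediator:
    """Hypothetical handler: clients see behavior, not the stored bytes."""

    def __init__(self, typed_file):
        self._file = typed_file

    def line_count(self):
        return len(self._file.data.splitlines())


def spawn(program, root):
    """Stand-in for exec(2) with an explicit root: the program can name only
    what `root` binds."""
    return program(root)
```

A program started with `spawn` has no ambient root to fall back on; a path that its name space does not bind simply does not exist for it.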
Now before Mark Miller panics, let me emphasize that I am exploring ideas here. This should not be taken to mean that I am abandoning EROS. One very likely outcome of this note is that we will find solutions to many of these concerns.