Changes in EROS Version 2.0

EROS Goes Microkernel

A number of issues, including ease of driver porting, a strong desire for embedded real-time support, and the need to build a more flexible and configurable OS platform, have led to significant design changes in what will become EROS version 2.0. This note describes the motivation, the changes that have been made, and the issues that these changes raise for developers.

Introduction

Several needs have been converging for some time in the EROS community:
In brief, meeting these goals essentially required that we move drivers from kernel space to application space.

User Space Drivers

Unfortunately, moving drivers to user space has a serious side effect: it breaks the persistence model. If the core disk drivers are not included in the kernel, then the kernel cannot rely on them for startup purposes, and therefore must not rely on reading the checkpoint area to get going.

I briefly considered retaining disk drivers in the kernel, and moving only the other drivers out. This proves to be an undesirable hybrid, because keeping "only" the disk drivers is non-trivial. A SCSI disk, for example, carries a lot of supporting code. Supporting "only" SCSI disks would mean incorporating complete SCSI bus, device, and logical unit support into the kernel. There is then the problem of exporting canonicalized interfaces for all other devices on the SCSI bus. Worse, SCSI is a hot-pluggable interface, so the kernel would need to deal with hot plug policy.

Ironically, this last point proved to be the final straw for the EROS v1.0 design. In thinking about the rising popularity of USB, FireWire(tm), wireless, and similar ad-hoc connectable bus technologies, I concluded that keeping the supporting policies out of the kernel was important, because they would need to change frequently.

For these and other reasons, drivers in EROS v2.0 will be user-space applications. The bootstrap code has been modified to preload ranges containing the driver code, and the kernel has been modified to start a single IPL process that in turn starts all other drivers. That is, the kernel no longer has any direct knowledge of disks, disk-level object management, or storage representation.

Persistence

Removing drivers -- in particular the disk drivers -- from the kernel immediately led to a second problem: how would checkpointing and persistence be implemented for those system configurations that needed them?
This raised issues about where to make the "cut" between the kernel and the user-mode object store, and how to ensure that deadlocks did not occur between the kernel and the object management code. Getting this interface appropriately defined proved tricky.

Mixed-mode Programming

One consequence of moving drivers to user-land was the need to introduce a non-persistent programming model for EROS. To avoid deadlock, the kernel must not attempt to checkpoint the state of drivers. If, for example, the disk driver state were checkpointed, the kernel could conceivably get into a corner where the disk driver data had been marked copy on write but no page frames were available. In this circumstance, progress cannot be made until the disk driver runs, and the disk driver cannot run until the copy on write completes.

Oddly enough, the deadlock problem above is probably solvable. The issue that convinced me to make drivers non-persistent was initialization. It is extremely awkward if drivers restart from partially checkpointed states, because you then cannot rely on the driver to correctly bootstrap the rest of the system.

Ultimately, the solution was to introduce a non-persistent programming mechanism. The kernel can now be compiled entirely without persistence. If the kernel compilation includes persistence support, it is now possible to have ranges of objects that are excluded from persistence and/or ageing. Objects excluded from persistence will remain resident in memory once dirtied. Objects excluded from ageing will remain in memory whether or not they are dirty.

Non-persistent regions create still further complications: If a persistent process calls a non-persistent process directly, it can be checkpointed before the non-persistent process returns. If the system then crashes, the resume capability to this process can be lost, with the end result that the calling process will never restart.
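The residency rules for excluded ranges given above can be summarized as a small predicate. This is only a sketch: the flag names and the function `may_evict` are hypothetical, and it assumes that an ordinary dirty object can still be aged out once it has been written back.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names; the note does not give the real ones. */
enum range_flags {
    RANGE_NO_PERSIST = 1u << 0,  /* range is excluded from persistence */
    RANGE_NO_AGEING  = 1u << 1   /* range is excluded from ageing */
};

/* Returns true if the ageing logic may remove the object from memory. */
bool may_evict(unsigned flags, bool dirty)
{
    /* Excluded from ageing: stays in memory whether or not it is dirty. */
    if (flags & RANGE_NO_AGEING)
        return false;

    /* Excluded from persistence: once dirtied there is nowhere to
     * write it back to, so it must stay resident. */
    if ((flags & RANGE_NO_PERSIST) && dirty)
        return false;

    /* Assumption: ordinary objects can always be aged out
     * (after write-back if they are dirty). */
    return true;
}
```

Under these assumptions, a clean object in a range excluded from persistence may still be dropped and refetched, while a dirty one is pinned.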
New mechanisms have been introduced to manage the boundary crossing between persistent and non-persistent regions.

Physical Memory Regions

User-space drivers have also dictated a need for some way to publish physical memory for use by selected applications. Without knowing the physical addresses of certain pages, drivers cannot correctly program DMA subsystems. A well-known range of OIDs has been reserved for physical pages, and a new (user-mode) physical memory allocator has been introduced to permit drivers to allocate contiguous physical memory regions. This is described in greater detail below.

Object Sources

A reasonably pleasant solution to several of these problems emerged in the introduction of object sources. In EROS version 1, we used to say that the in-memory object cache was a cache of the disk ranges as modified by the checkpoint area. In EROS version 2, we now think of the object cache as a cache of object sources. An object source provides a relatively simple interface:
While there are a few other supporting methods, these are the main ones.

How Object Sources are Managed

The kernel maintains a list of ``attached'' object sources. When searching for an object, this list of sources is consulted in order from first to last. The first source that agrees to produce (consume) the object is accepted. This list always begins with the

Other object sources implemented by the version 2 kernel are:
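As a rough illustration of the lookup order described under How Object Sources are Managed, the following C sketch walks a list of attached sources and returns the first one that agrees to supply an object. The structure layout, the names `object_source`, `source_accepts`, and `find_source`, and the idea that a source claims a contiguous OID range are all assumptions made for this example; the note does not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t oid_t;  /* assumed representation of an object identifier */

/* Hypothetical object source covering the half-open OID range [start, end). */
struct object_source {
    const char *name;
    oid_t start;
    oid_t end;
    struct object_source *next;  /* next source in attachment order */
};

/* Assumption: a source "agrees" to handle an object when the OID
 * falls inside the range it covers. */
int source_accepts(const struct object_source *s, oid_t oid)
{
    return oid >= s->start && oid < s->end;
}

/* Consult the attached sources from first to last and return the first
 * one that accepts the OID, or NULL if no source claims it. */
const struct object_source *find_source(const struct object_source *head,
                                        oid_t oid)
{
    for (const struct object_source *s = head; s != NULL; s = s->next) {
        if (source_accepts(s, oid))
            return s;
    }
    return NULL;
}
```

Because the walk stops at the first match, a source attached earlier in the list shadows any later source whose range overlaps it. In this model, that is how a preloaded range could take precedence over a disk-backed range covering the same OIDs.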
Collectively, these mechanisms provide sufficient support for embedded and persistent systems within the same kernel.

User Mode Drivers

A more detailed description of user-mode driver support is given in User Mode Drivers in EROS 2.0, but we will summarize the key supporting points here. First, the bootstrap code has been enhanced to preload object ranges that have been marked with the

In addition to the

Application and Driver Startup

Because EROS may now run in diskless environments, it is no longer possible for the kernel to consult the checkpoint area in order to determine what processes should be started at boot time. In fact, the responsibility for these restarts has been moved out of the kernel entirely.

In EROS v2.0, the idea of the "IPL Key" has been resurrected. The IPL Key is a process key stored in the boot sector of the boot disk. It is passed along from the bootstrap code to the kernel as part of the bootstrap logic. Once the kernel is initialized, it starts the process named by the IPL Key. The kernel assumes that this process should simply be set running and allowed to execute instructions. In our current tutorials, the most common use of the IPL Key is to include a copy of

The IPL Tool works in cahoots with the image builder (the mkimage program). The mkimage program builds up a linked list of nodes whose contents are process keys. Wherever mkimage encounters a

In a purely embedded system, the IPL tool will likely start all system processes. In a persistent system, the IPL tool will probably be used to start up the persistence layer, which may in turn start other processes.

Driver Rendezvous
The final enhancement in the EROS v2.0 kernel architecture is the redesign of device capabilities. In the v1.0 kernel, devices were implemented in the kernel, and a device capability was ultimately bound to a particular piece of hardware. In the v2.0 kernel, device capabilities simply reflect requests up to a responsible driver.

The reason that device capabilities exist at all is the problem of thread loss. If a persistent process is ever permitted to directly invoke a non-persistent process, and a checkpoint occurs at the wrong time, the persistent process might never awaken. To resolve this, we have introduced in-kernel device rendezvous objects. Device capabilities now name these rendezvous objects.

Rendezvous objects provide a reflection service. The driver calls the provider half of the rendezvous object with a

Revisions to Checkpoint Mechanism

In EROS v2, checkpointing can be entirely compiled out. This may be appropriate for two types of applications:
If checkpointing is included in the kernel build, the division of labor changes a bit relative to the earlier kernel version:
A new capability will be introduced that will allow a user-level object source to read and write objects into the object cache. This is a security-critical interface, because it essentially enables an application to convert data (disk frames) into capabilities by creating node objects from raw bits. A fundamental assumption is being made here, which is that the wielder of the object source capability is trusted in the same way that a disk driver would be.

Open Issue

There is an open issue in this design: placing the checkpoint directory in user-level code may introduce an unacceptable performance delay, because the "fault on first modify" that is performed to verify that an object has enough space in the checkpoint area must now be reflected to user mode. This is potentially a bottleneck -- especially when restarting the system following a snapshot.

For the moment, I'm going to try things as-is without doing any optimization against this potential bottleneck. If this proves to be a problem, the interface to the application-level object source can easily be redefined to move the responsibility for the checkpoint accounting back into the kernel. Note that the kernel does not need to know where the objects are going. It only needs to know the bound on the number of allowable dirty objects at any given time.

Kernel Malloc

The move to user drivers carries with it the long-avoided adoption of a dynamic memory allocator for the kernel. This is described in the Kernel Malloc design note.

Copyright 2001 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License.