Dependency Tracking in the EROS Kernel

There are two sorts of dependencies that the EROS kernel needs to keep track of:

  • Relationships between objects and the prepared keys to those objects.

  • Relationships between the segment structure described in nodes and keys and the hardware mapping entries built from those structures.

1. Prepared Key Dependencies

In the interest of efficiency, EROS capabilities make use of a pointer swizzling technique. When a key first comes off the disk, it is said to be unprepared. An unprepared key contains an allocation, object identifier, and a few other fields:

1.1. Preparing a Key

The first time that an unprepared key is used, the key is prepared. Preparing a key involves several steps:

  1. If not already in core, the object named by the key is faulted into memory.
  2. If no object table entry exists for the object, one is allocated.
  3. The key is converted to prepared form.

A prepared key points to the object table entry, which in turn points to the object:

By using an indirection table design, the IPC path can usually copy keys without regard to whether they are valid, so no memory reference is required to anything other than the key itself in order to copy a key. This proves to give a significant improvement in IPC times.

To handle object rescind and object frame reuse, EROS reserves somewhere between 10% and 15% more object table entries than there are object frames in core.

    Design Note: It is unclear how many extra indirection table entries are needed. It may be that as low as 1% would work fine in most cases.

    An indirection table entry is allocated under the following circumstances:

    • An object is rescinded.
    • A key to a previously unreferenced object is prepared.

    The rescind operation need not be terribly efficient. The key prepare case, however, is complicated by the checkpoint mechanism.

    When a checkpoint occurs, approximately 25% of the objects in memory will prove to be dirty. Mapping entries to these objects will be rendered read-only, and the objects will be marked as hazarded by checkpoint (which prevents mutation).

    When a process then attempts to mutate one of these objects, a copy-on-write operation will occur. This causes the object table entry to be updated to point to the new version of the object, at which point execution can proceed.

    The major source of overrun, then, is due to the interval between time an old, clean, free object is replaced by a newly loaded object and the time the scavenger releases the object table entry for that old object. Contrary to what you might expect, the number of entries required is a function of the incoming object fault rate rather than the pageout rate.

    Note, however, that a complete scavenge of memory can be done every 1000 object faults or so without terribly significant overhead.

1.2. Removing an Object from Memory

When an object (a page or node) is removed from memory, there may exist prepared keys to it. The keys are invalidated by invalidating the object table entry:

A stale OT entry can be detected by the fact that the least significant bit of the pointer field is 1 (all valid pointers are word pointers). If bit 0 of this field is 1, then bit 1 indicates whether the object was removed or rescinded, and bit 2 is used by the OT scavenger.

As nodes are forced out to disk, the keys in them are deprepared. If a prepared key is found to point to a stale OT entry at deprepare time, what happens depends on whether the object was removed or rescinded:

  • If the object was removed, the key is deprepared back to it's original form using the OID value stored in the OT entry.

  • If the object was rescinded, the key is deprepared to the null key.

Both object removal and object rescind are low frequency events. Unreferenced OT entries are returned to the free OT pool by an OT scavenger.

Hazards

The removal strategy depends on being able to locate all dependencies on the object being removed. For pages, page dependency tables are maintained that are described below. For nodes, the dependencies on the node are completely embodied in the dependencies on the key slots, and a key slot dependency table is used.

1.3. Call and Allocation Counts

One design problem in single-level-store capability systems is the problem of invalidating stale capabilities when objects are allocated to new users. EROS uses a 48-bit allocation count to accomplish this. Both the object and the key contain a 48 bit number indicating the number of times the object has been reallocated. If these numbers do not match, the key is invalid.

Objects are not reallocated very often, so the allocation count value is not unduly threatened by overflow.

The EROS system guarantees that every call operation will receive at most one return. This is accomplished by keeping a count of the number of calls a domain has performed, and placing in each resume key the count associated with the corresponding call. If the resume key count does not match the domain's current call count, the resume key is invalid. Because the call count must be incremented with each invocation, call counts are threatened by overflow. We have therefore chosen a 48 bit allocation count.

A 48-bit count is large enough for the next two decades. Assuming a 32 cycle round-trip invocation time (an implausibly small number), such a counter takes 253 cycles to roll over. On a machine with a femtosecond (2-15) clock, this works out something over 8 thousand years. While it's conceivable that a single system image might run that long, we feel reasonably confident that we can scavenge the persistent store every 2 thousand years without noticable overhead.

This calculation, of course, makes the optimistic assumption that memory references will take only 1 cycle.

1.4. Alternative Designs

In KeyKOS, all prepared keys to an object are kept on a circularly linked list. If the object is removed, this circularly linked list can be walked to deprepare all of the keys. While simple and elegant, this design suffers from two limitations:

  • The list update tends to imply a cache miss per prepared key in the key invocation (IPC) path. While the key being sent is likely to be in the cache, it's neighbor in the linked list is not. This causes a cache miss on essentially every prepared key copy, and accounts for nearly 50% of the invocation time.

  • The linked list imposes 25% space overhead on all in-memory keys (whether or not they are prepared), because space for the linked list pointer must be present in each key. In addition to the issue of space, it is inconvenient for in-memory keys to be a different size than on-disk keys.

Because of these issues, EROS has abandoned the key chain approach.

2. Mapping Entry Dependencies

EROS constructs memory segments as a tree of nodes and pages. Whenever a mapping entry needs to be built (discovered when a page fault occurs), the segment tree is traversed from top to bottom to build the needed entry. At each layer in the segment tree we traverse through some slot in a segment node. Four actions may force us to invalidate existing mapping entries:

  1. A key in one of the slots we traversed may be altered. All mapping entries predicated on the old value of the key must be marked invalid.

  2. An node named by one of those keys may be removed from memory. All mapping entries dependent on that object's presence must be marked invalid.

  3. An page named by one of those keys may be removed from memory. In this case, there is at least one mapping entry that directly references the page that must be marked invalid.

  4. A page frame that is acting as a mapping table may be reclaimed, in which case page table entries that reference it must be marked invalid.
In KeyKOS, the Key chain enabled cases (2) and (3) to be reduced to case (1). KeyKOS preserved a mapping from key slot addresses to mapping entries. Before an object was removed, all Keys to that object were unprepared, which guarantees that all of the associated mapping entries were invalidated. To handle case (4), KeyKOS simply added an extra entry in the depend table based on the mapping table address.

Because it does not maintain a Key chain, EROS must adopt a different solution.

2.1. Managing Slot and Node Dependencies

In principle, key slot dependencies can be captured in three ways:

  1. By tracking dependencies between the key slots and the generated mapping entries, or

  2. By tracking dependencies between the containing node slot for each key and the mapping entries generated by that node.

  3. By tracking dependencies between the object slots (nodes and pages) pointed to by the keys and generated mapping entries.

Options (A) and (B) are equivalent if the mappings are discarded in a way that takes into account which slot was altered. Option (B) has two advantages: it slightly simplifies Node removal and is somewhat more compact. I had initially hoped to adopt this approach in EROS, but it proved much harder to code than I had expected. Basically, it's easier to consolidate entries of the form (Key *, PTE*) than it is to consolidate node entries.

Every time we traverse a key slot, we add a depend entry for that key slot which names the mapping entry we were in the process of building when the traversal occured. If a slot in that node is later altered, we invalidate all of the mappings shadowed by that slot.

On machines implementing tree-structured mapping, it is usually possible to consolidate multiple entries for the same key. If two different mapping entries in the same tree page were generated by the same key slot, we can usually assume that all of the intervening mappings were also generated by that key. There are perverse mappings under which this is not true, but the impact on these mappings is that they will get invalidated more often if they are changed. We accept this performance impact in the interest of efficiency.

Given all of this, the data structure used to capture the key slot dependency information takes up 2 words on a 32-bit architecture:

struct KeyDependEntry {
  PTE      *start;        // First entry to zap
  Word     pteCount : 12; // Number of entries
  Word     slotTag : 20   // hash of key address
};

This assumes a 1024 set depend entry cache, and takes advantage of the fact that key addresses are word aligned.

Dependencies for Hash-Structured Hardware Mapping

Machines implementing hash-structured mapping do not readily lend themselves to the entry consolidations available under tree-structured mapping schemes. Two tricks can be used to achieve a similar effect, and these tricks interact to mutual benefit.

First, hash table entries are allocated in chunks of 8 adjacent virtual pages. This gives them some locality of reference, which enables a degree of consolidation in segment nodes of smaller degree. Key slots of larger degree must be captured using a different structure:

struct NodeDependEntry {
  uva_t      startAddr; // First entry to zap
  uva_t      endAddr;   // First entry to zap
  asid_t     asid;      // Address space
  Key        *node;     // Node address
};

The range of virtual addresses is used to walk the hash structure by hand.

2.2. Managing Page and Mapping Table Dependencies

To solve parts (3) and (4) of the dependency mapping problem, we could adopt the solution that KeyKOS used for mapping table pages to handle user pages as well -- just add the address of the page frame into the depend table with a back pointer for each mapping table entry. While this would work, it requires 4 words of overhead for every page mapping (which is most of the in-core pages) where roughly 1.25 will do the job. The l2span field would always contain 0 in such cases, and the mapping table entry itself contains the address of the page frame. Given this, we instead use a smaller structure:

struct PageDependEntry {
  PTE  *entry;         // Entry to invalidate
};

Note that this structure is pretty much independent of the mapping architecture. While the mapping entry layout varies from machine to machine, the notion that there is one per page appears relatively universal.

2.3. Multiset Cache Rather than Hash Table

Rather than adopt the KeyKOS hash table design, we decided to go for a multiset cache style of table management. We expect that the size distribution will turn out to have a low standard deviation, which makes the next pointer unprofitable. In addition, Shapiro's experiences with TLB modeling strongly suggest that the outlier cases can be handled with a fully associative overflow cache.

3. Domain Root Dependencies

Nodes that act as General Registers, Key Registers, or Domain Roots have a special relationship under the domain contract. If any of the components of a domain is removed from memory, the domain must be deprepared. This problem is unrelated to the issues mentioned above, but is included here so that this note will cover all of the dependency management issues.

The domain contract problem is straightforward: the nodes in a domain are prepared and deprepared as a group, and the two auxiliary nodes contain pointers to the root node. This means that if any of the three is removed, the others can be found and deprepared at the appropriate time.


Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License