Dependency Tracking in the EROS KernelThere are two sorts of dependencies that the EROS kernel needs to keep track of:
1. Prepared Key DependenciesIn the interest of efficiency, EROS capabilities make use of a pointer swizzling technique. When a key first comes off the disk, it is said to be unprepared. An unprepared key contains an allocation, object identifier, and a few other fields:
1.1. Preparing a KeyThe first time that an unprepared key is used, the key is prepared. Preparing a key involves several steps:
A prepared key points to the object table entry, which in turn points to the object:
By using an indirection table design, the IPC path can usually copy keys without regard to whether they are valid, so no memory reference is required to anything other than the key itself in order to copy a key. This proves to give a significant improvement in IPC times. To handle object rescind and object frame reuse, EROS reserves somewhere between 10% and 15% more object table entries than there are object frames in core.
Design Note: It is unclear how many extra indirection table entries are needed. It may be that as low as 1% would work fine in most cases. An indirection table entry is allocated under the following circumstances:
The rescind operation need not be terribly efficient. The key prepare case, however, is complicated by the checkpoint mechanism. When a checkpoint occurs, approximately 25% of the objects in memory will prove to be dirty. Mapping entries to these objects will be rendered read-only, and the objects will be marked as hazarded by checkpoint (which prevents mutation). When a process then attempts to mutate one of these objects, a copy-on-write operation will occur. This causes the object table entry to be updated to point to the new version of the object, at which point execution can proceed. The major source of overrun, then, is due to the interval between time an old, clean, free object is replaced by a newly loaded object and the time the scavenger releases the object table entry for that old object. Contrary to what you might expect, the number of entries required is a function of the incoming object fault rate rather than the pageout rate. Note, however, that a complete scavenge of memory can be done every 1000 object faults or so without terribly significant overhead. 1.2. Removing an Object from MemoryWhen an object (a page or node) is removed from memory, there may exist prepared keys to it. The keys are invalidated by invalidating the object table entry:
A stale OT entry can be detected by the fact that the least significant bit of the pointer field is 1 (all valid pointers are word pointers). If bit 0 of this field is 1, then bit 1 indicates whether the object was removed or rescinded, and bit 2 is used by the OT scavenger. As nodes are forced out to disk, the keys in them are deprepared. If a prepared key is found to point to a stale OT entry at deprepare time, what happens depends on whether the object was removed or rescinded:
Both object removal and object rescind are low frequency events. Unreferenced OT entries are returned to the free OT pool by an OT scavenger. HazardsThe removal strategy depends on being able to locate all dependencies on the object being removed. For pages, page dependency tables are maintained that are described below. For nodes, the dependencies on the node are completely embodied in the dependencies on the key slots, and a key slot dependency table is used. 1.3. Call and Allocation CountsOne design problem in single-level-store capability systems is the problem of invalidating stale capabilities when objects are allocated to new users. EROS uses a 48-bit allocation count to accomplish this. Both the object and the key contain a 48 bit number indicating the number of times the object has been reallocated. If these numbers do not match, the key is invalid. Objects are not reallocated very often, so the allocation count value is not unduly threatened by overflow. The EROS system guarantees that every call operation will receive at most one return. This is accomplished by keeping a count of the number of calls a domain has performed, and placing in each resume key the count associated with the corresponding call. If the resume key count does not match the domain's current call count, the resume key is invalid. Because the call count must be incremented with each invocation, call counts are threatened by overflow. We have therefore chosen a 48 bit allocation count. A 48-bit count is large enough for the next two decades. Assuming a 32 cycle round-trip invocation time (an implausibly small number), such a counter takes 253 cycles to roll over. On a machine with a femtosecond (2-15) clock, this works out something over 8 thousand years. While it's conceivable that a single system image might run that long, we feel reasonably confident that we can scavenge the persistent store every 2 thousand years without noticable overhead. This calculation, of course, makes the optimistic assumption that memory references will take only 1 cycle. 1.4. Alternative DesignsIn KeyKOS, all prepared keys to an object are kept on a circularly linked list. If the object is removed, this circularly linked list can be walked to deprepare all of the keys. While simple and elegant, this design suffers from two limitations:
Because of these issues, EROS has abandoned the key chain approach. 2. Mapping Entry DependenciesEROS constructs memory segments as a tree of nodes and pages. Whenever a mapping entry needs to be built (discovered when a page fault occurs), the segment tree is traversed from top to bottom to build the needed entry. At each layer in the segment tree we traverse through some slot in a segment node. Four actions may force us to invalidate existing mapping entries:
Because it does not maintain a Key chain, EROS must adopt a different solution. 2.1. Managing Slot and Node DependenciesIn principle, key slot dependencies can be captured in three ways:
Options (A) and (B) are equivalent if the mappings are discarded in a way that takes into account which slot was altered. Option (B) has two advantages: it slightly simplifies Node removal and is somewhat more compact. I had initially hoped to adopt this approach in EROS, but it proved much harder to code than I had expected. Basically, it's easier to consolidate entries of the form (Key *, PTE*) than it is to consolidate node entries. Every time we traverse a key slot, we add a depend entry for that key slot which names the mapping entry we were in the process of building when the traversal occured. If a slot in that node is later altered, we invalidate all of the mappings shadowed by that slot. On machines implementing tree-structured mapping, it is usually possible to consolidate multiple entries for the same key. If two different mapping entries in the same tree page were generated by the same key slot, we can usually assume that all of the intervening mappings were also generated by that key. There are perverse mappings under which this is not true, but the impact on these mappings is that they will get invalidated more often if they are changed. We accept this performance impact in the interest of efficiency. Given all of this, the data structure used to capture the key slot dependency information takes up 2 words on a 32-bit architecture:
struct KeyDependEntry {
PTE *start; // First entry to zap
Word pteCount : 12; // Number of entries
Word slotTag : 20 // hash of key address
};
This assumes a 1024 set depend entry cache, and takes advantage of the fact that key addresses are word aligned. Dependencies for Hash-Structured Hardware MappingMachines implementing hash-structured mapping do not readily lend themselves to the entry consolidations available under tree-structured mapping schemes. Two tricks can be used to achieve a similar effect, and these tricks interact to mutual benefit. First, hash table entries are allocated in chunks of 8 adjacent virtual pages. This gives them some locality of reference, which enables a degree of consolidation in segment nodes of smaller degree. Key slots of larger degree must be captured using a different structure:
struct NodeDependEntry {
uva_t startAddr; // First entry to zap
uva_t endAddr; // First entry to zap
asid_t asid; // Address space
Key *node; // Node address
};
The range of virtual addresses is used to walk the hash structure by hand. 2.2. Managing Page and Mapping Table DependenciesTo solve parts (3) and (4) of the dependency mapping problem, we could adopt the solution that KeyKOS used for mapping table pages to handle user pages as well -- just add the address of the page frame into the depend table with a back pointer for each mapping table entry. While this would work, it requires 4 words of overhead for every page mapping (which is most of the in-core pages) where roughly 1.25 will do the job. The l2span field would always contain 0 in such cases, and the mapping table entry itself contains the address of the page frame. Given this, we instead use a smaller structure:
struct PageDependEntry {
PTE *entry; // Entry to invalidate
};
Note that this structure is pretty much independent of the mapping architecture. While the mapping entry layout varies from machine to machine, the notion that there is one per page appears relatively universal. 2.3. Multiset Cache Rather than Hash TableRather than adopt the KeyKOS hash table design, we decided to go for a multiset cache style of table management. We expect that the size distribution will turn out to have a low standard deviation, which makes the next pointer unprofitable. In addition, Shapiro's experiences with TLB modeling strongly suggest that the outlier cases can be handled with a fully associative overflow cache. 3. Domain Root DependenciesNodes that act as General Registers, Key Registers, or Domain Roots have a special relationship under the domain contract. If any of the components of a domain is removed from memory, the domain must be deprepared. This problem is unrelated to the issues mentioned above, but is included here so that this note will cover all of the dependency management issues. The domain contract problem is straightforward: the nodes in a domain are prepared and deprepared as a group, and the two auxiliary nodes contain pointers to the root node. This means that if any of the three is removed, the others can be found and deprepared at the appropriate time. Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License |