The checkpoint code is the most complex code in the EROS kernel (perhaps the duplexed I/O code comes close, but that is going away soon). It has undergone the greatest number of revisions and has been the source of the most frustrating (and difficult to locate) bugs. In this note, I'm documenting the final design for the checkpoint area, along with some explanatory notes. 1. BackgroundThe EROS checkpoint design differs significantly from the KeyKOS design, but many of the ideas of the KeyKOS design still pertain. The KeyKOS design is described in The Checkpoint Mechanism in KeyKOS. The differences between the two systems can be summarized as follows:
The following sections describe how the EROS checkpoint area is managed in detail. 2. On-Disk Checkpoint AreaThe EROS checkpoint area is made up of ranges of disk page frames that are sequentially numbered beginning at zero. These are referred to as log ranges. Just as page frames in object ranges are referred to with oids, frames in the log are referred to with lids. An lid is constructed by taking the page frame number and concatenating an 8-bit object offset. Frames zero and one hold the checkpoint headers of the two most recently stabilized checkpoints. Either header A or header B may represent the most recently stabilized checkpoint. The most recently written header can be determined by comparing the sequence numbers in the header pages. Frame 0 1 2 N +-------+-------+----------------------------------------+ | hdr A | hdr B | log page frames, dynamically allocated | | | | to various checkpoint generations | +-------+-------+----------------------------------------+ LID 0x0 0x100 0x200 ... N<<8 Log ranges may reside on multiple disks, but the present implementation requires that sequential numbering be preserved. This presents some difficulties for checkpoint area rearrangement that we will need in due course to solve with suitable user-level applications. A checkpoint that has been completely written to the disk is said to be stabilized. A stabilized checkpoint has an associated on-disk directory that describes where all of the objects in that checkpoint reside. This information is stored in a two-level directory structure. 2.1 The Checkpoint HeaderThe root of each directory is the checkpoint header (either A or B). The checkpoint header page contains begins with a header structure describing the checkpoint, followed by the lid values of each second-level header in the checkpoint area: +-----------------+ | Header | +-----------------+ | reserve dir lid | | .... | | reserve dir lid | +-----------------+ | thread dir lid | | .... | | thread dir lid | +-----------------+ | object dir lid | | .... | | object dir lid | +-----------------+ The header structure is:
struct DiskCheckpointHdr {
uint64_t sequenceNumber;
bool hasMigrated;
LogLoc maxLogLoc;
uint32_t nDirPage;
uint32_t nThreadPage;
uint32_t nRsrvPage;
};
Each of the fields is explained below.
2.2 The Reserve DirectoryEach ``reserve directory lid'' in the checkpoint header names the location of a reserve directory page. Every reserve directory page consists of a single word describing the number of following entries, and then some number of reserve descriptors:
struct CpuReserveInfo {
uint32_t index;
uint64_t period;
uint64_t duration;
uint64_t quanta;
uint64_t start;
int rsrvPrio;
int normPrio;
};
The reserve entries from the most current valid checkpoint are reloaded as part of the system startup process. 2.3 The Thread DirectoryEach ``thread directory lid'' in the checkpoint header names the location of a thread directory page. A thread directory page consists of a single word describing the number of following entries, and then some number of thread descriptors:
struct ThreadDirent {
OID oid;
ObCount allocCount;
uint16_t schedNdx;
};
An entry in the thread directory corresponds to a thread that was running at the time the checkpoint was taken. The thread entries from the most current valid checkpoint are reloaded as part of the system startup process. The oid field identifies the process root node of the process currently occupied by this thread. The allocCount field must match the allocation count of the node. The allocation count must be preserved because threads can sleep for long periods of time during which the process they occupy may be rescinded. 2.4 The Object DirectoryThe remaining directory lids in the checkpoint header identify object directory pages. Each entry holds an object identifier, an allocation count, the type of the object, and the lid at which the object can be found.
struct CkptDirent {
OID oid;
ObCount count;
LogLoc logLoc : 24;
uint8_t type;
} ;
For pages, the count field contains the allocation count. For nodes, it contains the call count. Is this correct? If the lid field of the directory is zero, then the corresponding object is either a zero-filled page or a node filled with null keys whose call count is equal to its allocation count. Which one can be determined by the value of the type field. If the lid field is nonzero, the corresponding checkpoint frame contains either page (or capability page) data or a ``log pot.'' A log pot is simply a page-sized cluster of nodes in the checkpoint area. This is why the lid value is concatenated with an object index: the index indicates which entry in the log pot contains the relevant object. 3. Theory of OperationThe basic idea behind the EROS checkpoint design is that we will divide the information in the checkpoint area into several generations. The current generation corresponds roughly to the paging area of conventional operating systems. The checkpoint generation is the most recently declared checkpoint, which may be only partially written to the disk. The stable generation is the most recent checkpoint that is entirely written to the disk. At any given time, there is either a checkpoint generation or a stable generation. The migrated generations remain in the checkpoint area until their storage is reclaimed, but have been fully copied to their home locations.
Whenever an object in memory is dirtied, space is reserved for it in the current checkpoint generation. If necessary, space will be made available by throwing away objects in the oldest generation to create available storage. Execution proceeds until the checkpoint interval (typically 5 minutes) has passed or the current generation has come to occupy more than some fixed percentage of the checkpoint area (typically 65%). When either event occurs, a checkpoint is declared. All generations move one step to the right, and the rightmost generation is discarded. Objects may have been written to the current generation due to ageing, but at the time the checkpoint is declared many of these objects will still be in memory. Once the checkpoint is declared, the current generation becomes the checkpoint generation, and a new current generation is started. Checkpointed objects in memory are marked ``copy on write,'' and execution resumes. All of the in-memory objects associated with the current generation are then written out to the disk asynchronously by the checkpointer process. At any given moment, there is either a checkpoint generation or a stable generation. When the last in-memory object in the checkpoint generation has been written to disk, that generation becomes the stable generation. Migration now begins. During the migration phase, objects in the stable generation are copied from the stable generation to their home locations on the disk. Once this has completed, the generation is considered to have migrated, and the on-disk checkpoint header for this generation is rewritten to indicate that migration is done. In principle, EROS can be configured for any number of generations greater than three. In practice, the current system is configured for exactly three. 4. The Checkpoint MapThe checkpoint area storage is interpreted simultaneously at two layers:
4.1 Frame ReservationBefore any object can be dirtied, space for it must be reserved in the current checkpoint. Herein lies the first source of complexity. Because there are multiple nodes (directory entries) per page, the checkpoint logic must maintain reservation information at two levels for such objects. When allocating a new node (directory entry) would overflow the current frame, a new frame must be reserved. The catch is that we do not actually decide where to store the newly dirtied object when space for it is reserved. Instead, we simply keep a count of how many objects of each type have been reserved. The reason for this is that a long time may go by between dirtying the object and writing it out to disk, and we would therefore like to decide as late as possible the location in the log at which the object will be written. 4.2 Frame AllocationAt some later time, we actually write the dirty object to the disk. At this point we actually allocate the log frame that the object will go to. If multiple objects will fit in the frame, a count of the number of slots allocated is kept in addition to the count of the number of allocated frames. Here we arrive at the second complication. An object can be dirtied (reserving space), written out (consuming space), and subsequently redirtied. When this occurs, there are two possible design options:
We adopt the second policy. Since the bottom-level free frame logic simply needs to know whether a frame is in use, we have the frame allocation map keep a count of how many objects occupy a given frame. When this count goes to zero the frame can be reused. One reason for adopting the second policy is that a great many of the objects written to the checkpoint area will prove to be zero objects. For our purposes, a zero object is any of the following:
The reason these occur with such frequency is that returning storage to the space bank causes that storage to be zeroed. This is not currently true, but we should make it true. The reason to do it is that objects returned to the space bank are generally dirty, and handling them this way obviates the need to rewrite now-dead data back to the checkpoint area (or, in the case of pages, to the home locations). By storing them as zero objects, the write bandwidth requirements can be reduced significantly. The advantage to handling zero objects specially is that they do not occupy any storage in the checkpoint log. The directory entry for a zero object suffices to indicate that the object is zeroed. 4.3 AlgorithmsThere are two algorithms of interest in the checkpoint logic: reserving the storage for an object and writing an object to the disk. 4.3.1 Space ReservationWhen an object is dirtied, the storage reservation algorithm goes as follows:
If no directory space for the object has been reserved
reserve in-core directory space for object
reserve on-disk directory space for object
Reserve space for non-zero object
If (reserved frames - released frames) exceeds 65%
unreserve everything
declare checkpoint
If this object was previously written to disk
deallocate any previous storage
Note the check in the middle for checkpoint size limits. This is where a checkpoint will be declared for reasons of space exhaustion. Until the time comes to actually write the new object, we must assume that it will be non-zero and allocate space accordingly. The deallocation of the on-disk object can, under one boundary condition, cause minor trouble. Nodes go to disk in page-sized chunks called log pots. Suppose that a node is dirtied, written to the log pot that we are currently assembling (at which point the node is cleaned), and subsequently redirtied. When it is redirtied, the allocation count associated with the log pot is decremented. If it chances that this decrement causes the log pot allocation count to go to zero, the pending log pot will be incorrectly considered reclaimable. This can only arise under conditions of total memory starvation, and yes, we found this the hard way. The solution is to simply maintain an ``extra'' allocation on the log pot while its construction is in progress. 4.3.2 Object WriteThe other algorithm of major interest is the object write algorithm. The main source of interest is the handling of objects that prove to be zero objects when the time comes to write them.
If object is zero object
unreserve disk space for object
update in-core checkpoint directory entry accordingly
else
write object to disk
4.4 Restart ActionsWhen EROS is restarted after a shutdown (orderly or otherwise), it first locates the most recent checkpoint header. It reloads the thread and reserve directories, and then checks to see if the checkpoint has migrated. If not, it also reloads the object directory. While doing this, the restart logic builds an in-core map of which page frames in the checkpoint area are part of the current checkpoint, and are therefore not available for allocation. This is called the checkpoint map. The checkpoint map maintains a count, for each page frame in the checkpoint area, of the number of objects currently stored in that frame. If a frame has no objects stored, it is considered reclaimable. The checkpoint headers and directory pages are each considered to contain one object. The original KeyKOS design did not reclaim checkpoint pages, and therefore used a simple bitmap rather than a counted map. In EROS, we have observed that log pots containing ``hot'' nodes frequently empty themselves after being paged out, and it is beneficial to be able to reuse them. 5. The Core Generation StructureEach checkpoint generation has an associated in-core structure. This structure maintains book-keeping information concerning the checkpoint and contains the root pointer for the in-core checkpoint directory.
struct CoreGeneration {
bool canReclaim;
CoreDirent *oidTree;
uint32_t nCoreDirent;
uint32_t nDirent;
uint32_t nReservedFrames;
uint32_t nAllocatedFrames;
uint32_t nDirPage;
uint32_t nLogPot;
uint32_t nZeroPage;
uint32_t nZeroNode;
uint32_t nPage;
uint32_t nNode;
uint32_t nThread;
uint32_t nReserve;
// Record of storage we have released:
uint32_t nReleasedNode;
#ifdef PARANOID_CKPT
uint32_t nAllocDirPage;
uint32_t nAllocPage;
uint32_t nAllocLogPot;
#endif
LogLoc curLogPot;
uint32_t nNodesInLogPot;
};
For each type of object (page, zero page, node, zero node, thread, reserve), the core generation structure records how many of these objects have been allocated in this generation. An object is allocated the first time it is written to the checkpoint area. Nodes are The various fields are:
struct CoreDirent {
CoreDirent *left;
CoreDirent *right;
CoreDirent *parent;
OID oid;
ObCount count;
LogLoc logLoc : 24;
uint8_t type;
enum { red = 1, black = 0 };
uint8_t color;
};
Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the EROS License Agreement |