The Checkpoint Mechanism

 
 

The checkpoint code is the most complex code in the EROS kernel (perhaps the duplexed I/O code comes close, but that is going away soon). It has undergone the greatest number of revisions and has been the source of the most frustrating (and difficult to locate) bugs.

In this note, I'm documenting the final design for the checkpoint area, along with some explanatory notes.

1. Background

The EROS checkpoint design differs significantly from the KeyKOS design, but many of the ideas of the KeyKOS design still pertain. The KeyKOS design is described in The Checkpoint Mechanism in KeyKOS. The differences between the two systems can be summarized as follows:

  • KeyKOS uses two checkpoint areas of equal size (what Landau calls the checkpoint area and the working area). It fills the working area up, declares a checkpoint, swaps the checkpoint and the working areas, and then drains (migrates) from what is now the checkpoint area.

    EROS uses a single checkpoint area in a circular fashion, and divides this area dynamically into several generations. A generation is permitted to occupy up to a fixed percentage of the checkpoint area (typically 65%) before a checkpoint is declared. A new generation is then started while the checkpointed generation is drained. The checkpointed generation then becomes the migrating generation. This devolves to the KeyKOS behavior under normal conditions, but provides greater flexibility in the face of bursts of object modifications. It also allows the checkpoint area to grow more easily.

  • KeyKOS never reuses space in the checkpoint area; if an object is modified a second time it gets a new location. The purpose of this is to avoid seeks and head settling delays where possible.

    EROS does reuse space in the checkpoint area, but uses an allocation strategy that is designed to maximize sequential allocation. Our rationale is that disk tracks are larger than they were a decade ago, so we don't take as many seeks/settles.

  • Once KeyKOS begins to reuse a checkpoint area, all memory of that area's prior content is discarded.

    EROS preserves a record of object locations in the checkpoint area until their storage is actually reused. It should therefore perform fewer seeks to areas outside the checkpoint area.

  • KeyKOS has no equivalent to the EROS reserve table.

The following sections describe how the EROS checkpoint area is managed in detail.

2. On-Disk Checkpoint Area

The EROS checkpoint area is made up of ranges of disk page frames that are sequentially numbered beginning at zero. These are referred to as log ranges. Just as page frames in object ranges are referred to with oids, frames in the log are referred to with lids. An lid is constructed by taking the page frame number and concatenating an 8-bit object offset.

Frames zero and one hold the checkpoint headers of the two most recently stabilized checkpoints. Either header A or header B may represent the most recently stabilized checkpoint. The most recently written header can be determined by comparing the sequence numbers in the header pages.

     Frame 0    1    2                                        N
    +-------+-------+----------------------------------------+
    | hdr A | hdr B | log page frames, dynamically allocated |
    |       |       | to various checkpoint generations      |
    +-------+-------+----------------------------------------+
     LID 0x0  0x100  0x200  ...                            N<<8
    	      

Log ranges may reside on multiple disks, but the present implementation requires that sequential numbering be preserved. This presents some difficulties for checkpoint area rearrangement that we will need in due course to solve with suitable user-level applications.

A checkpoint that has been completely written to the disk is said to be stabilized. A stabilized checkpoint has an associated on-disk directory that describes where all of the objects in that checkpoint reside. This information is stored in a two-level directory structure.

2.1 The Checkpoint Header

The root of each directory is the checkpoint header (either A or B). The checkpoint header page contains begins with a header structure describing the checkpoint, followed by the lid values of each second-level header in the checkpoint area:

    +-----------------+
    |     Header      |
    +-----------------+
    | reserve dir lid |
    |      ....       |
    | reserve dir lid |
    +-----------------+
    | thread dir lid  |
    |      ....       |
    | thread dir lid  |
    +-----------------+
    | object dir lid  |
    |      ....       |
    | object dir lid  |
    +-----------------+
    	      

The header structure is:

    struct DiskCheckpointHdr {
      uint64_t   sequenceNumber;
      bool       hasMigrated;
      LogLoc     maxLogLoc;
    
      uint32_t   nDirPage;
      uint32_t   nThreadPage;
      uint32_t   nRsrvPage;
    };
    	      

Each of the fields is explained below.

sequenceNumber

The sequence number allows EROS to determine which checkpoint is most recent. When restarting, EROS first loads the checkpoint headers at lid=0x0 and lid=0x100. If a drive failure occurred while writing a header (detected by bad CRC while rereading the corresponding page frame), that header is discarded. Of the valid headers, EROS then chooses the header with the largest sequence number as the most recent checkpoint.

Note that on recovery EROS does not attempt to reconstruct all of the checkpoint generations that may be present in the checkpoint area. It loads only the most recent one.

hasMigrated

The ``has migrated'' field indicates whether migration completed for a given checkpoint generation. If migration has completed, all of the objects in that generation have been successfully copied to their home locations. If not, migration must be restarted.

Once migration has completed, EROS is free to begin reusing the object storage associated with a checkpoint area. The reserve and thread directories remain valid, but the object directory should be considered invalid.

The migrator should perhaps reset the nDirPage field to zero under these conditions.

maxLogLoc

The maxLogLoc field allows the restart code to verify that it has recovered all of the checkpoint area. maxLogLoc contains the number of log frames that were known to the system at the time of the last checkpoint. If the restart process has not managed to build a complete checkpoint area of at least this number of frames, the system will not restart.

nRsrvPage, nThreadPage, nDirPage

The nRsrvPage, nThreadPage, and nDirPage give (respectively) the number of lid entries of each type in the remainder of the checkpoint header.

2.2 The Reserve Directory

Each ``reserve directory lid'' in the checkpoint header names the location of a reserve directory page. Every reserve directory page consists of a single word describing the number of following entries, and then some number of reserve descriptors:

    struct CpuReserveInfo {
      uint32_t index;
      uint64_t period;
      uint64_t duration;
      uint64_t quanta;
      uint64_t start;
      int rsrvPrio;
      int normPrio;
    };
    	      

The reserve entries from the most current valid checkpoint are reloaded as part of the system startup process.

2.3 The Thread Directory

Each ``thread directory lid'' in the checkpoint header names the location of a thread directory page. A thread directory page consists of a single word describing the number of following entries, and then some number of thread descriptors:

    struct ThreadDirent {
      OID      oid;
      ObCount  allocCount;
      uint16_t schedNdx;
    };
    	      

An entry in the thread directory corresponds to a thread that was running at the time the checkpoint was taken. The thread entries from the most current valid checkpoint are reloaded as part of the system startup process.

The oid field identifies the process root node of the process currently occupied by this thread. The allocCount field must match the allocation count of the node. The allocation count must be preserved because threads can sleep for long periods of time during which the process they occupy may be rescinded.

2.4 The Object Directory

The remaining directory lids in the checkpoint header identify object directory pages. Each entry holds an object identifier, an allocation count, the type of the object, and the lid at which the object can be found.

    struct CkptDirent {
      OID       oid;
      ObCount   count;
      LogLoc    logLoc : 24;
      uint8_t   type;
    } ;
    	      

For pages, the count field contains the allocation count. For nodes, it contains the call count. Is this correct?

If the lid field of the directory is zero, then the corresponding object is either a zero-filled page or a node filled with null keys whose call count is equal to its allocation count. Which one can be determined by the value of the type field.

If the lid field is nonzero, the corresponding checkpoint frame contains either page (or capability page) data or a ``log pot.'' A log pot is simply a page-sized cluster of nodes in the checkpoint area. This is why the lid value is concatenated with an object index: the index indicates which entry in the log pot contains the relevant object.

3. Theory of Operation

The basic idea behind the EROS checkpoint design is that we will divide the information in the checkpoint area into several generations. The current generation corresponds roughly to the paging area of conventional operating systems. The checkpoint generation is the most recently declared checkpoint, which may be only partially written to the disk. The stable generation is the most recent checkpoint that is entirely written to the disk. At any given time, there is either a checkpoint generation or a stable generation. The migrated generations remain in the checkpoint area until their storage is reclaimed, but have been fully copied to their home locations.

Checkpoint Generations

Whenever an object in memory is dirtied, space is reserved for it in the current checkpoint generation. If necessary, space will be made available by throwing away objects in the oldest generation to create available storage.

Execution proceeds until the checkpoint interval (typically 5 minutes) has passed or the current generation has come to occupy more than some fixed percentage of the checkpoint area (typically 65%). When either event occurs, a checkpoint is declared. All generations move one step to the right, and the rightmost generation is discarded. Objects may have been written to the current generation due to ageing, but at the time the checkpoint is declared many of these objects will still be in memory.

Once the checkpoint is declared, the current generation becomes the checkpoint generation, and a new current generation is started. Checkpointed objects in memory are marked ``copy on write,'' and execution resumes. All of the in-memory objects associated with the current generation are then written out to the disk asynchronously by the checkpointer process.

At any given moment, there is either a checkpoint generation or a stable generation. When the last in-memory object in the checkpoint generation has been written to disk, that generation becomes the stable generation. Migration now begins.

During the migration phase, objects in the stable generation are copied from the stable generation to their home locations on the disk. Once this has completed, the generation is considered to have migrated, and the on-disk checkpoint header for this generation is rewritten to indicate that migration is done.

In principle, EROS can be configured for any number of generations greater than three. In practice, the current system is configured for exactly three.

4. The Checkpoint Map

The checkpoint area storage is interpreted simultaneously at two layers:

  • As a set of page frames, each of which is either allocated, reserved, or free.

  • As a set of object containers, each of which is one page in size. A given object container may hold a data page, a capability page, a node, or a directory entry. [They can also hold thread and reserve entries, but these are handled as special cases by the checkpoint logic.]

    If a frame is allocated, then there exists exactly one checkpoint generation associated with that frame, and there is at least one entry in the directory for that generation naming an object in that frame.

4.1 Frame Reservation

Before any object can be dirtied, space for it must be reserved in the current checkpoint. Herein lies the first source of complexity. Because there are multiple nodes (directory entries) per page, the checkpoint logic must maintain reservation information at two levels for such objects. When allocating a new node (directory entry) would overflow the current frame, a new frame must be reserved.

The catch is that we do not actually decide where to store the newly dirtied object when space for it is reserved. Instead, we simply keep a count of how many objects of each type have been reserved. The reason for this is that a long time may go by between dirtying the object and writing it out to disk, and we would therefore like to decide as late as possible the location in the log at which the object will be written.

4.2 Frame Allocation

At some later time, we actually write the dirty object to the disk. At this point we actually allocate the log frame that the object will go to. If multiple objects will fit in the frame, a count of the number of slots allocated is kept in addition to the count of the number of allocated frames.

Here we arrive at the second complication. An object can be dirtied (reserving space), written out (consuming space), and subsequently redirtied. When this occurs, there are two possible design options:

  • Rewrite the object to its previously assigned location.

  • Assign a new location to the object, and free the old location (if any).

We adopt the second policy. Since the bottom-level free frame logic simply needs to know whether a frame is in use, we have the frame allocation map keep a count of how many objects occupy a given frame. When this count goes to zero the frame can be reused.

One reason for adopting the second policy is that a great many of the objects written to the checkpoint area will prove to be zero objects. For our purposes, a zero object is any of the following:

  • A zero-filled data page.

  • A node filled with zero number capabilities.

  • A capability page filled with zero number capabilities.

The reason these occur with such frequency is that returning storage to the space bank causes that storage to be zeroed. This is not currently true, but we should make it true. The reason to do it is that objects returned to the space bank are generally dirty, and handling them this way obviates the need to rewrite now-dead data back to the checkpoint area (or, in the case of pages, to the home locations). By storing them as zero objects, the write bandwidth requirements can be reduced significantly.

The advantage to handling zero objects specially is that they do not occupy any storage in the checkpoint log. The directory entry for a zero object suffices to indicate that the object is zeroed.

4.3 Algorithms

There are two algorithms of interest in the checkpoint logic: reserving the storage for an object and writing an object to the disk.

4.3.1 Space Reservation

When an object is dirtied, the storage reservation algorithm goes as follows:

    If no directory space for the object has been reserved
        reserve in-core directory space for object
        reserve on-disk directory space for object
    Reserve space for non-zero object
    If (reserved frames - released frames) exceeds 65%
        unreserve everything
        declare checkpoint
    If this object was previously written to disk
        deallocate any previous storage
    	      

Note the check in the middle for checkpoint size limits. This is where a checkpoint will be declared for reasons of space exhaustion. Until the time comes to actually write the new object, we must assume that it will be non-zero and allocate space accordingly.

The deallocation of the on-disk object can, under one boundary condition, cause minor trouble.

Nodes go to disk in page-sized chunks called log pots. Suppose that a node is dirtied, written to the log pot that we are currently assembling (at which point the node is cleaned), and subsequently redirtied. When it is redirtied, the allocation count associated with the log pot is decremented. If it chances that this decrement causes the log pot allocation count to go to zero, the pending log pot will be incorrectly considered reclaimable.

This can only arise under conditions of total memory starvation, and yes, we found this the hard way. The solution is to simply maintain an ``extra'' allocation on the log pot while its construction is in progress.

4.3.2 Object Write

The other algorithm of major interest is the object write algorithm. The main source of interest is the handling of objects that prove to be zero objects when the time comes to write them.

    If object is zero object
        unreserve disk space for object
        update in-core checkpoint directory entry accordingly
    else
        write object to disk
    	      

4.4 Restart Actions

When EROS is restarted after a shutdown (orderly or otherwise), it first locates the most recent checkpoint header. It reloads the thread and reserve directories, and then checks to see if the checkpoint has migrated. If not, it also reloads the object directory.

While doing this, the restart logic builds an in-core map of which page frames in the checkpoint area are part of the current checkpoint, and are therefore not available for allocation. This is called the checkpoint map.

The checkpoint map maintains a count, for each page frame in the checkpoint area, of the number of objects currently stored in that frame. If a frame has no objects stored, it is considered reclaimable. The checkpoint headers and directory pages are each considered to contain one object.

The original KeyKOS design did not reclaim checkpoint pages, and therefore used a simple bitmap rather than a counted map. In EROS, we have observed that log pots containing ``hot'' nodes frequently empty themselves after being paged out, and it is beneficial to be able to reuse them.

5. The Core Generation Structure

Each checkpoint generation has an associated in-core structure. This structure maintains book-keeping information concerning the checkpoint and contains the root pointer for the in-core checkpoint directory.

    struct CoreGeneration {
      bool canReclaim;
      CoreDirent *oidTree;
      uint32_t nCoreDirent;
      uint32_t nDirent;
      uint32_t nReservedFrames;
      uint32_t nAllocatedFrames;
    
      uint32_t nDirPage;
      uint32_t nLogPot;
    
      uint32_t nZeroPage;
      uint32_t nZeroNode;
      uint32_t nPage;
      uint32_t nNode;
      uint32_t nThread;
      uint32_t nReserve;
    
      // Record of storage we have released:
      uint32_t nReleasedNode;
      
    #ifdef PARANOID_CKPT
      uint32_t nAllocDirPage;
      uint32_t nAllocPage;
      uint32_t nAllocLogPot;
    #endif
      
      LogLoc curLogPot;
      uint32_t   nNodesInLogPot;
    };
    	      

For each type of object (page, zero page, node, zero node, thread, reserve), the core generation structure records how many of these objects have been allocated in this generation. An object is allocated the first time it is written to the checkpoint area. Nodes are

The various fields are:

canReclaim

Indicates whether the objects in this checkpoint generation have been migrated, and can therefore be reclaimed.

oidTree

Root pointer to a red-black tree of directory entries for each object in this generation.

nCoreDirent

Number of entries in the core directory tree for this generation.

nCoreDirent

Number of entries in the core directory tree for this generation.

nDirent

Number of on-disk directory entries that have been reserved for this generation. Should always be equal to nCoreDirent.

nReservedFrames

Number of disk frames that have been reserved for this generation.

nAllocatedFrames

Number of disk frames that have been allocated for this generation. Should always be less than or equal to nReservedFrames.

nDirPage, nPage, nLogPot,

Number of directory pages, pages, and log pots (respectively) that have been reserved in this generation. These should sum to nReservedFrames.

    struct CoreDirent {
      CoreDirent *left;
      CoreDirent *right;
      CoreDirent *parent;
    
      OID         oid;
      ObCount     count;
      LogLoc      logLoc : 24;
      uint8_t        type;
    
    
      enum { red = 1, black = 0 };
      
      uint8_t        color;
    };
    	      

Copyright 1998 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the EROS License Agreement