faster checkpoint
Jonathan Shapiro
shap@viper.cis.upenn.edu
Mon, 17 Jul 1995 16:28:06 -0400
In a discussion last night, a thought occurred to me about checkpoint
performance. This may be the way things are actually done in KeyKOS,
but it's not quite the way the checkpoint mechanism was described to
me.
As I understand the checkpoint mechanism in KeyKOS, the system is
temporarily frozen. A "mark" pass is then made through all of the
objects, marking the dirty ones as needing to be checkpointed. During
the mark phase, all write-enable bits in PTE's are disabled, and a
record is constructed of which domains held running processes. This
phase all occurs during the temporary freeze.
The system is then unfrozen, and a sweep pass is started to actually
force these objects to disk.
Neil Chase asked me how this would scale with the size of the system
core memory. My first reaction was that you can use the in-core
checkpoint area directory to reduce the walk to be a function of the
number of *dirty* pages rather than the number of pages, provided you
have some way to find all of the PTE pages. Since he is thinking
about using large, primarily read-only data sets, this probably
addresses his real question.
As I pondered this, though, a minor revision to the checkpoint
mechanism occurred to me that seems to remove the core walk from the
mark phase. Instead of viewing the checkpoint bit in the object as a
hazard marker, view it as a one-bit generation marker. You can mutate
a page only if it is in the proper generation.
At checkpoint time, we save the proces list and zap all of the
write-enable bits on the PTE's as before, but somehow keep a chain or
a bitmap of all of the page table pages so they are easy to find. We
then change a single global variable which says whether we are in
checkpoint generation 0 or checkpoint generation 1. The mark phase is
now complete, and we allow the system to resume.
The sweep phase proceeds through all of the objects in memory as
follows: If the object is in the old checkpoint generation but not
dirty, simply flip it's checkpoint generation bit without any other
action. If the object is in the old checkpoint generation and dirty,
arrange to write the object to the swap area, and flip the bit when
the write completes.
In parallel with the sweep phase, we proceed to try and restart
processes. If the access (read OR write) is to a non-mutated object
in the old generation, we just flip the checkpoint bit (exactly as the
checkpointer would). If the access is a read access to a mutated
object in the old generation, we allow it. If it is a write access,
we attempt to xerox the object into an available frame and attempt to
schedule an I/O on the object (but we do not block on the I/O).
Failing all of that, we block until the sweep pass gets to us or some
such. [In short, the write accesses proceed essentially as they used
to.]
Basically, looking at the bit as a toggle bit instead of a hazard bit
eliminates the walk through all objects (or even all dirty objects)
during the frozen phase. The down side is that it requires the sweep
pass to touch *all* of the objects, not just the objects in the
checkpoint directory.
Jonathan