[cap-talk] Secure Restart or Trusted Recovery?

Jonathan S. Shapiro shap at eros-os.com
Sat Jan 6 07:14:19 CST 2007

On Fri, 2007-01-05 at 18:27 -0800, Jed Donnelley wrote:
> >Cold start: when the machine has power turned off, load OS after machine
> >power on,
> >and
> >Warm start: when the machine has power turned on, load OS after machine
> >reset.

Yes. In legacy systems, where startup-time processing is lengthy, a warm
reboot can bypass much of the system initialization processing: daemon
startup, many device initializations, and so forth.

One way to subjectively experience the difference is to grab any recent
laptop and note the difference in power-up time between "wake from
suspend" (i.e. system image quiesced, but retained in memory) and "wake
from hibernate" (i.e. system image saved to disk, memory image lost).
Ignore the time to hibernate to disk, since a persistent system would
mostly do this in the background.

I note that Windows de-hibernate loads a lot of state that isn't needed.
A decently implemented cold start should be faster (though still not as
fast as warm start). Doing all of this behind the operating system's
back is surprisingly inefficient.

I used to believe that the warm-start case was rare and therefore not
interesting, but laptop users really do love the "suspend" mechanism.

> The above Cold/Warm distinction seems nearly vacuous to me.  I guess in
> principle the Warm start as above could recover content from memory, but
> did it?  Are the above two means to the same end?

Yes, it recovers from memory. Theoretically they achieve the same result,
but often not so in practice. Regrettably, many hardware devices cannot
be fully reset from software, and of course the driver authors take
shortcuts on the reset process in the interest of performance.

> While we're at terminology I'll mention a term that's floated around the
> supercomputer "industry", "Checkpoint/Restart".  This term is also a
> bit ambiguous.  For example, there is this:
> http://computing-dictionary.thefreedictionary.com/checkpoint%2Frestart
> that seems to apply to a whole system, though it seems to ignore
> changes to rotating storage that happen after a checkpoint.
> Here's a notion that's a bit more familiar to me:
> http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
> as I know some of the people who worked on that scheme for a time.
> Here's another per application notion of Checkpoint/Restart:
> http://publib.boulder.ibm.com/infocenter/pdthelp/v1r1/index.jsp?topic=/com.ibm.entcobol4.doc/cpchk02.htm
> I spent a bit of time looking back through this thread and some of
> the historical links that it derived from.

Thanks. These are good references, and I look forward to reading them.
Two more for you:

  The Checkpoint Mechanism in KeyKOS

  Design Evolution of the EROS Single-Level Store

> May I ask, independent of the name: "Secure Restart" or "Trusted Recovery",
> what mechanism is actually being discussed?

I haven't read the papers you cite, but I would say that in the "whole
system" checkpoint/restart case, the checkpoint/restart is providing a
secure restart mechanism, because the system is being recovered to a
previously known-good state.

However: my statement is subject to the assumption that the system
checkpoint phase stabilizes a consistent cut of the system. If it
doesn't, then the restored state cannot be considered known-good w.r.t.
the secure system state induction.
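
As a rough illustration of why the consistent cut matters, here is a
minimal Python sketch. It is not the KeyKOS or EROS checkpoint
mechanism; the two-object state, the sum-to-100 invariant, and all
function names are hypothetical. It contrasts a snapshot taken at one
logical instant with one copied object by object while the system keeps
running.

```python
import copy

# Hypothetical system state: two objects whose values must always sum to 100.
def transfer(state, amount):
    """A state change that preserves the invariant when applied atomically."""
    state["a"] -= amount
    state["b"] += amount

def consistent_checkpoint(state):
    """Snapshot every object at one logical instant (no mutation in between)."""
    return copy.deepcopy(state)

def torn_checkpoint(state, mutate):
    """Copy object by object while the system runs: NOT a consistent cut."""
    snap = {"a": state["a"]}
    mutate(state)              # the system changes between the two copies
    snap["b"] = state["b"]
    return snap

def invariant_holds(snap):
    return snap["a"] + snap["b"] == 100

live = {"a": 60, "b": 40}
good = consistent_checkpoint(live)
bad = torn_checkpoint(live, lambda s: transfer(s, 25))

print(invariant_holds(good))   # the consistent snapshot satisfies the invariant
print(invariant_holds(bad))    # the torn snapshot mixes old and new state
```

Restarting from the torn snapshot would resume from a state the running
system never actually passed through, which is why it cannot be treated
as known-good.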

> What I describe above as a "hot start" was the most ambitious.  This could
> be used after some sort of system failure (e.g. a crash, hardware fault,
> etc.) when main memory was still intact.

I have seen systems that attempt this. Given a CRC or other suitable
checksum, I see no difficulty with this, but without one I do. The difficulty is
that a crash often involves stray pointers within the kernel, and in the
absence of a redundant check it is not possible to believe that the
in-memory objects are still valid.
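
The redundant check can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not EROS code: the
`StoredObject` layout, the `seal` and `surviving_objects` helpers, and
the use of CRC-32 from the standard `zlib` module are all choices made
for the example.

```python
import zlib
from dataclasses import dataclass

@dataclass
class StoredObject:
    """Hypothetical in-memory object carrying a redundant check."""
    oid: int
    payload: bytes
    crc: int  # CRC-32 recorded when the payload was last legitimately written

def seal(oid: int, payload: bytes) -> StoredObject:
    """Record the checksum at the time of a legitimate write."""
    return StoredObject(oid, payload, zlib.crc32(payload))

def surviving_objects(memory):
    """Keep only objects whose payload still matches its recorded CRC."""
    return [obj for obj in memory if zlib.crc32(obj.payload) == obj.crc]

memory = [seal(1, b"process table"), seal(2, b"page cache entry")]

# Simulate a stray kernel pointer scribbling on object 2 before the crash.
memory[1].payload = b"page cache entrX"

recovered = surviving_objects(memory)
recovered_ids = [obj.oid for obj in recovered]
print(recovered_ids)
```

A hot start would keep only the objects that pass the check and fetch
the rest from the last checkpoint on disk.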

EROS maintains the necessary checksum, but the PC clears memory on most
reboots so no hot-boot mechanism has been attempted. Since our log is
stored linearly on disk, it's not clear that the benefit would be that
big in practice.

>   An initialization/recovery
> program would overlay where the system kernel would eventually be
> placed and it would look through memory for process state that could
> be written to disk to allow processes that were running at the time of a
> crash to be recovered and restarted after the subsequent continuation
> that had the effect of a "warm start" (system load, processes recovered
> from disk).

Why write it to disk? The goal is to retain it in memory!

