[cap-talk] Styles of persistence

Bill Frantz frantz at pwpconsult.com
Fri Apr 4 20:46:36 CDT 2008


sam at samason.me.uk (Sam Mason) on Friday, April 4, 2008 wrote:

>On Thu, Apr 03, 2008 at 05:11:48PM -0700, Bill Frantz wrote:
>> shap at eros-os.com (Jonathan S. Shapiro) on Monday, March 31, 2008 wrote:
>> 
>> >In most cases, the best way to upgrade an object is to serialize its
>> >state, instantiate the new version, and have the new version read in the
>> >old state (possibly using an intermediate, transient format converter to
>> >separate out the conversion logic).
>> >
>> >To the extent that this is true, the only objects requiring a complex
>> >upgrade protocol are objects that require on-line update.
>> 
>> Perhaps an interesting use case is the
>> KeyKOS/EROS/CapROS/Coyotos(?) space bank. Instances of the space
>> bank have life-times from minutes to the life of the system.
>> Serialization is not used to move it from system to system. :-) It
>> has a significant internal data structure to keep track of
>> in-use/free, and the objects to be rescinded when a space bank is
>> zapped. And, as allocation strategies change, it is likely to need
>> to be changed.
>> 
>> It would be useful to get the upgrade path for the space bank right
>> in the first release.
>
>How did KeyKOS/EROS handle code upgrades/bug fixing?  With system
>lifetimes of years I'd expect it would be handled and the problem was
>solved somehow, but I'm not sure how.  Or do you just sidestep the
>problem by making the spacebank so basic that it functions more like
>mmap (or whatever malloc uses these days) in unix systems?

Let me start by saying how I think they should be handled, then
I'll relate our experiences, which led me to this conclusion. I
hope Jonathan will agree with these recommendations.

In order of decreasing desirability:

1. Avoid long-lived objects. For example, instead of keeping an
instance of a reusable compiler, build a new instance of a use-once
compiler for each compile. The need to upgrade a short-lived object
is much less acute than a long-lived one. Each time you start with
a new instance, you can start, relatively painlessly, with a new
version.

2. Build long-lived objects that can serialize their state, load a
new version, and have that new version reload the serialized state.
This approach is suitable if the delay introduced by the
serialize-reload cycle is acceptable. There are two submodes in this
category:

2a. Build the new version in a new object and discard the old
version. All capabilities to the old version become invalid, and
users must be rebound to the new version (somehow). This approach
is desirable because it assumes the least about the relationship
between the internal structure of the two versions.

2b. Build the new version on the ashes of the old version so old
capabilities remain valid.

3. Design, from the beginning, for the object to be able to upgrade
itself. This approach seems to be the only viable one for the space
bank.
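To make the difference between 2a and 2b concrete, here is a minimal Python sketch. It is not KeyKOS code: the `CounterV1`/`CounterV2` classes and their JSON state are hypothetical, and an ordinary Python reference stands in for a capability. The 2a path returns a new object, so every holder of the old reference must be rebound; the 2b path reloads the state into the same object, so existing references stay valid.

```python
import json

class CounterV1:
    """Hypothetical long-lived object, version 1."""
    def __init__(self):
        self.count = 0

    def serialize(self) -> str:
        return json.dumps({"version": 1, "count": self.count})

class CounterV2:
    """Hypothetical version 2: also tracks resets."""
    def __init__(self):
        self.count = 0
        self.resets = 0

    def load(self, blob: str) -> None:
        state = json.loads(blob)
        self.count = state["count"]
        self.resets = state.get("resets", 0)  # field absent from v1 state

def upgrade_new_object(old: CounterV1) -> CounterV2:
    """Option 2a: a fresh object; references to `old` must be rebound."""
    new = CounterV2()
    new.load(old.serialize())
    return new

def upgrade_in_place(obj: CounterV1) -> None:
    """Option 2b: rebuild on the ashes of the old object, keeping its identity."""
    blob = obj.serialize()
    obj.__dict__.clear()        # discard the old representation
    obj.__class__ = CounterV2   # swap in the new version's code
    CounterV2.__init__(obj)
    obj.load(blob)
```

The 2b variant leans on Python permitting `__class__` assignment between classes with compatible layouts; in a capability system the analogue is reusing the same object so its capabilities remain valid.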


I think option 3 is at least as hard as building a general-purpose
serialization system, but it hasn't been studied anywhere near as
much. Let me explore it using the space bank as an example.

The first thing to note is that there is short-term transient data.
We can avoid having to upgrade that data by only checking for
upgrades when it is no longer needed. So we check for upgrades when
the stack is (nearly) empty. For the space bank, this check is
performed either when it is about to return to a caller or when it
has just been called. The example below will assume the test is
made just after the space bank has been called.
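A minimal Python sketch of this rule, assuming a hypothetical single-threaded `Server` whose long-term state is a dict: an upgrade request may arrive at any time, but it is only recorded, and the new code is installed at the top of the next call, when no per-call transient data is live.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[Dict[str, int], int], int]

class Server:
    """Hypothetical server that only checks for upgrades just after being called."""

    def __init__(self, handler: Handler):
        self.handler = handler
        self.state: Dict[str, int] = {"calls": 0}  # long-term state
        self.pending: Optional[Handler] = None

    def request_upgrade(self, new_handler: Handler) -> None:
        # May arrive at any time; it is recorded here, not acted on.
        self.pending = new_handler

    def call(self, arg: int) -> int:
        # Quiescent point: nothing transient is in progress yet, so
        # swapping the code cannot strand a half-finished computation.
        if self.pending is not None:
            self.handler, self.pending = self.pending, None
        self.state["calls"] += 1
        return self.handler(self.state, arg)
```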

For the data which must be upgraded, the long-term state, we should
first isolate it in a separate segment from the code, stack,
transient heap etc. We can then pass that segment as a parameter to
the upgrade code when an upgrade event occurs.
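As a sketch of that separation, assume a hypothetical bank whose entire long-term state is a `bytearray` laid out with Python's `struct` module (version tag, pages in use, pages free); the layout and the `BankV1`/`BankV1Fixed` names are illustrative only. Ordinary calls keep everything else in locals, and the upgrade function is handed nothing but the segment and the new code.

```python
import struct

# Hypothetical layout: version tag, pages in use, pages free (little-endian).
LAYOUT = struct.Struct("<IQQ")

class BankV1:
    """Hypothetical bank whose only long-lived data lives in `segment`."""

    def __init__(self, segment: bytearray):
        self.segment = segment

    def allocate(self) -> int:
        # Locals are transient: they vanish when the call returns.
        version, in_use, free = LAYOUT.unpack_from(self.segment)
        if free == 0:
            raise MemoryError("bank exhausted")
        LAYOUT.pack_into(self.segment, 0, version, in_use + 1, free - 1)
        return in_use  # index of the page just handed out

class BankV1Fixed(BankV1):
    """Hypothetical bug-fix release that keeps the same segment layout."""

def upgrade(segment: bytearray, new_code: type) -> BankV1:
    # Code, stack and transient heap are rebuilt by the new version;
    # only the long-term state segment carries over.
    return new_code(segment)
```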

For the space bank, one approach would be to have a shared key slot
somewhere which is checked for an upgrade event. If the key in the
slot is not the same as the one from the last upgrade performed (EQ
again, for people keeping score), the code packages up everything
needed for the upgrade,
which might include the parameters of the most recent call, a
domain capability to the space bank domain, and the data segment,
and calls the upgrade program. The upgrade program performs any
necessary changes to the data segment, installs the new code in the
old domain (keeping old capabilities valid), and sets it running to
handle the call.
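The check-and-dispatch step can be modeled in Python roughly as follows. Everything here is a hypothetical stand-in: `UpgradeSlot` plays the shared key slot, object identity (`is`) plays EQ on keys, and a Python reference to the `Domain` object plays the domain capability. The upgrade program installs new code in the existing object, so references already handed out keep working, and then handles the call that triggered it.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Code = Callable[[Dict[str, int], int], int]

@dataclass
class UpgradeSlot:
    """Hypothetical shared slot: `token` stands in for the key stored there,
    `program` for the upgrade program that accompanies it."""
    token: object
    program: Callable[["Domain", int], int]

class Domain:
    """Hypothetical space bank domain; a reference to it acts as a capability."""

    def __init__(self, slot: UpgradeSlot, code: Code, segment: Dict[str, int]):
        self.slot = slot
        self.code = code
        self.segment = segment        # long-term state, kept apart from code
        self.last_token = slot.token  # key of the upgrade we are running

    def call(self, arg: int) -> int:
        # Test made just after being called, while no transient data is live.
        # `is not` is the analogue of EQ failing on the two keys.
        if self.slot.token is not self.last_token:
            self.last_token = self.slot.token
            return self.slot.program(self, arg)
        return self.code(self.segment, arg)

def make_upgrade_program(new_code: Code) -> Callable[[Domain, int], int]:
    def program(domain: Domain, arg: int) -> int:
        # A representation change would convert domain.segment here.
        domain.code = new_code                   # new code, same (old) domain
        return domain.code(domain.segment, arg)  # then handle the pending call
    return program
```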

In most cases, the data representation of the long-term state will
not change from version to version. This makes its upgrade trivial.
When it does change, the upgrade program must be smart enough to
allocate space for the new long-term state segment following the
internal rules of the old version of the space bank, using the old
and new long-term state. (Ugh!)
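To illustrate the awkward case, here is a minimal sketch assuming a hypothetical v1 state (sets of free and in-use page numbers, pages numbered from 0) and a hypothetical v2 state (a bitmap). The pages for the new segment are taken with the old version's own allocation rule before the conversion, so the new bitmap already records them as in use.

```python
def v1_allocate(state: dict) -> int:
    """The old version's allocation rule (hypothetical): lowest free page."""
    page = min(state["free"])
    state["free"].remove(page)
    state["in_use"].add(page)
    return page

def upgrade_state(state: dict, pages_needed: int) -> dict:
    """Convert hypothetical v1 state (sets of page numbers) to v2 (a bitmap)."""
    if state["version"] == 2:
        return state  # representation unchanged: the upgrade is trivial
    # The new segment needs storage, and only the OLD representation knows
    # which pages are free, so take them using the old version's rule first.
    segment_pages = [v1_allocate(state) for _ in range(pages_needed)]
    total = len(state["free"]) + len(state["in_use"])
    bitmap = [False] * total        # assumes pages are numbered 0..total-1
    for page in state["in_use"]:    # now includes the new segment's pages
        bitmap[page] = True
    return {"version": 2, "bitmap": bitmap, "segment_pages": segment_pages}
```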


Now to the original question, "How did KeyKOS/EROS handle code
upgrades/bug fixing?" For short-lived objects we didn't have a
problem. For long-lived objects, like the space bank, we tried to
avoid the need for upgrade. The first technique we used was to keep
the specifications for these objects simple. Simple specifications
frequently lead to simple code; simple code has fewer bugs and
needs to be upgraded less often. I don't think we ever had to
upgrade the space bank. (We made a radical change to its
architecture when we re-coded the system in C, but that was a new
system initialization on a computer with a different architecture.
It would be a useful thought experiment to design an upgrade system
which would handle that architecture change.)

The one long-lived object we actually upgraded was the clock object
<http://www.agorics.com/Library/KeyKos/Gnosis/132.html#clock>. I
can't remember why we upgraded the clock, but one of our objectives
was to gain greater understanding of the issues involved in
upgrade. Since the original clock object was not designed to assist
in upgrade, the process ended up being a bit like open-heart
surgery.

Let me digress a bit to discuss the environment. The clock object
is created by a factory
<http://www.agorics.com/Library/KeyKos/Gnosis/68.html>. The
high-level goal of the factory is to be able to create instances of
objects with assurances that they can't steal your data. To achieve
this, the factory tightly controls the out-bound channels of
communication available to the objects it creates. However, it
doesn't attempt to control the in-bound communication. So, while
the clock object's code is a read-only segment, which means the
clock object can't use it as a communication channel, nothing
prevents something outside the factory from holding a read-write
capability to the same segment.

To upgrade the clock, we used the read-write capability to the code
segment to patch the running program. We moved some of the
instructions that follow the point where it receives a new call to
an unused part of the segment. (If the whole segment had been in
use, we would have been SOL.) We also added code there that
performed the upgrade. We then put a branch to the added code just
after the point where it receives a new call, and the upgrade then
ran (wheeeeew). But the whole process was ugly beyond belief.
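The patch described above is what is now often called a detour or trampoline. The toy below shows its shape, assuming a hypothetical code segment modeled as a Python list of instruction tuples and a tiny interpreter; nothing here reflects the real KeyKOS instruction set. The instruction after the receive point is moved into unused words together with the upgrade code and a branch back, and its original slot is overwritten with a branch to the patch. Like the real thing, it fails if there is no free space.

```python
from typing import List, Optional, Tuple

# A word of the toy code segment; None marks an unused word.
Word = Optional[Tuple]

def run(code: List[Word]) -> List[str]:
    """Execute from address 0: 'emit' records a step, 'jmp' branches, 'ret' stops."""
    pc, trace = 0, []
    while True:
        word = code[pc]
        if word is None:
            raise RuntimeError(f"executed an unused word at {pc}")
        if word[0] == "emit":
            trace.append(word[1])
            pc += 1
        elif word[0] == "jmp":
            pc = word[1]
        elif word[0] == "ret":
            return trace
        else:
            raise ValueError(f"unknown instruction {word!r}")

def patch(code: List[Word], at: int, upgrade: List[Word]) -> int:
    """Detour the instruction at `at` through a patch placed in unused words.
    Returns the patch's start address."""
    body = upgrade + [code[at], ("jmp", at + 1)]
    n = len(body)
    for start in range(len(code) - n + 1):
        if all(w is None for w in code[start:start + n]):
            code[start:start + n] = body
            code[at] = ("jmp", start)  # written last, once the patch is in place
            return start
    raise MemoryError("no run of unused words large enough for the patch")
```

In this toy the patch runs on every call; it only illustrates the control flow of the detour.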

This reply has been much longer than I anticipated, and if you feel
like the youngster who, upon returning a book on penguins to the
library, said, "This book told me more about penguins than I really
wanted to know," I will fully understand.

Cheers - Bill

---------------------------------------------------------------------------
Bill Frantz        |"We used to quip that 'password' is the most common
408-356-8506       | password. Now it's 'password1.' Who said users haven't
www.periwinkle.com | learned anything about security?" -- Bruce Schneier


