[cap-talk] Moving running applications (was: Kernels at Microsoft)
Jed Donnelley
jed at nersc.gov
Mon Nov 26 17:49:47 EST 2007
On 11/21/2007 5:18 PM, Norman Hardy wrote:
> I highly recommend the talk by Eric Traut of Microsoft that is online at
> http://endeavour.acm.uiuc.edu/UIUC-ACM-RP07-Traut.wmv
I was also favorably impressed by this talk. I can add my
recommendation. I took some notes at home that perhaps I'll
share at some future time. However, here I want to comment
on:
> I. Moving a running app to another platform is something that Keykos
> could perhaps not easily do; we had did not try to design such a scheme.
> II. VMware does this and MinWin claims they will too.
> III. The VM provides a natural cleavage point to do this.
I'm at least a little bit surprised by pretty much all
of the above. Let me address them one at a time:
I. You say that Keykos (perhaps?) could not easily move an
application from one 'platform' to another:
I believe that with a mechanism like the DCCS (which I
believe KeyKOS could support?) or MarkM's "vat" mechanism
there is a straightforward way to move a running application
from one 'platform' to another.
<details - may be skipped and come back to if needed.
Continue to the /details below.>
First let me say that what I mean by 'platform' is a
computer with the same architecture (or emulated/simulated
architecture) - including the same instruction set,
memory architecture, etc. By an 'application' I mean
a program running under a descriptor based capability
OS (with a c-list) with only a blocking "invoke" call
with RPC semantics. For example, the 'Capability
Computing System' (CCS) described in the DCCS paper:
http://www.webstart.com/jed/papers/DCCS/
or the RATS system:
http://www.webstart.com/jed/papers/RATS/RATS.pdf
that the CCS was based on (with the minor correction
noted elsewhere on this list - irrelevant here I think).
To move a running application I assume that whatever is
moving the application has access to all its state -
its memory, registers, and its c-list.
If the application is running (i.e. not waiting on
an invocation), then it seems to me the moving is
particularly easy. Stop the application, create a
new 'process' with the same memory and register set
on the new 'platform', then simply send the capabilities
in the application's c-list to the new platform (going
through the DCCS mechanism in the process) and put
them into its c-list. After that it should be possible
to start the process and have it pick up from where it
left off.
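
To make the simple case concrete, here is a minimal sketch in Python.
Everything named in it (Process, Platform, forward_capability,
move_running) is made up for illustration and is not the CCS or DCCS
interface. In particular, forward_capability only stands in for
whatever the DCCS mechanism does when a capability crosses the
network; here it just wraps the original in a proxy tuple.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical process state: memory image, registers, c-list."""
    memory: bytes
    registers: dict
    c_list: list
    running: bool = True


class Platform:
    """Hypothetical stand-in for one machine running the capability OS."""

    def __init__(self, name):
        self.name = name
        self.processes = []

    def create_process(self, memory, registers):
        # New process starts stopped, with an empty c-list.
        proc = Process(memory, dict(registers), [], running=False)
        self.processes.append(proc)
        return proc


def forward_capability(cap, src):
    """Assumed stand-in for sending a capability through a DCCS-like
    mechanism: the destination holds a proxy naming the source platform."""
    return ("proxy", src.name, cap)


def move_running(proc, src, dst):
    """Move a process that is NOT waiting on an invocation."""
    proc.running = False                                  # stop it
    new = dst.create_process(proc.memory, proc.registers)  # same memory/registers
    new.c_list = [forward_capability(c, src) for c in proc.c_list]
    src.processes.remove(proc)
    new.running = True                                    # pick up where it left off
    return new
```

Calling move_running(p, a, b) leaves the source platform without the
process and the destination with a copy whose c-list entries all
refer back through the (assumed) proxy.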
The one complication that seems to me to come in is
when the process happens to be in the middle of a
blocked invocation when the effort to move it is
initiated. In principle the process could be stopped
and the effort to move it could wait until the
invocation completed. Then the move could happen
as above. This isn't very satisfactory because
an invocation could take an arbitrary amount of
time to complete.
Dealing with the case of a blocked invocation
goes beyond what I imagined in the DCCS mechanism
(partly why the topic interests me). I see two
relevant design issues. One (A) is a matter of
authority and communication. Namely, how does
the moving application show that it has the
authority to move the point that a reply should
be delivered to, and how does it communicate to
the network emulation server to effect such a move.
The other (B) is what amounts to a race condition
regarding the state of the invocation. At the time
the move is initiated the reply for the invocation
may be already winding its way back across the
network - or not.
I believe both can be dealt with, though with
some difficulty (again why the topic interests
me). Regarding A (authority and communication
to the emulation), I believe having a capability
to the process to be moved (assumed) should suffice
for authority. The communication is a bit more
awkward. Since I'm making this up on the fly,
I suggest a capability to communicate to the
network emulation facility for this purpose.
This capability will support a "Block" request
(passing in a Request Capability) and a
"Resume" request, passing in a "Blocked"
capability. Of course the process doing
the move would have to have such a capability.
I'll assume a call on the Process that can
produce a Request Capability (duplicate to
what the network server presumably has) for
a process that is stopped in an invocation.
Regarding the race condition (B), when
a "Block" request comes into the network
emulation server, it will find the ongoing
request (no need for EQ, more about that if there is
interest) and then send a "Block" request to
the remote end. That request will either
return "Too Late" or a 'Resume' capability
that can be returned to the moving process
to enable it to set up the blocked
invocation on the destination platform.
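
A small Python sketch of how the two outcomes of "Block" might be
decided at the remote end. All names here (RemoteEnd, NetworkServer,
ResumeCapability, the request ids) are hypothetical. Two things are
my assumptions rather than anything stated above: that the remote end
holds a reply that completes after a successful "Block", and that the
network server finds the ongoing request in a plain table keyed by a
request id.

```python
TOO_LATE = "Too Late"


class ResumeCapability:
    """Hypothetical token that lets the mover re-point the reply later."""

    def __init__(self, remote):
        self.remote = remote


class RemoteEnd:
    """Hypothetical remote side of one forwarded invocation."""

    def __init__(self):
        self.reply_sent = False
        self.blocked = False
        self.held_reply = None

    def finish(self, result):
        """Called when the invoked service completes."""
        if self.blocked:
            # Assumption: hold the reply until a new reply point is set up.
            self.held_reply = result
            return None
        self.reply_sent = True
        return result  # reply goes back across the network

    def block(self):
        """Decide the race: has the reply already left or not?"""
        if self.reply_sent:
            return TOO_LATE
        self.blocked = True
        return ResumeCapability(self)


class NetworkServer:
    """Hypothetical network emulation server on the source platform."""

    def __init__(self):
        self.ongoing = {}  # request id -> RemoteEnd (lookup scheme assumed)

    def block(self, request_id):
        # Find the ongoing request and pass "Block" on to the remote end.
        return self.ongoing[request_id].block()
```

Whichever of finish() and block() reaches the remote end first
decides the outcome, so the race is settled in one place.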
The steps then to move a process are:
1. Stop the process.
2. Check to see if it is in an invocation.
If not, move it simply as above. If so:
3. Issue the "Block" request on the
network emulation server.
4. If the return is "Too Late", continue
as if the process was running when stopped.
If the return is a 'Resume' capability then:
5. Send the 'Resume' capability along with
a 'Request' capability on the new platform
to the Network Server on the new platform.
I won't go into the details for the setup
on the new platform, but I hope that the
details above suffice to indicate that it
has what it needs. (whew).
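
The five steps can be put together roughly as follows. This is only
a sketch in Python with stand-in objects: Proc, SourceNet, DestNet,
simple_move and the string-valued capabilities are all invented for
illustration, and the setup on the new platform is reduced to a
single resume() call because I have not worked out those details.

```python
TOO_LATE = "Too Late"


class Proc:
    """Hypothetical process handle held by whoever does the move."""

    def __init__(self, in_invocation):
        self.stopped = False
        self.in_invocation = in_invocation

    def stop(self):
        self.stopped = True

    def request_capability(self):
        # Assumed call that yields the Request Capability of a process
        # stopped in an invocation.
        return "request-cap"


class SourceNet:
    """Stand-in for the source network server; 'answer' is preset."""

    def __init__(self, answer):
        self.answer = answer

    def block(self, request_cap):
        return self.answer


class DestNet:
    """Stand-in for the network server on the new platform."""

    def __init__(self):
        self.resumed = []

    def resume(self, resume_cap, new_request_cap):
        self.resumed.append((resume_cap, new_request_cap))


def simple_move(proc):
    """Placeholder for the memory/register/c-list copy described earlier.
    Returns a (made-up) Request capability on the new platform."""
    return "new-request-cap"


def move_process(proc, src_net, dst_net):
    proc.stop()                                        # step 1
    if not proc.in_invocation:                         # step 2
        simple_move(proc)
        return "moved"
    answer = src_net.block(proc.request_capability())  # step 3
    if answer == TOO_LATE:                             # step 4
        simple_move(proc)
        return "moved after reply"
    new_request_cap = simple_move(proc)
    dst_net.resume(answer, new_request_cap)            # step 5
    return "moved with invocation re-pointed"
```

For example, move_process(Proc(True), SourceNet("resume-cap"), dst)
ends by handing ("resume-cap", "new-request-cap") to the destination
network server, which is the only extra work the blocked case needs.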
</details>
II. Regarding "VMware does this and MinWin
claims they will too." where 'this' is to
move running applications - I'm skeptical
of this claim. I have heard and can
understand a more limited claim. As
long as a VM can be moved to another
platform that can assume the same network
address (or addresses), then I can accept
that this can work (even then there may
be some hazards). However, if the destination
"platform" is in a different part of the
network with a necessarily different
network address (e.g. IP, LAN), then I
don't understand how this can work.
Relevant cases to consider are a server
listening on a port and a client or server
in the middle of a remote transaction.
Of comparable seriousness are any other
device issues that may be involved (e.g.
are the same devices there and supported)
and certainly any that may be in the middle
of transactions.
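
To illustrate the network address concern with a toy Python example:
a TCP peer identifies a connection by the pair of addresses and ports
at both ends, so a packet from the moved VM's new address does not
match the connection the peer already holds. The table, the lookup
function and the addresses (taken from the documentation ranges) are
all invented for illustration; this is not how any particular VM
product or network stack is implemented.

```python
def lookup(connections, packet):
    """Find the connection a packet belongs to, keyed by the 4-tuple."""
    key = (packet["src_ip"], packet["src_port"],
           packet["dst_ip"], packet["dst_port"])
    return connections.get(key)


# The remote peer's view of one established connection to the VM.
peer_connections = {
    ("192.0.2.5", 40000, "198.51.100.9", 80): "established",
}

before_move = {"src_ip": "192.0.2.5", "src_port": 40000,
               "dst_ip": "198.51.100.9", "dst_port": 80}

# Same VM, same port, but a necessarily different address after the move.
after_move = dict(before_move, src_ip="203.0.113.5")

found_before = lookup(peer_connections, before_move)
found_after = lookup(peer_connections, after_move)
```

Here found_before is the established connection, while found_after
is None: the peer has no way to tell that the packet continues the
same conversation.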
III. Regarding "The VM provides a natural
cleavage point to do this." - to me it
doesn't, precisely because of these device/
communication issues. In an object/capability
system requests are abstracted from any
underlying device (e.g. network) infrastructure.
Because of this I believe the moving of
a running application can happen in a
relatively straightforward way along
the lines I outlined above (details). I
don't see how this can happen when moving
a VM with all the device issues that may
be ongoing - including, importantly,
network communication.
To me an object/capability interface seems
a much more natural 'cleavage point' for
moving running code because of its
greatly simplified interface (only
a generic "invoke").
I'm not sure how practically relevant this
business of moving a running application is,
but the topic interests me in any case, so
I'll be interested to hear the views of
others on this topic.
--Jed http://www.webstart.com/jed/