Architecture of Backing Store Descriptors

Bryan Ford baford@schirf.cs.utah.edu
Thu, 24 Nov 94 09:21:25 MST


>   One example would be a user-level disk device driver that is used
>   to page kernel data: it's perfectly OK to put it outside of the
>   kernel, as long as you recognize that it's still globally trusted.
>
>I understand what you are saying, and there is merit to it.
>
>For the kernel to entrust privileged state to a user-mode device
>driver that is part of the TCB is fine.  I wasn't arguing against
>that. In the example you give, the kernel can still make progress on
>other tasks while the pageout is in progress.
>
>The question is whether it is "safe" for the kernel to *block* on such
>a process. In my opinion this is unsound.  The opportunities for
>deadlock that arise when this is done are a big problem, and it is
>almost impossible to show that such deadlocks cannot occur.

Ah, I see.  What you're really talking about here is the classic
problem of upcalls; it's nothing specific to kernels.  You have
to be very careful when using upcalls in any part of a system,
but there are well-defined rules that can be followed to ensure
that upcalls will not cause deadlock or other similar problems.
If you haven't already seen it, you might want to read the famous
paper "Structuring Systems with Upcalls", or something like that,
in one of the older SOSP proceedings (I can dig up the specific
reference if you want).  You're right, upcalls are dangerous if
used naively, but disallowing them completely is like cutting off
half your fingers because they sometimes get in the way.

>[...how to determine when a persistent object is dead...]
>
>   How does KeyKOS do this?  Is there a reason why traditional techniques
>   like reference counting won't work?  
>
>Yes. How do you know which random bits in a random data page are
>actually bits that constitute an outstanding capability?  The process
>may allege to have released the capability, but there's no way to be
>sure that it really has.

As you've pointed out before, in the KeyKOS and Mach models,
random bits in random user-level data pages _never_ constitute
capabilities; only bits in kernel data pages can.  In Mach,
capabilities are represented in the kernel as pointers to
reference-counted data structures; both the pointer itself
and the thing it points to is always in kernel memory, and
never available to untrusted code.  Therefore, reference counting
works fine.

My original point was that reference counting should still work
fine even in the presence of full-system persistence: just checkpoint
all the kernel virtual memory holding the pointers and the referenced
objects, along with the untrusted user-level data pages.  When you
reset from the checkpoint, all the pointers will still be there,
will still point to the same things, and all the reference counts
will still be valid.

>KeyKOS doesn't reference count.  Instead, it keeps a serial number
>(called an allocation count) in each object.  When the object is
>reassigned, the allocation count is incremented.  Keys also contain
>the allocation count, and if the counts don't match the key is
>invalid.  This avoids the need to GC the disk.  Some further tricks
>are used to avoid bumping the count if none of the keys to the current
>object generation have gone to disk.

Well, that's even easier, and just fine if objects are always explicitly
destroyed/reassigned, rather than being "freed" automatically when
there are no more capabilities pointing to them.  I had the impression
that KeyKOS supported some kind of "no-more-capabilities detection",
because you suggested that implementing that was a problem.

BTW, how does KeyKOS deal with the risk of an allocation count wrapping
and "re-activating" long-dead keys?

>   If I could freeze a Mach system... there is a little additional
>   complexity in figuring out which kernel data is persistent and
>   which isn't, and how to handle interactions between persistent and
>   non-persitent data; but for the most part I think this is a
>   separate and orthogonal problem.
>
>It is seperate and orthogonal, but I think you underestimate the
>challenge, as well as the implications for kernel heap management
>(holes on the heap currently go away when the kernel reboots, at
>least!).  When you delete a port, how do you ensure that port
>authorities that have not been paged in yet are properly invalidated?
>Once the system is persistent in the fashion you suggest, there are
>likely to be far more processes by several orders of magnitude than
>you are perhaps used to thinking about.

OK, you're right about that one.  Under Mach's current strategy,
objects that get destroyed will remain allocated until all capabilities
pointing to them go away naturally, which in a persistent system
would create a security problem as well as a storage management
problem (you can revoke the right to use an object, but you can't
force the kernel memory containing the object to be releeased).

However, it might actually not be too difficult to fix Mach to
solve this problem.  All internal Mach kernel allocation is already
based on a zone system, which keeps objects of different sizes in
different memory pools.  Each object already has some kind of
"active" flag that indicates when an object has been terminated;
the active flag could be converted to a KeyKOS-like allocation count
that would allow the object's memory to be reallocated before all
outstanding references are dropped.

Then the only remaining difference from the KeyKOS approach would be 
that kernel objects can consume different amounts of memory.  The
advantage would be better packing of kernel objects; the disadvantage
would be that kernel memory in one zone could never be migrated
to another zone (e.g. a port can never become a thread or a task).

I can definitely see the benefits of KeyKOS's fixed-sized "nodes" in this
respect.  In fact, you're half way to convincing me that it would be a good
idea to change Mach's zone allocation system to go "whole hog"
KeyKOS-style, with fixed-size kernel objects; it might not be too
difficult, and I'd really like to support full-system persistence in Mach
as an option some time before too long.  I've been planning to make some
significant changes to the kernel memory allocation system before too long
anyway, to support multiple kernel virtual memory pools.

>   BTW, once again, I understand this is exactly how L3 works - all
>   kernel memory is virtual and pageable...
>
>Can you send me a reference to some papers on L3?

There was one in SOSP93, one in IWOOOS93 (that's the one I've seen
that deals with persistence in L3, but it isn't very detailed), as
well as a few less-relevant ones in other places.

>   >Consider a pernicious client process that has created a meter with no
>   >keeper and allocate 5 ticks of runtime to that meter.  The client
>   >calls a shared service which runs on the client's meter.  The quanta
>   >runs out midway through the service, and the client perniciously
>   >refuses to refill the meter.  Denial of service.
>
>   ...to the client itself.  Big deal.  As long as the server is correctly
>   written under the "migrating-resource" model, a malicious client won't
>   impede the functioning of the server for other clients.
>
>I think you're mistaken.  The server can no longer service anyone,
>because it is halted in the middle of processing the client's request,
>and is therefore not waiting for requests.  In mach, I suppose this
>only ties up a single server thread, but threads are not free!

You're still not thinking "migrating threads"! :-)  The thread
(schedulable entity) is supplied by the client, and the activation
it runs on in the server, while being in the server's address space,
is allocated from the client's pool of server-trusted resources.

As William Franz just pointed out, this can be done manually in KeyKOS, at
least to some extent.  The equivalent situation in KeyKOS would be a client
providing the server with some server-trusted pages, a node with which to
create a domain in the server dedicated to servicing the client's requests,
and a meter supplying CPU time to that domain.  If the client stops
supplying CPU time to the domain, it'll stop making progress, but will only
be wasting the client's own resources.  (This assumes a priority
inheritance mechanism exists for locks in the server - does KeyKOS support
that too?)

(BTW, note that the client-directed memory allocation isn't yet implemented
in Mach, so we still have a big security hole at the moment... :-) )

>   Similarly, you might object that the stopped thread may be holding locks in
>   the server and preventing other clients' threads from making progress; but
>   in fact this is just the classic unbounded priority inversion problem, and
>   any traditional priority inheritance scheme will solve the problem.
>
>What this has done is transfer the burden.  The pernicious client has
>succeeded in imposing extra work on other clients.

Not really, because if the client wasn't pernicious, its domain in the
server would still need to contend for those locks with other server
threads during normal, proper functioning.  Locking overhead is a fact of
life, which servers know how to work around.  In the case of a pernicious
client, once the client stops supplying CPU time, the thread in the server
will get "pushed" to a state in which it is not holding any active server
locks and will not cause additional overhead thereafter.  The locking
overhead that can be caused by the pernicious client is basically some
fixed constant times the number of locks held at the point the CPU gets
taken away.  That upper bound in overhead is no more than could be expected
to occur once in a while anyway, during normal lock contention.

>   ...  Thus, the KeyKOS kernel operates purely
>   under the "client-provided-resuorces" model, even though for user-level
>   programs it only directly supports the traditional model.
>
>I must be missing something obvious.  It seems to me that user-level
>programs in KeyKOS operate purely under "client provided resources" as
>well, with the arguable exception of those processes that directly own
>authority to hard physical resources like the disk.  In what sense is
>this not true?

As Willam Franz pointed out, it's more true than I had thought.  Indeed,
the "client-provided, server-trusted space bank" model for memory
allocation is essentially what I've been planning to move Mach toward.  The
only thing that I suspect KeyKOS may still be missing is priority
inheritance on locks; if that's the case, it means servers can only accept
trusted meters.  Probably not a really big limitation, but we're planning
to support priority inheritance in Mach for other reasons anyway (although
possibly as an option, depending on how much overhead it incurs).

BTW, sorry for such long messages; but I think we're really making good
progress in coming to an understanding about how NewSys and/or Mach should
be designed.

				Bryan