[eros-arch] large memories, mappings, and SMP
Jonathan S. Shapiro
08 Aug 2002 14:49:04 -0400
As we restructure the EROS kernel, I have been considering the
mapping-related issues that go with large memories. In short form, the
question is: now that the kernel virtual space is too small to hold a
complete map of physical memory, how do we adapt things so that this is
not required? For the most part, the kernel doesn't *need* mappings for
pages, so in practice this isn't all that hard to do. The main issues
are defining what the new kernel virtual map should look like and
figuring out how to deal with mapping pages.
Mapping pages are a nuisance, and they will come to be *more* of a
nuisance in an SMP implementation. Here are my present thoughts on them
and a couple of questions at the end.
The EROS kernel accesses mapping tables in three situations:
1. as part of "FromSpace" in a capability invocation
2. as part of "ToSpace" in a capability invocation
3. as part of the cleaning mechanism
The last is the main source of difficulty, as I will shortly explain.
But first, some observations:
+ To manipulate PTEs the containing mapping table must be
mapped into kernel virtual space
+ While these mappings can be done ephemerally, it would
be desirable to have some very fast bulk mapping mechanism
for the capability invocation path so that the kernel
virtual addresses of PTEs implicated in the operation
can be easily computed and these PTEs can be easily
+ The available kernel virtual space is insufficient to
allow the kernel to simultaneously map the *mapping tables*
of all processes unless they are mapped on a "page at a time"
basis. A full set of L1 mapping tables potentially would
occupy a 4Mbyte window of kernel virtual space, and
that is big enough that the kernel would not be able to
support a sufficient number of windows.
+ In practice, of course, mapping tables are very sparse.
The kernel certainly *can* map simultaneously all of the
mapping tables that are actually in use at any given time.
What it cannot do is map them in such a way as to make
address computations to locate some particular PTE simple.
Let me momentarily ignore the cleaner, and describe what I have in mind
for the next-generation uniprocessor kernel. I'll then talk about SMP,
and the issue that is currently disturbing me [I'm thinking about SMP
now because this stuff is hairy enough that I don't want to have to
rearchitect it again.]
Self Windows and To Space Windows
My plan at the moment is to reserve a 4M-aligned window within the
kernel *virtual* address map where every process will have it's own
mapping tables mapped. I will call this the "self space window". These
mappings will be supervisor-accessable only. Whenever you introduce a
new L1 page table (the one whose entries point directly to the pages),
you have to write a new entry into the top-level page table anyway. The
idea is that whenever you do this you simultaneously insert an
essentially identical PTE into the kernel-accessable self space window.
This ensures that the kernel always has the page tables for the current
process mapped in kernel-accessible form, and in such a way that PTE
offset computations are very simple.
Similarly, every address space will have a reserved 4Mbyte kernel-mode
window for the mappings associated with "to space". Unlike the from
space window, which is permanently associated with a given process, the
to space window is ephemeral: it is established during capability
invocation by copying a PDE (page directory entry -- a page table entry
in the upper level mapping table). If a data string is being moved into
a new process, the kernel will at some point early in the path perform:
Since this mapping is ephemeral, care must be taken to ensure proper
invalidation when a kernel invocation blocks, but that is not unduly
The above design is mildly augmented by having a unique, reserved L1
mapping table containing no valid mappings. Whenever a page table is
removed from an address space, it is replaced by a reference to this
invalid table in both the upper level page table and also in the self
space window. The purpose of this is to maintain the property that the
PTE for any given location in the current process or the to-space
process can always be validly read by a straightforward "offset and
read" computation -- all locations in the self space window or (if
mapped) the to space window are valid. That is, each has a PTE. The PTE
itself may not have its valid bit set. The key point is that a blind
offset and read operation into the self window or to space window will
never lead to a kernel page fault.
This design should make interprocess invocations somewhat simpler than
they are now. When coupled with the forthcoming capability register
rename trick I think they will speed things up quite a lot. In
particular, the receive string destination, if valid, can now be mapped
in a single word move as a side effect of establishing the ephemeral
to-space mapping. I'm a bit embarassed that I didn't think of this years
However, we now come to a problem: the cleaner and it's interactions
with the self space and to space mappings.
The Cleaner Problem
In a uniprocesor, the cleaner can conceptually run during any
invocation, but in practice its operation is well localized and all of
its operations can safely be performed using ephemeral mappings. The
main case where this arises is when the cleaner must invalidate PTEs.
The victim page table is not necessarily a page table in either the
invokee or invoker address space. Therefore, the cleaner needs to
proceed pessimistically by creating an ephemeral kernel mapping for this
page and using that mapping to appropriately overwrite the PTEs.
Alternatively, we could reserve yet *another* window of the kernel
virtual space to be a window in which all active mapping tables are
mapped (in no particular order), and use this rather than an ephemeral
mapping. In some respects I favor this, as invalidating ephemeral
mappings can get pretty expensive. My concern is that the proper choice
of size for this window is not immediately obvious, and is not likely to
be robust as main physical memory sizes continue to scale up.
In the current uniprocessor implementation the cleaner sets a global
flag "PteZapped" whenever it alters a PTE as a consequence of cleaning.
The capability invocation path, when it reaches the final go/no-go
decision (the COMMIT_POINT) checks this flag. If PteZapped is set, it
means that any assumptions previously established concerning the
validity of mapping entries during the invocation setup phase may have
been violated, and the invocation is restarted in order to re-check the
mapping entries. This works because it is guaranteed that the cleaner
cannot be invoked following the COMMIT_POINT(). It's very clean on a
uniprocessor, but I realized last night that it won't work at all on an
The problem on an SMP is that there is a fundamental race condition
between the cleaner and any other CPU performing a kernel invocation.
The nature of the race is that the cleaner and the invocation might run
on different processors, and might interleave arbitrarily.
It's fairly obvious that engaging a lock on a per-mapping-table basis
for the duration of each invocation can overcome this race, but I am
concerned about the potential expense of such locks. Basically, we need
to perform an "ephemeral lock" of each mapping table involved in the
current capability invocation to ensure that such a mapping table will
not be altered by a cleaning agent running on some other processor.
The self space and to space windows help, because they allow us to
quickly determine the physical address of the mapping tables, and
therefore to determine the corresponding core table entries, which is
where the lock must reside.
The good news is that we have a very elegant mechanism for doing
ephemeral SMP locks of this kind which is very low overhead. The bad
news is that we still need to grab the locks, and this is a substantial
overhead in the fast interprocess path.
So finally my question:
Does anybody see a means to prevent a priori this interaction between
the invocation path and a cleaner on another processor WITHOUT the need
to do an ephemeral lock on each implicated page table? Offhand, it seems
to me that this is only possible if interleaving of the cleaner and
string-carrying invocations can be precluded. The only way I can see to
do this is to run the cleaner exclusively, and doing that would seem to
be unscalable and contrary to sound multiprocessor design.
I welcome suggestions, thoughts, or pointers to helpful design
constraints I may have missed that might dig us out on this. Offhand I
do not see one here.