IPC performance -- cap copy
shapj@us.ibm.com
shapj@us.ibm.com
Sun, 26 Dec 1999 10:44:40 -0500
As those of you on the eros-arch list know, I have been unhappy with the
cost of the capability copy in the EROS IPC implementation. This morning, I
had what is either a brainstorm or a brain-misfire. As it involves the
application of general and well-known techniques in obvious ways (anybody
who groks GC would suggest it), I have concluded that it is not patentable,
and that I can therefore ask the list to check my analysis of the
performance costs.
The following is concerned with examining the cost of copying a capability
from a source slot to a destination slot during IPC. Because of the linked
list issue (described below), I have been considering moving to a more
garbage-collection oriented design for a long time. In the interim, I've
been looking for a way to sanity check whether this is a good idea without
overhauling the entire kernel. I believe that I have found one.
I would greatly appreciate it if you might take a moment to check my
overhead analysis on the obvious alternatives to see if I may have missed
something.
CURRENTLY:
On average, 80% of the capabilities in the capability registers of the
cached processes are prepared. They are all either gate or memory
capabilities, and if you have them, you needed them. Every prepared
capability is in a doubly linked list.
If the capability in the destination slot is prepared, you must first
unlink it from it's doubly linked list. In practice, its neighbors are
never in the data cache, so we have, on average
80% * 2 miss == 1.6 miss avg.
After the copy, the source capability is known to be in the data cache, but
it's neighbor is never in the data cache. We therefore take
80% * 1 miss == 0.8 miss avg.
or an average of 2.4 cache misses per capability copy. Since cache misses
run 20-40 cycles these days, I'm looking for a way to be rid of that.
ASIDE:
I thought about adapting techniques from register rename, but all of this
operations have copy semantics, and register rename doesn't help with that.
NOTION:
This morning's brainstorm was to change the process structure as follows:
1. Place all capability register values (i.e. the capabilities) in a side
table. If there are N processes in the process cache, this side table has
at least N * 32 + R entries. Initially all entries in the side table are
placed on a singly-linked free list.
The reason to have R extra entries is that every resume capability that is
generated requires a new slot, and the initial condition is that all slots
are occupied. Since, on average, every other IPC creates a resume key, we
may assume that the collection of dead entries must occur every 2R IPC
operations.
2. Each process carries *pointers* to entries into this table rather than
the capabilities themselves. When capabilities are transferred during IPC
the pointers are copied directly, with the result that entries in the
capability table may become aliased.
The question is how to collect dead entries in the table, and how often one
is forced to do so.
** Strategy 1: Mark/Sweep GC
In this strategy, we build a second side table of mark bits. When the free
list is exhausted, all mark bits are cleared. Whether 0 or 1 is considered
clear may be architecture dependent. The important part is that "clear" and
"set" have distinct values. The algorithm now runs through the process
table entries picking up the pointers into the capability table. It
determines, for each pointer, the offset into the capability table of the
corresponding entry by computing
offset = (ptr - capTabBase) / 4
and sets the corresponding mark bit. It then makes a sequential pass
through the mark bits, placing all remaining clear entries on the free
list.
The key to efficiency in this strategy is to keep the mark bits table in
the cache. The linear passes through the process cache and the capability
table do not have deeply chained data/control dependencies, and will
execute well in a modern superscalar (P-II/P-III) or EPIC processor.
If we have N entries in the process cache, each with 32 registers, the mark
bits occupy N+R/32 words. This figuer is both adequately generous for most
uses and fits in the cache comfortably. At 1024 entries or above I would
begin to consider blocking processes and side tables in clusters of 1024 in
the interest of collection locality.
On the x86, the mark table will require (N + R/32)/8 cache misses to load
(32 byte cache lines) into the cache, the register slots (the pointers)
will take (N * 32) / 8 cache misses to traverse, and the free list will
take (N * 32) / 2 cache misses to reconstruct (because capabilities are 16
bytes each, this pass has alignment constraints).
So the total cost of the collection, ignoring conflict misses, will be
N/8 + R/4 + N*4 + N*16
or
=~ N*20 + R/4
and this cost will be incurred every 2R invocations. The overhead per
invocation is therefore:
(N*20+ R/4)/2R == 10N/R + 1/2 =~ 10N/R
If we choose N and R at 512 and 2048 respectively, we get a per-invocation
overhead of 2.5 cache misses, incurred in chunks of 10752 misses every 4096
invocations.
While it's clear that we can adjust R to adjust the overhead, the
magnitudes of the numbers involved don't strike me as promising. The real
problem lies in the generation of the resume key, and the fact that this
means every other IPC grabs a free list entry. Otherwise, we could incur
this cost every 2R process *unloads*, which is a much more reasonable
number and would let us do the entire collection incrementally. More on
this in a moment.
Strategy 2: Reference Counting
The second strategy is to use reference counts instead of mark bits. When a
capability is copied, the count associated with the target slot is
decremented and the count associated with the source slot is incremented.
Count values of zero indicate available slots.
In practice, neither count value will be in the cache, and they are
unlikely to reside in a common cache line. This design therefore places a
minimum penalty of 2 cache misses per cap copy.
When a count goes to zero, we can either place the slot on the free list
immediately (one cache miss per IPC to get the free list head, but no cache
miss per cap copy) or we can defer the reclamation to a later GC pass,
which takes (32N + R)/8 cache misses every 2R (worst case) IPCs, so:
(4N + R/8) / 2R == 2N/R + 1/4 =~ 2N/R misses per IPC.
Summary:
No straightforward design leaps out that solves the problem. The reference
counting strategy is a slight improvement, but probably not worth the cost.
It's better than it looks because the zero number capability doesn't need
to be refcounted and can be encoded by a zero pointer, but still...
The problematic issue is the rate of capability fabrication -- or more
precisely, the rate at which capabilities are loaded into the capability
registers by operations *other* than IPC-driven copy.
As an aside, note that this rate is no more than one per invocation. IPC's
generate a new resume capability, and no kernel invocation produces more
than one new capability. Kernel invocations account for less than 20% of
invocations. Further, most kernel invocations yield a straightforward
transform of an inbound capability. If some filtering trick could be used,
and if resume capabilities could be cleverly encoded, the rate would drop
much further.
I am off pondering a hybrid design.
Jonathan S. Shapiro, Ph. D.
Research Staff Member
IBM T.J. Watson Research Center
Email: shapj@us.ibm.com
Phone: +1 914 784 7085 (Tieline: 863)
Fax: +1 914 784 6576