GC & The "Expensive Object" Problem
Mark S. Miller
markm@caplet.com
Thu, 06 May 1999 00:55:26 -0700
I've always been amazed that Python refcounts rather than GCs. Lots of
empirical work has shown that GC is more efficient than refcounts (even
ignoring cycle-leakage of course), and this makes sense on first principles:
refcounting costs proportional to the number of pointer operations,
whereas gc costs approximately proportional to the number of allocations.
Ping pointed out to me an interesting advantage of Python's refcounting
scheme: some objects retain external resources, like file descriptors, and
naturally release these when they are finalized. Refcounting certainly
succeeds at getting rid of objects immediately, whereas gc may delay their
collection arbitrarily. Worse, efficient garbage collectors are
generational. This makes worst-case collection times much longer, since the
garbage object may have been tenured into a generation that's only examined
every blue moon.
Normally, the delayed collection of gc vs refcounting isn't a problem, since
the only extra cost is storage, and when we need more, we gc anyway.
However, when the object is retaining some externally-expensive resource,
the gc has no knowledge of this expense, and so optimizes the wrong thing
out of ignorance.
Even for these expensive objects, we don't particularly need the immediacy
guarantee that a refcounter provides, we simply need assurance that there
will only be a reasonably small delay -- ideally bounded -- between when the
object becomes garbage and when it's finalized. In talking about it, Ping &
I came up with a plausible scheme:
The Miller-Yee Scavenging Variation
Assume an object can statically self-declare itself "expensive" in such a
way that the garbage collector can tell. By "statically", here I mean that
it's a permanent creation-time attribute of the instance (or even class).
In Java, one might use a marker interface. In a generation-scavenging gc,
these objects are allocated in ephemeral space, as are all others. However,
no matter how often they survive a scavenge, they are never tenured into an
older space. Instead, they are copied into the beginning of the new
ephemeral space. If too much time goes by without a scavenge, we can use a
timeout to scavenge anyway, just to ensure that expensive garbage gets
collected promptly.
Rather than allocating a fixed amount of memory to ephemeral space, we
instead allocate a fixed amount of head-room: the remaining free ephemeral
space above the surviving expensive objects. If we end up with too many
expensive objects, a scavenge can end up taking unpleasantly long, but if
expensive objects aren't rare, one is in a truly expensive situation.
Elaborations with more generations or distinctions are certainly possible,
if distressing.
This is all relevant, long term, for distributed objects (whether E or RMI),
as a Proxy is a perfect example of an expensive object. A Proxy retains
storage of an object in a foreign address space, beyond the awareness of its
host gc. Declaring a Proxy expensive advises the local gc to spend somewhat
more time in the hopes of saving other vats possibly much more space. (This
assumes a desire to cooperate without compensation, as is quite common in
computation. When compensation would help, more interesting arrangements
would be called for.)
Since it would benefit RMI as well, there's more hope of convincing some
clean-room jvm competitors that it may be worth a shot.
Cheers,
--MarkM