GC & The "Expensive Object" Problem

Mark S. Miller markm@caplet.com
Thu, 06 May 1999 00:55:26 -0700


I've always been amazed that Python refcounts rather than GCs.  Lots of 
empirical work has shown that GC is more efficient than refcounting (even 
ignoring cycle leakage, of course), and this makes sense on first 
principles: refcounting's cost is proportional to the number of pointer 
operations, whereas gc's cost is approximately proportional to the number 
of allocations.
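
To make the accounting concrete, here is a toy sketch in Java (every name 
in it is invented for illustration; it is not a real collector, just a 
tally of where the work would land under each scheme):

    // Under refcounting, every pointer write pays a bookkeeping cost;
    // a tracing gc's work is instead driven by allocations and scavenges.
    class CostSketch {
        static long refcountOps = 0;   // work a refcounter would do
        static long allocations = 0;   // what a tracing gc's cost tracks

        static void assign(Object[] slot, Object value) {
            refcountOps += 2;          // one increment, one decrement
            slot[0] = value;
        }

        public static void main(String[] args) {
            Object[] slot = new Object[1];
            Object o = new Object();
            allocations++;
            for (int i = 0; i < 1000 * 1000; i++) {
                assign(slot, o);       // a million pointer writes, one allocation
            }
            System.out.println(refcountOps + " refcount ops vs "
                               + allocations + " allocation");
        }
    }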

Ping pointed out to me an interesting advantage of Python's refcounting 
scheme: some objects retain external resources, like file descriptors, and 
naturally release these when they are finalized.  Refcounting reclaims an 
object the moment its last reference goes away, whereas gc may delay its 
collection arbitrarily.  Worse, efficient garbage collectors are 
generational.  This makes worst-case collection times much longer, since 
the garbage object may have been tenured into a generation that's only 
examined once in a blue moon.

Normally, gc's delayed collection (relative to refcounting) isn't a 
problem, since the only extra cost is storage, and when we need more, we 
gc anyway.  However, when an object is retaining some externally expensive 
resource, the gc has no knowledge of this expense, and so optimizes the 
wrong thing out of ignorance.
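
As a concrete (if hypothetical) illustration: to the collector, the object 
below is a few dozen bytes of heap; the scarce descriptor behind it is 
invisible.

    import java.io.FileInputStream;
    import java.io.IOException;

    // A small heap object standing in front of a scarce external resource.
    class LogReader {
        private final FileInputStream in;   // consumes a file descriptor

        LogReader(String path) throws IOException {
            in = new FileInputStream(path);
        }

        // Runs only when the collector finally notices the object is
        // garbage, which may be arbitrarily late if it has been tenured.
        protected void finalize() throws IOException {
            in.close();
        }
    }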

Even for these expensive objects, we don't particularly need the immediacy 
guarantee that a refcounter provides; we simply need assurance that there 
will only be a reasonably small delay -- ideally bounded -- between when the 
object becomes garbage and when it's finalized.  In talking about it, Ping & 
I came up with a plausible scheme:


		The Miller-Yee Scavenging Variation


Assume an object can statically self-declare itself "expensive" in such a 
way that the garbage collector can tell.  By "statically", here I mean that 
it's a permanent creation-time attribute of the instance (or even class).  
In Java, one might use a marker interface.  In a generation-scavenging gc, 
these objects are allocated in ephemeral space, as are all others.  However, 
no matter how often they survive a scavenge, they are never tenured into an 
older space.  Instead, they are copied into the beginning of the new 
ephemeral space.  If too much time goes by without a scavenge, we can use a 
timeout to scavenge anyway, just to ensure that expensive garbage gets 
collected promptly.
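
A minimal sketch of what this could look like, assuming a hypothetical 
Expensive marker interface and letting ordinary lists stand in for the 
spaces (a real scavenger, of course, copies raw objects and traces roots):

    import java.util.ArrayList;
    import java.util.List;

    /** Marker interface: a permanent, creation-time declaration that an
     *  instance retains an externally expensive resource. */
    interface Expensive {}

    class ScavengerSketch {
        static final long MAX_MILLIS_BETWEEN_SCAVENGES = 1000;  // assumed

        List<Object> ephemeralSpace = new ArrayList<Object>();
        List<Object> olderSpace = new ArrayList<Object>();
        long lastScavengeMillis = System.currentTimeMillis();

        void scavenge() {
            List<Object> newEphemeralSpace = new ArrayList<Object>();
            for (Object survivor : findSurvivors(ephemeralSpace)) {
                if (survivor instanceof Expensive) {
                    // Never tenured: copied to the start of the new
                    // ephemeral space, so it is reexamined every scavenge.
                    newEphemeralSpace.add(survivor);
                } else {
                    // Ordinary survivors age into the older space (a real
                    // scavenger tenures only after several scavenges).
                    olderSpace.add(survivor);
                }
            }
            ephemeralSpace = newEphemeralSpace;
            lastScavengeMillis = System.currentTimeMillis();
        }

        // If too much time goes by without a scavenge, force one so that
        // expensive garbage still gets finalized promptly.
        void scavengeOnTimeout() {
            if (System.currentTimeMillis() - lastScavengeMillis
                    > MAX_MILLIS_BETWEEN_SCAVENGES) {
                scavenge();
            }
        }

        // Placeholder: a real collector traces from the roots to decide
        // which ephemeral objects are still reachable.
        List<Object> findSurvivors(List<Object> space) {
            return new ArrayList<Object>(space);
        }
    }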

Rather than allocating a fixed amount of memory to ephemeral space, we 
instead allocate a fixed amount of head-room: the free ephemeral space 
remaining above the surviving expensive objects.  If we end up with too 
many expensive objects, a scavenge can take unpleasantly long; but then, if 
expensive objects aren't rare, one is in a truly expensive situation 
anyway.  Elaborations with more generations or distinctions are certainly 
possible, if distressing.
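
A back-of-the-envelope sketch of the head-room policy, with made-up 
numbers: after each scavenge, the ephemeral space is sized so that the free 
space above the surviving expensive objects stays constant.

    class HeadRoomSketch {
        static final long HEAD_ROOM_BYTES = 4L * 1024 * 1024;  // assumed

        static long nextEphemeralSize(long survivingExpensiveBytes) {
            return survivingExpensiveBytes + HEAD_ROOM_BYTES;
        }

        public static void main(String[] args) {
            // With 1MB of surviving expensive objects the next space is
            // 5MB; with 64MB of them it balloons to 68MB and scavenges
            // get unpleasantly slow.
            System.out.println(nextEphemeralSize(1L * 1024 * 1024));
            System.out.println(nextEphemeralSize(64L * 1024 * 1024));
        }
    }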

This is all relevant, long term, for distributed objects (whether E or 
RMI), as a Proxy is a perfect example of an expensive object.  A Proxy 
keeps alive an object in a foreign address space, storage that lies beyond 
the awareness of the local gc.  Declaring a Proxy expensive advises the 
local gc to spend somewhat more of its own time in the hope of saving other 
vats possibly much more space.  (This assumes a desire to cooperate without 
compensation, as is quite common in computation.  When compensation would 
help, more interesting arrangements would be called for.)
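
For instance (a hedged sketch, reusing the hypothetical Expensive marker 
from the earlier sketch; Vat and its release message are likewise assumed 
names, not E's or RMI's actual protocol):

    // A local stand-in for a remote object, declared expensive because the
    // storage it pins lives in a foreign address space.
    class RemoteProxy implements Expensive {
        private final Vat hostVat;     // the foreign vat holding the target
        private final long targetId;   // identifies the target over the wire

        RemoteProxy(Vat hostVat, long targetId) {
            this.hostVat = hostVat;
            this.targetId = targetId;
        }

        // When the local collector finalizes the proxy, the foreign vat can
        // drop its end and reclaim the (possibly much larger) target.
        protected void finalize() {
            hostVat.release(targetId);
        }
    }

    interface Vat {
        void release(long targetId);
    }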

Since it would benefit RMI as well, there's more hope of convincing some 
clean-room jvm competitors that it may be worth a shot.


	Cheers,
	--MarkM