Happy Happy Joy Joy (was: On to Hydro)

Jonathan S. Shapiro shap@eros-os.org
Wed, 23 Aug 2000 14:04:00 -0400


> I see specifying the number of buckets to start with as a separate
> issue from specifying the hash function

This doesn't make sense to me, as different numbers of buckets tend to go
with different hash functions. A single hash function will not generate good
distributions for arbitrary selections of bucket counts. This is
particularly true when the number of buckets is prime.

> In a persistent programming language, there is no difference between
> these two cases. The hash code is always calculated based on the value
> of the immutable key object. The reference to a mutable object can be
> thought of as, and is implemented as, an immutable token identifying
> the object. When you insert the reference into a hashtable, you are
> actually inserting this immutable token.

I think you meant in an object-oriented language?

In your design, there appears to be no way for the hash implementation to
know if it should hash on the object or some sub-portion of the object.

I am currently building a system in which I rely on being able to compute
cryptographic hashes of the content using an algorithm that I specify. In
the case of some objects, a portion of the content is generated "on the fly"
while generating the hash.

I use this, among other things, to compare distinct objects -- one that is
written and therefore an "entity" object and another (a WorkspaceEntity)
that is preparing to be written and may (if written) *become* an entity
object. The goal is to determine whether the WorkSpace entity differs from
its predecessor, which is an Entity. It should only be written and uploaded
if it differs.

This case has several properties that seem to bear on our discussion:

1. I must control the hash function, since the output is being used as an
identifier and the specific choice of hash function is therefore crucial. It
would be most unfortunate to be forced to use some other hash internally.
2. There isn't any "key" in the sense that you seem to be describing.
3. I am really hashing two distinct object types based on the content of
their respective serializations. Hashing on identity therefore doesn't help.
4. There isn't enough memory to serialize the objects to strings and then
hash those -- a hash function that internally streams the object is
required.

> I don't see a reason to support multiple hash codes, and it's
> certainly possible for the runtime to generate a good hash of the
> object's value.

I don't believe that in the case I have described the runtime has sufficient
information to know what to hash.

My point, I think, is that one wants to allow a layer of indirection in both
comparators and hash operators. That is, one wants in general to always
compute

    hash(f(object))

where f() is most commonly, but not necessarily, the identity operator. My
rationale for allowing the user to control the hash function is that it both
allows us to change f() and also allows the result to be used elsewhere in
the system.

shap