Re: merge triple identification Jonathan S. Shapiro (shap@eros-os.org)
Wed, 28 Jun 2000 09:43:40 -0400

Okay. I've been thinking about how PRCS uses the entity identifier for merge, and I've come to two conclusions:

  1. It's a good idea.
  2. I'm going to have to use a variation.

Josh: since you've been down this rathole already, could you check my assumptions below? In particular, are there other places where family ID's are essential that I may not yet have noticed?

Background:

The entity identifier, which I think of as the "family identifier" is common to all versions of a particular object. The general idea is to have a name for the set of all versions of "foo.c". I originally avoided a family identifier because I wanted to be able to do join/split, but I subsequently concluded that join/split was a bad idea.

Onwards:

The main problem with family identifiers is that it is not clear how to generate them in a globally unique way. In the current PRCS design (as I understand it) these numbers are locally unique on a given server, but two servers may assign family identifiers to the same content in different ways. I haven't thought this out fully, but I think that given the way DCMS does replicate on demand you can get into trouble. That is, given an object with five versions

o.1 o.2 o.3 o.4 o.5

It's possible that replication will have caused o.1 and o.2 to have been replicated generating one family identifier, o.4 o.5 replicated later to generate a different one, and since o.4 isn't present the connection won't get made between the two chains.

One strategy I thought about for generating universally unique family identifiers was to simply use the SHA-1 of the initial checkin, but somebody will inevitably check in a bunch of files that contain identical copyright messages (and nothing else) and thereby generate the same family identifier for independent objects.

Assumptions:

In practice, however, it appears to me that the family identifier doesn't need to be universally unique. When you traverse an object's history you use the predecessor chain, not the family identifier, so it's not used there. The main purpose of the family identifier appears to be narrowing the search space during the merge algorithm. In fact, I don't really see any other purpose for them.

Proposal:

If this is true, then it is sufficient to have a good enough random number. I think I'ld probably concatenate this with the SHA-1 hash of the first object that is a member of the family, and use the resulting number as the family identifier. This is NOT a globally unique identifier. In fact, I hereby christen this number the "NOID" (Not an Object IDentifier).

When performing a merge, you proceed as follows:

First sort the objects in the current and merge sets according to their noids. Some will be singletons (i.e. they appear in one set but not the other). Move the singletons to a separate singleton list for later processing.

Now begin the real work by considering the objects that are left. Partition these by noid. There should be a large number of partitions (roughly equal to the number of distinct object pairs in the merge), each consisting of a small number of objects (usually two). For each partition, apply the expensive exhaustive search described in my original mail until either (1) no further objects remain or (2) the algorithm has exhausted the search within the partition. In the latter case, stick all remaining objects in the singleton list.

Using the noid has the effect of greatly reducing the complexity order of the search. The algorithm itself doesn't get any better, but instead of needing to do an all-pairs comparison, you can almost always use a per-pair comparison. Further, in the bad cases you still have reduced the "all" to
"all items that collide on a noid".

If no common ancestor is found, we presume that this was an accidental collision and move the two objects into the respective "singleton" lists -- i.e. into the lists of objects that appeared in one configuration but not the other.

Unless I missed something obvious, this ought to reduce the order of the algorithm about as far as is possible.

Oh yes. The real reason to call it a noid is that when two many objects need to be considered at once I get annoid.

Sorry. Bad nomenclature puns are required by Bell Labs, EROS Group, and Xanadu traditions.

shap