Re: Another PRCS/XDELTA question for Josh
Josh MacDonald (jmacd@CS.Berkeley.EDU)
Tue, 18 Apr 2000 10:30:50 -0700

Quoting Jonathan S. Shapiro (shap@eros-os.org):
> Josh:
>
> I want to make sure I have enough information expressed at the server
> interface to be able to apply XDELTA or XDFS later. At the moment, I simply
> have an interface of the form put(universal-name, byte buffer). I think this
> should probably be augmented with:
>
> revise(old-entity-universal-name, new-universal-name, byte buffer)
>
> If I use an interface in this style, does it provide enough information that
> XDELTA and your cool DB-based file system can get the job done?

First, I'll say that you might want to avoid keeping a whole file in memory, as a byte-buffer interface requires. The XDFS interface is very flexible. You provide a "family identifier" up front to associate related versions; this would be the name of an RCS file, for example. All of the XDFS calls operate in the context of a family, so you have to keep track of which versions are related to each other. Then you can simply say (roughly speaking) xdfs_insert (family, file handle). XDFS will read the file handle you supply, do its work, and return an Inode to the caller.
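
To make the insertion flow concrete, here is a minimal Python sketch. It is not the XDFS API: the class ToyStore, its methods, the chunk size, and the one-file-per-inode layout are all assumptions made for illustration. It shows only the shape of the call: the caller names a family and hands over a file handle, the store reads it in chunks so the whole file is never held in memory, and an inode number comes back.

```python
import io
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # arbitrary read size chosen for this sketch


class ToyStore:
    """Hypothetical stand-in for a versioned store; not the real XDFS."""

    def __init__(self, root):
        self.root = root          # directory where versions are spooled
        self.families = {}        # family id -> inode numbers in local order
        self.next_inode = 1

    def insert(self, family, fh):
        """Stream fh to disk chunk by chunk and return a new inode number."""
        inode = self.next_inode
        self.next_inode += 1
        path = os.path.join(self.root, str(inode))
        with open(path, "wb") as out:
            while True:
                chunk = fh.read(CHUNK_SIZE)
                if not chunk:
                    break
                out.write(chunk)
        # Record the version under its family so related versions stay together.
        self.families.setdefault(family, []).append(inode)
        return inode

    def open(self, inode):
        """Return a readable handle on the stored version."""
        return open(os.path.join(self.root, str(inode)), "rb")


store = ToyStore(tempfile.mkdtemp())
first = store.insert("foo.c,v", io.BytesIO(b"version one\n"))
second = store.insert("foo.c,v", io.BytesIO(b"version two\n"))
```

The family name "foo.c,v" mirrors the RCS-file example above; any identifier that groups related versions would do.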

> The reason I ask is that your master's thesis appears to assume that files
> are sequenced objects (I may have failed to understand something here). This
> creates a problem in distribution, because the sequence numbering cannot be
> resolved locally. I have therefore adopted a design in which every object
> version is conceptually distinct, and the delta management is an artifact of
> the store implementation. The API clearly needs to provide enough hints that
> the server can do something vaguely intelligent about creating deltas, but
> must not introduce a serialization requirement. I can see no inherent reason
> why the "base" file of any given delta cannot be arbitrarily chosen. The
> most efficient delta is likely to come from a previous version of the same
> file, but there is no law of nature that says this will always be true.

The XDFS interface keeps a LOCAL sequence number; it does not force any sort of serialization. There are two ways you can name versions. The first is to use a message digest function, which XDFS supports automatically. As XDFS reads the file handle supplied to xdfs_insert, it computes the message digest, and when it is done it indexes the version by its digest value (ignoring duplicate insertions). With this option, XDFS provides a method to look up versions by their digest value.

The second is to build your own indexing scheme. Since XDFS returns an Inode to the caller of xdfs_insert, you can link that Inode into an index of your choice using whatever name you like. XDFS is integrated with a file system abstraction so that things like this work.
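
The two naming options can be sketched as follows. This is an illustration only: DigestIndexedStore, its methods, and the choice of SHA-256 are assumptions, not XDFS names or the digest XDFS uses, and the toy keeps contents in memory purely for brevity. The digest is computed while the handle is being read, a duplicate insertion returns the existing inode, and a caller-side dictionary plays the role of a custom index.

```python
import hashlib
import io


class DigestIndexedStore:
    """Hypothetical store that indexes each version by its message digest."""

    def __init__(self):
        self.by_digest = {}   # hex digest -> inode number
        self.contents = {}    # inode number -> bytes (in memory for brevity)
        self.next_inode = 1

    def insert(self, fh):
        """Hash the data as it is read; ignore duplicate insertions."""
        hasher = hashlib.sha256()  # digest choice is an assumption
        parts = []
        for chunk in iter(lambda: fh.read(65536), b""):
            hasher.update(chunk)
            parts.append(chunk)
        digest = hasher.hexdigest()
        existing = self.by_digest.get(digest)
        if existing is not None:
            return existing       # same content already stored
        inode = self.next_inode
        self.next_inode += 1
        self.contents[inode] = b"".join(parts)
        self.by_digest[digest] = inode
        return inode

    def lookup(self, digest):
        """Find a version by digest value, or None if it is unknown."""
        return self.by_digest.get(digest)


store = DigestIndexedStore()
i1 = store.insert(io.BytesIO(b"hello"))
i2 = store.insert(io.BytesIO(b"hello"))   # duplicate: same inode comes back
i3 = store.insert(io.BytesIO(b"world"))

# Option two: the caller links inodes into an index of its own choosing.
my_index = {"some-universal-name": i3}    # hypothetical name
```

Neither option needs a global sequence number: a digest is computable locally, and a caller-chosen name is whatever the caller's naming scheme already guarantees.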

XDFS always chooses the most recent local version as a base for delta computation. RCS does something more complicated, but my experimental results show that the use of ancestry worsens storage efficiency. This is a counter-intuitive result. Not having to specify the ancestor version with every insertion really simplifies the storage interface (what version would you pick after a merge, anyway?).
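
A toy version of the "most recent local version is the base" policy might look like the sketch below. Everything in it is an assumption for illustration: the copy/insert delta is built with Python's difflib rather than the XDELTA algorithm, the Family class is hypothetical, and storing forward deltas (with the first version kept whole) is a simplification, not a statement about how XDFS lays out its storage. The point is only that insert() takes no ancestor argument.

```python
import difflib


def make_delta(base, target):
    """Encode target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" contributes nothing to the target
    return ops


def apply_delta(base, ops):
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class Family:
    """Hypothetical family of versions with a purely local sequence number."""

    def __init__(self):
        self.versions = []  # ("full", bytes) or ("delta", base_seq, ops)

    def insert(self, data):
        """Store data; the base is always the most recent local version."""
        if not self.versions:
            self.versions.append(("full", data))
        else:
            base_seq = len(self.versions) - 1
            ops = make_delta(self.read(base_seq), data)
            self.versions.append(("delta", base_seq, ops))
        return len(self.versions) - 1

    def read(self, seq):
        """Return the expanded contents of version seq."""
        entry = self.versions[seq]
        if entry[0] == "full":
            return entry[1]
        _, base_seq, ops = entry
        return apply_delta(self.read(base_seq), ops)


fam = Family()
v0 = fam.insert(b"int main() { return 0; }\n")
v1 = fam.insert(b"int main() { return 1; }\n")
v2 = fam.insert(b"int main(void) { return 1; }\n")
```

Because the base is chosen by the store, the caller never has to answer the "which ancestor after a merge?" question.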

> So: is the revise() interface above a sufficient interface, or would
> integrating with your stuff require something more? If so, can you say what
> that might be?
>
>
> Hmm. I'm also making a naive assumption, which is that the rsync() algorithm
> does a pretty good job of minimizing traffic across a wire if we assume that
> the store's delta generator did a decent job. The exception to this is that
> I may not want to sync all of a repository, so at some baseline I need to
> get something that is a complete file from which all of my deltas can be
> resolved.

I don't understand this.

> I'm not currently worrying much about optimizing the wire protocol. The
> current protocol has:
>
> get(universal-name) => byte-buffer
>
> The server is free to ship the response as either a delta or a complete
> file. I will shortly be adding:
>
> get-expanded(universal-name) => byte-buffer
>
> This can be used by the client to obtain a baseline version (undelta'd) of a
> file. It is the responsibility of the store to ensure that it is always
> capable of generating a file in expanded form.
>
> Your design appears to want very much to support:
>
> get-delta(from-universal-name, to-universal-name)
>
> I certainly see why this is useful, but I don't plan to leverage this in the
> first version of the DCMS wire protocol. The algorithms for deciding how to
> leverage a get-delta() function are non-trivial, and here again I'm after
> getting something that works first. I am inclined to suspect that
> get-delta() is a second order optimization on the wire transfer.

I think the folks using CVSup would disagree. If delta compression is second-order, what's first-order?

In XDFS you use Inodes as handles to the objects in storage. You can open an Inode and read it; that gives you the expanded object. You can also extract a delta by supplying two Inodes. So this is basically what you said, except that naming is divorced from the interface.
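
One way to picture naming being divorced from the storage interface is the sketch below. It is hypothetical throughout: InodeStore and NamedServer are made-up classes, the delta is a difflib-based stand-in for XDELTA, and the toy keeps every version expanded in memory. The store speaks only in inodes; a thin layer on top maps universal names to inodes and offers the get-expanded and get-delta operations from the quoted protocol.

```python
import difflib


class InodeStore:
    """Toy store addressed only by inode numbers; it knows nothing of names."""

    def __init__(self):
        self._data = {}
        self._next = 1

    def insert(self, data):
        inode = self._next
        self._next += 1
        self._data[inode] = data
        return inode

    def read(self, inode):
        """Opening and reading an inode yields the expanded object."""
        return self._data[inode]

    def extract_delta(self, from_inode, to_inode):
        """Return copy/insert instructions that turn one version into another."""
        base, target = self._data[from_inode], self._data[to_inode]
        matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
        delta = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                delta.append(("copy", i1, i2))
            elif tag != "delete":
                delta.append(("insert", target[j1:j2]))
        return delta


class NamedServer:
    """Name layer: maps universal names to inodes held by the store."""

    def __init__(self, store):
        self.store = store
        self.names = {}

    def put(self, name, data):
        self.names[name] = self.store.insert(data)

    def get_expanded(self, name):
        return self.store.read(self.names[name])

    def get_delta(self, from_name, to_name):
        return self.store.extract_delta(self.names[from_name],
                                        self.names[to_name])


server = NamedServer(InodeStore())
server.put("name-a", b"hello world\n")
server.put("name-b", b"hello brave world\n")
delta = server.get_delta("name-a", "name-b")
```

The names "name-a" and "name-b" are placeholders; the store itself would work unchanged under any naming scheme.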

-josh