> ... I will say that you seem to be using a stricter
> definition of 'text' than me. When I say 'text' I mean simply 7-bit ASCII,
> no control characters. In particular, it would be perfectly fine to insist
> on Unix-style line endings...
This relates to somebody else's comments on possibly using MIME types, so let me expand on my view of this in case it's helpful down the road.
I see a CM system as having a number of layers. At the bottom, it manages a bunch of objects purely in terms of bits. Its job is to get a consistent set of bit-bundles to the next layer up. At this layer, I feel there should be no interpretation of content, and consequently no typing. This is why I want a (conceptually) binary-only repository. My one exception to the "no interpretation" rule is that the repository at this layer is free to use compressed representations as a matter of internal storage convenience. Such compression should not (in principle) be externally visible -- not least because making it visible screws up delta handling.
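A minimal sketch of such a bottom layer, to make the "compression is invisible" point concrete. The class name BlobStore, the integer keys, and the choice of zlib are all assumptions made for illustration; none of this is taken from DCMS.

```python
import zlib


class BlobStore:
    """Hypothetical bottom layer: stores opaque byte strings, no typing."""

    def __init__(self):
        self._blobs = {}   # key -> compressed bytes (internal detail only)
        self._next_key = 0

    def put(self, data: bytes) -> int:
        # Compression is purely a storage convenience; callers never see it.
        key = self._next_key
        self._next_key += 1
        self._blobs[key] = zlib.compress(data)
        return key

    def get(self, key: int) -> bytes:
        # Callers always get back exactly the bytes they stored.
        return zlib.decompress(self._blobs[key])
```

The store never inspects the bytes: a text file, an image, and a file with mixed line endings are all handled identically.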
At the next conceptual layer up there is some input and output filtering that needs to happen. This handles things like newline canonicalization and variable expansion. To my mind, the execution of such filters is the job of the user agent. The repository server shouldn't have anything to do with it.
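As a rough illustration of what a user-agent filter pair might look like, here is a sketch of newline canonicalization. The function names, and the choice of LF as the canonical stored form, are assumptions; variable expansion is left out.

```python
def to_canonical(data: bytes) -> bytes:
    """Input filter (run by the user agent at commit): CRLF and lone CR become LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_local(data: bytes, local_newline: bytes = b"\r\n") -> bytes:
    """Output filter (run by the user agent at checkout).

    Canonicalizes first, so already-converted input is not doubled up,
    then applies the local platform's newline convention.
    """
    return to_canonical(data).replace(b"\n", local_newline)
```

Both functions run entirely on the client side; the repository only ever sees the bytes it is handed.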
At the level of MIME types, we run into a couple of problems. First, there isn't a good way to reliably extract MIME types from the content or fsName of an object. We can use heuristics, but ultimately this is the kind of thing that wants to be stored as a per-object attribute somewhere. I think the idea of being able to record MIME types and use them to drive things like visualizers is cool, but I think it's none of the repository's business. This attribution is purely a convention between the user agents. The repository's role is merely to store the attribute(s) along with the object.
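A sketch of that division of labor, under stated assumptions: AttributeStore stands in for the repository side and treats attributes as opaque key/value pairs, while agent_record_mime is a hypothetical user-agent helper that falls back to a filename heuristic (Python's standard mimetypes module) only when no explicit type is given. The attribute key "mime-type" is likewise just a convention invented here.

```python
import mimetypes


class AttributeStore:
    """Repository side: stores attributes per object, never interprets them."""

    def __init__(self):
        self._attrs = {}

    def set_attr(self, obj_name, key, value):
        self._attrs.setdefault(obj_name, {})[key] = value

    def get_attr(self, obj_name, key, default=None):
        return self._attrs.get(obj_name, {}).get(key, default)


def agent_record_mime(store, obj_name, fs_name, explicit=None):
    """User-agent side: decide on a MIME type and record it as an attribute."""
    guessed, _encoding = mimetypes.guess_type(fs_name)  # heuristic, may be None
    mime = explicit or guessed or "application/octet-stream"
    store.set_attr(obj_name, "mime-type", mime)
    return mime
```

The heuristic is only a fallback; the recorded attribute is what other user agents would consult.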
After a while, I concluded that the "type" of an object at the repository level consists of the set of trigger behaviors that the object exhibits in response to configuration actions (e.g. checkout, commit). This is different from the type of the content, though the two will in practice be closely related. The previously mentioned filters are an example of such triggers.
But consider this (real) example: I have one source tree that contains an autorun.inf file for a CD. This file is (to all appearances) a text file, but it must be preserved in MS newline format, even when used on a UNIX system. This doesn't change its MIME type at all.
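The distinction can be sketched as follows, from the point of view of a UNIX user agent. Everything here is hypothetical: the objects table, the trigger names, and the README entry are made up for illustration. The point is only that two objects with the same MIME type can carry different trigger sets.

```python
def force_lf(data: bytes) -> bytes:
    """Trigger: normalize line endings to UNIX style."""
    return data.replace(b"\r\n", b"\n")


def force_crlf(data: bytes) -> bytes:
    """Trigger: normalize line endings to MS style."""
    return force_lf(data).replace(b"\n", b"\r\n")


# Same content type, different repository-level "type" (set of triggers).
objects = {
    "README": {
        "mime-type": "text/plain",
        "triggers": {"checkout": force_lf, "commit": force_lf},
    },
    "autorun.inf": {
        "mime-type": "text/plain",
        "triggers": {"checkout": force_crlf, "commit": force_crlf},
    },
}


def run_trigger(name: str, action: str, data: bytes) -> bytes:
    """Apply the object's trigger for this action, if it has one."""
    trigger = objects[name]["triggers"].get(action)
    return trigger(data) if trigger else data
```

Checking out autorun.inf yields CRLF line endings even on UNIX, while README comes out with LF, although both are recorded as text/plain.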
So I basically think that the notion of "type" is quite complicated. One goal for DCMS is to have a platform-independent scripting language that lets triggers be written which behave the same on all platforms.
Sorry if this seems slightly disjointed -- my attention is divided at the moment.
> I'm glad to hear you agree with me that strong checksums are necessary. I'd
> settle for CRC or adler32, but SHA sure would be nice.
I'm not using checksums. I'm using cryptographic hashes. The difference is significant. CRCs and adler32's provide a good means of integrity checking, but they collide too frequently to serve as *names*. In the DCMS repository, the "repository name" of an object *is* its SHA-1 hash. One effect of this is to greatly simplify transaction and replication handling.
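A minimal sketch of hash-as-name, to show why it simplifies replication. ContentStore and replicate are hypothetical names and this is not the DCMS implementation; the only real API used is hashlib.sha1 from the Python standard library.

```python
import hashlib


class ContentStore:
    """Hypothetical store in which an object's name is the SHA-1 of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        name = hashlib.sha1(data).hexdigest()
        # Storing the same bytes twice is a no-op: same content, same name.
        self._objects.setdefault(name, data)
        return name

    def get(self, name: str) -> bytes:
        return self._objects[name]

    def names(self) -> set:
        return set(self._objects)


def replicate(src: ContentStore, dst: ContentStore) -> int:
    """Copy over whatever dst lacks; returns the number of objects copied."""
    missing = src.names() - dst.names()
    for name in missing:
        dst.put(src.get(name))
    return len(missing)
```

Because names are derived from content, two replicas can work out what to exchange by comparing name sets, and re-sending an object is harmless. A 32-bit CRC could not play this role: with only 2^32 possible values, accidental collisions become likely once a repository holds on the order of tens of thousands of objects.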
> Incidentally, where can I find more information about your DCMS?
There is a mailing list: dcms-dev@eros-os.org. At the moment there is no design document, though I am in the process of drafting one. Version 0.1 is still in progress.