configuration file format Jonathan S. Shapiro (shap@eros-os.org)
Thu, 13 Apr 2000 03:17:44 -0400

For a variety of reasons, I am partial to a binary file format for the configuration file. The main one is that otherwise people will be tempted to edit the damn things, and much depends on the configuration file being left alone. A secondary one is that a binary file generates the same SHA value on all platforms.

That said, it doesn't have to be a particularly "clever" binary format. Indeed, it *shouldn't* be clever. Based on the preceding note about attribute storage, what I have in mind for the configuration file is as follows:

    'D' 'C' 'M' 'S'       # identification
    'C' 'F' 'G' '0'       # header

    version-no            # file format version
    object-count          # number of entities
    off_t atable_offset   # start of atable

    entity-0
    entity-1
    ...
    entity-n

    atom-table

where each entity is encoded as:

    fs-name     # unicoded as len, nbytes
    true-name   # ascii-coded as len, nbytes
    n_attr      # number of attributes
    attr-name   # word
    attr-value  # word
    attr-name   # word
    attr-value  # word
    ...
    attr-name   # word
    attr-value  # word

All strings are zero-padded to a 4-byte boundary. Entities are written sorted according to fs-name. Attributes within a set are written sorted according to attr-name. Attribute values that cannot be expressed in a single word are stored in the atom table (really a string table), which has no particular sort.
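To make the entity encoding concrete, here is a minimal Python sketch. Several details are assumptions that the description above does not fix: a word is taken to be 32 bits, little-endian; the length field of a string is one such word; fs-name is encoded as UTF-8; and sorting uses Python's default ordering. The function names are hypothetical.

```python
import struct

WORD = 4  # assumed word size in bytes


def pad4(data: bytes) -> bytes:
    """Zero-pad data up to the next 4-byte boundary."""
    return data + b"\x00" * (-len(data) % WORD)


def encode_string(raw: bytes) -> bytes:
    """Encode as (len, nbytes): one length word, then the padded bytes."""
    return struct.pack("<I", len(raw)) + pad4(raw)


def encode_entity(fs_name: str, true_name: str, attrs: dict) -> bytes:
    """attrs maps an attr-name word to an attr-value word (plain ints here)."""
    out = encode_string(fs_name.encode("utf-8"))    # assumed Unicode encoding
    out += encode_string(true_name.encode("ascii"))
    out += struct.pack("<I", len(attrs))            # n_attr
    for name in sorted(attrs):                      # sorted by attr-name
        out += struct.pack("<II", name, attrs[name])
    return out


def encode_entities(entities: list) -> bytes:
    """entities is a list of (fs_name, true_name, attrs), written sorted by fs-name."""
    ordered = sorted(entities, key=lambda e: e[0])
    return b"".join(encode_entity(*e) for e in ordered)
```

Because both sorts happen at write time, the order in which a caller supplies entities or attributes has no effect on the bytes produced.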

The atom table is a sequence of binary strings of the form (len, bits). Each is padded to a word boundary. The first entry in the atom table is always (0, 0), and is reserved for the "null atom". Atom references are stored in the binary format as an index into the atom table.
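The atom table and its place in the file can be sketched as follows. This is only an illustration under stated assumptions: all header fields, including atable_offset, are 32-bit little-endian words; the offset is counted from the start of the file; and the null atom is written as a zero length word with no payload. The names `AtomTable` and `build_file` are hypothetical.

```python
import struct


class AtomTable:
    """Append-only table of binary strings; index 0 is the null atom."""

    def __init__(self):
        self.atoms = [b""]  # reserved null atom

    def add(self, value: bytes) -> int:
        """Store a value and return its index, used as the atom reference."""
        self.atoms.append(value)
        return len(self.atoms) - 1

    def encode(self) -> bytes:
        """Write each atom as (len, bits), zero-padded to a word boundary."""
        out = b""
        for atom in self.atoms:
            out += struct.pack("<I", len(atom))
            out += atom + b"\x00" * (-len(atom) % 4)
        return out


def build_file(version: int, object_count: int,
               entity_bytes: bytes, table: AtomTable) -> bytes:
    """Assemble magic, header words, entities, then the atom table."""
    magic = b"DCMS" + b"CFG0"
    header_size = len(magic) + 3 * 4            # three assumed 32-bit words
    atable_offset = header_size + len(entity_bytes)
    header = struct.pack("<III", version, object_count, atable_offset)
    return magic + header + entity_bytes + table.encode()
```

A reader can then seek to atable_offset and walk the table to resolve any atom reference found in an attribute value.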

The sorting rules ensure that a binary delta performed on the file arrives at a decently small change set. If file names were instead stored in the atom table, then the removal of an entry would have the side effect of removing an entry in the atom table, which would in turn cause all subsequent "name" values to change, leading to bad deltas. Attributes whose value cannot be expressed in a word should be sufficiently rare that I'm not really worried about them.
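The index-shifting problem is easy to see in a toy example. The sketch below models the rejected layout, in which every file name lives in the atom table and entities refer to names by index; the file names and the helper `name_indices` are made up for illustration.

```python
def name_indices(names):
    """Hypothetical layout: names stored in the atom table, index 0 is null."""
    return {name: i + 1 for i, name in enumerate(names)}


before = name_indices(["a.c", "b.c", "c.c", "d.c"])
after = name_indices(["a.c", "c.c", "d.c"])      # "b.c" has been removed

# Every name stored after the removed one now has a different reference.
changed = [n for n in after if after[n] != before[n]]
```

Here `changed` contains every name that followed the removed entry, so a single deletion rewrites the reference in each of those entities. With names stored inline, the same deletion removes one contiguous run of bytes and leaves the rest of the entity section untouched.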

My expectation is that this file is ordinarily stored in compressed form, and that the common attribute value sequences will therefore compress out provided that a canonical ordering for writing them is enforced.
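A short sketch of why the canonical ordering matters for both hashing and compression. It assumes SHA-1 as the hash and zlib as the compressor, neither of which is specified above, and again treats words as 32-bit little-endian. The helper `encode_attrs` is hypothetical.

```python
import hashlib
import struct
import zlib


def encode_attrs(attrs: dict) -> bytes:
    """Write n_attr followed by (attr-name, attr-value) words sorted by name."""
    out = struct.pack("<I", len(attrs))
    for name in sorted(attrs):
        out += struct.pack("<II", name, attrs[name])
    return out


# The same attribute set, built in two different insertion orders.
a = encode_attrs({3: 1, 1: 1, 2: 0})
b = encode_attrs({1: 1, 2: 0, 3: 1})

same_bytes = a == b
same_hash = hashlib.sha1(a).hexdigest() == hashlib.sha1(b).hexdigest()

# Many entities sharing one attribute sequence compress to very little.
repeated = a * 1000
packed = zlib.compress(repeated)
```

Without the sort, the two dictionaries would serialize differently, the hashes would diverge, and the compressor would see fewer repeated runs.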

shap