[E-Lang] Kernel-E in Minimal-XML

Jonathan S. Shapiro shap@cs.jhu.edu
Fri, 06 Oct 2000 10:30:52 -0400


[DCMS folks: this discussion started on the E list, where they are
discussing XML serialization. Since it's relevant here, I'm sending it
here too.]

Okay. I give up. I have what I think are good reasons for wanting full
XML for E, I think that it's not noticeably harder than MinXML, and
since I already have to solve this problem for PCMS I'm glad to provide
the framework. The only catch is that somebody will need to do the Java
conversion.

PCMS currently has a generalized serialization framework that will do
*either* binary or XML serialization. Originally I wanted binary, and I
think that for production use binary is preferable in PCMS, but XML is
sure damned handy for debugging. The framework could be augmented to do
Sanskrit, Ebonics, Morse Code, or even French (shudder :-) by rewriting
a small number of central routines. The good news is that the individual
object serializers don't know what the output format is -- changing the
language can be done entirely by rewriting the primitive wrappers.
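
A minimal sketch of that separation, in Python rather than the PCMS source language. The names XmlSink, BinarySink and serialize_point are hypothetical, and the 4-byte big-endian integer layout is an assumption made for illustration; it is not the actual PCMS binary format.

```python
import struct
from xml.sax.saxutils import escape


class XmlSink:
    """Hypothetical XML back end: emits <name ty="ul">value</name>."""

    def __init__(self):
        self.parts = []

    def put_ulong(self, name, value):
        self.parts.append('<%s ty="ul">%d</%s>' % (name, value, name))

    def put_string(self, name, text):
        self.parts.append(
            '<%s ty="str"><len ty="ul">%d</len><txt>%s</txt></%s>'
            % (name, len(text), escape(text), name)
        )

    def result(self):
        return "".join(self.parts)


class BinarySink:
    """Hypothetical binary back end (assumed layout: 4-byte big-endian)."""

    def __init__(self):
        self.parts = []

    def put_ulong(self, name, value):
        # Field names are dropped; only the value is written.
        self.parts.append(struct.pack(">I", value))

    def put_string(self, name, text):
        data = text.encode("ascii")
        self.parts.append(struct.pack(">I", len(data)) + data)

    def result(self):
        return b"".join(self.parts)


def serialize_point(sink, x, y, label):
    # Object serializer: it never inspects which sink it was given.
    sink.put_ulong("x", x)
    sink.put_ulong("y", y)
    sink.put_string("label", label)
    return sink.result()
```

Swapping the output language means passing a different sink; serialize_point itself does not change.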

The serializer does not require and could not take advantage of a DTD.
It is in the nature of DTDs that they are application-specific, though
I suppose that one could build a meta-DTD. It follows from the absence
of a DTD that the serializer is NOT robust to version changes. I need to
augment the current format to output a version number so that this can
be fail fast, and eventually I'll need to bite the bullet and deal with
the issue. In PCMS the "solution" is to assign a different type code to
each version of an object (they are, after all, different objects). This
will let you deserialize an old object graph successfully, after which
conversion is the application's problem. Note that this is essential in
PCMS, as the crypto-hash names imply that you cannot perform in-place
conversion on the content. In a CM system you shouldn't do so anyway.
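
A rough sketch of the type-code-per-version idea, in Python. The code 17, the reader functions and the field sets are all hypothetical (only the value 2 for a Branch appears in the sample later in this message); the point is just that an unknown code fails fast and that upgrading builds a new object.

```python
# Hypothetical type codes: each version of Branch gets its own code.
BRANCH_V1 = 2
BRANCH_V2 = 17


def read_branch_v1(fields):
    return {"version": 1, "name": fields["name"]}


def read_branch_v2(fields):
    return {
        "version": 2,
        "name": fields["name"],
        "nickname": fields["nickname"],
    }


READERS = {BRANCH_V1: read_branch_v1, BRANCH_V2: read_branch_v2}


def deserialize(fields):
    code = fields["obtype"]
    reader = READERS.get(code)
    if reader is None:
        # Fail fast on a type code this build does not know about.
        raise ValueError("unknown object type code: %r" % (code,))
    return reader(fields)


def upgrade_branch(obj):
    # Conversion is the application's job. It returns a new object and
    # leaves the old one alone, since the stored content cannot change.
    if obj["version"] == 1:
        return {"version": 2, "name": obj["name"], "nickname": ""}
    return obj
```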

The following describes the XML serializer:

At the most primitive level, the serializer knows how to output unsigned
long, C string (which is output with its length) and raw byte-sequence.
All other serialization is built on top of these primitives. For Java
one would also want bignum and signed integer. For PCMS I am tempted to
make TrueName a distinguished type of string.

The "design pattern" of the output looks like

    <memberName ty="TYPE">VALUE</memberName>

Examples:

    output field myLen, an unsigned long whose value is 5:

    <myLen ty="ul">5</myLen>

    output field myName, a string whose value is "sometext":

    <myName ty="str" len="8">
      <len ty="ul">8</len>
      <txt>sometext</txt>
    </myName>

    (I just noticed the inconsistency around the <txt> element)
    Note that '<', '>', and '&' are converted to/from
    &lt;, &gt; and &amp;
    Note that this format does NOT work for Unicode strings
    because of embedded zeros. I will either convert those to
    &0; or I will use hexy format in the future and introduce a new
    string type.

    output field someBits, a binary value with 10 bytes:

    <someBits ty="bin" fmt="hex">abcd0102034056...</someBits>

    (I need to augment that with a length field to avoid unnecessary
     memory allocations but the current parse is unambiguous. At
     present it is the client's responsibility to somehow "know" the
     expected length. This should perhaps have been done more like
     strings, but the goal here was to avoid unnecessary memory
     allocations. I need to set up Ping's roundup tool so I can track
     such things...)
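
A minimal Python sketch of the text escaping and the hex form of binary values described above. The function names are hypothetical. It assumes that the string length counts characters before escaping, which the examples do not settle, and it has the caller pass in the expected length, as the binary case currently requires.

```python
from xml.sax.saxutils import escape, unescape


def encode_txt(text):
    # escape() rewrites '&', '<' and '>' as &amp;, &lt; and &gt;.
    return escape(text)


def decode_txt(body, expected_len):
    text = unescape(body)
    # Assumption: the length is measured on the unescaped text.
    if len(text) != expected_len:
        raise ValueError("string length mismatch")
    return text


def encode_hex(data):
    # Two lowercase hex digits per byte, as in the someBits example.
    return data.hex()


def decode_hex(body, expected_len):
    data = bytes.fromhex(body)
    # The caller has to know the expected length in advance.
    if len(data) != expected_len:
        raise ValueError("binary length mismatch")
    return data
```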

Here is this week's serialization of a branch object:

<BRANCH ty="Branch">
  <obtype ty="ul">2</obtype>
  <serTrueName ty="str">
    <len ty="ul">45</len>
    <txt>seq0.79fa87d31768570c917866011d31574d1479b4dc</txt>
  </serTrueName>
  <name ty="str">
    <len ty="ul">14</len>
    <txt>Primary Branch</txt>
  </name>
  <nickname ty="str">
    <len ty="ul">4</len>
    <txt>main</txt>
  </nickname>
  <creator ty="str">
    <len ty="ul">10</len>
    <txt>Test Agent</txt>
  </creator>
  <description ty="str">
    <len ty="ul">100</len>
    <txt>This is the main branch for the project. It was automatically created
when the project was created.
</txt>
  </description>
  <changes ty="StrVec">
    <obtype ty="ul">9</obtype>
    <len ty="ul">0</len>
  </changes>
</BRANCH>

Note two things about this:

1. The StrVec at the bottom captures the normal serialization of an
object.
2. The all-capped BRANCH at the start indicates that the root object was
a BRANCH. For the XML-aware, think of this as the document root tag. It
is not widely appreciated that the DOCTYPE declaration of an XML file
names both the DTD and the root element for that document. One can
therefore have a single DTD that describes all object types.

The current library simply ignores all attributes on input. I'm planning
to revise this before release 0.2.
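
To illustrate reading by element name alone, here is a sketch in Python using the standard xml.etree.ElementTree module. The helper names and the trimmed-down sample are hypothetical, and treating the <len> inside <changes> as an element count is an assumption. No attribute is ever consulted.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the branch sample above.
SAMPLE = """<BRANCH ty="Branch">
  <obtype ty="ul">2</obtype>
  <nickname ty="str">
    <len ty="ul">4</len>
    <txt>main</txt>
  </nickname>
  <changes ty="StrVec">
    <obtype ty="ul">9</obtype>
    <len ty="ul">0</len>
  </changes>
</BRANCH>"""


def read_ulong(parent, name):
    return int(parent.find(name).text)


def read_string(parent, name):
    field = parent.find(name)
    text = field.find("txt").text or ""
    # The <len> child is used only as a consistency check here.
    if len(text) != read_ulong(field, "len"):
        raise ValueError("bad length for field %r" % (name,))
    return text


def read_branch(xml_text):
    root = ET.fromstring(xml_text)
    # The root tag, not the ty attribute, says what is being read.
    if root.tag != "BRANCH":
        raise ValueError("unexpected root element %r" % (root.tag,))
    return {
        "obtype": read_ulong(root, "obtype"),
        "nickname": read_string(root, "nickname"),
        "change_count": read_ulong(root.find("changes"), "len"),
    }
```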

I suspect that this is "good enough" for use by E if only somebody will
translate the code to Java. With introspection, it should be possible to
construct a generic object serializer, which would make the entire thing
MUCH easier.
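
As a rough illustration of what introspection buys, here is a sketch in Python, where dataclasses.fields stands in for Java reflection. The Branch class below and its revision field are hypothetical, and only non-negative integers and strings are handled.

```python
from dataclasses import dataclass, fields
from xml.sax.saxutils import escape


@dataclass
class Branch:
    name: str
    nickname: str
    revision: int  # hypothetical field, for illustration only


def emit_field(name, value):
    # bool is excluded because it is a subclass of int in Python.
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return '<%s ty="ul">%d</%s>' % (name, value, name)
    if isinstance(value, str):
        return (
            '<%s ty="str"><len ty="ul">%d</len><txt>%s</txt></%s>'
            % (name, len(value), escape(value), name)
        )
    raise TypeError("no serializer for field %r" % (name,))


def serialize(obj):
    # Walk the declared fields instead of hand-writing one
    # serializer per class.
    type_name = type(obj).__name__
    root = type_name.upper()
    body = "".join(
        emit_field(f.name, getattr(obj, f.name)) for f in fields(obj)
    )
    return '<%s ty="%s">%s</%s>' % (root, type_name, body, root)
```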

The main "feature" of the current mechanism and interface is that it is
format neutral. The binary serializer also works. Also, the framework can
serialize either to/from files or to/from strings.

The most significant limitation of the current mechanism is that it does
not attempt to remember object identities. The PCMS object formats prove
to be a pure hierarchy, so it wasn't necessary to do identity
reconstruction. Doing identity reconstruction can be handled relatively
easily if all serializations start with a single top-level object, or if
it is assured that the object will always have been deserialized at the
time its pointer is deserialized. The latter constraint is probably the
better one. If helpful, I will augment the PCMS framework to do this
(which will give me a generalized library -- cool!) for the benefit of
those in the conversion business.
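
A minimal sketch of the "object before its pointer" constraint, in Python. Node, write_graph and read_graph are hypothetical names, and the sketch assumes the object graph is acyclic (shared objects are fine, cycles are not). Objects are written children-first, so every pointer in a record names an object that has already been read.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def write_graph(root):
    records = []  # one (serial, label, child_serials) tuple per object
    serials = {}  # id(node) -> serial number already assigned

    def visit(node):
        key = id(node)
        if key in serials:
            # Seen before: hand back the pointer, write nothing new.
            return serials[key]
        child_serials = [visit(child) for child in node.children]
        serial = len(records)
        serials[key] = serial
        records.append((serial, node.label, child_serials))
        return serial

    visit(root)
    return records


def read_graph(records):
    table = []
    for serial, label, child_serials in records:
        # Each child serial is smaller than this record's serial, so
        # the object it points at is already in the table.
        table.append(Node(label, [table[i] for i in child_serials]))
    # The top-level object is written last.
    return table[-1]
```

A shared object is written once and comes back as a single object on the reading side, which is the identity property the current framework does not yet preserve.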


shap