[e-lang] XML library: requirements gathering

Kevin Reid kpreid at mac.com
Sun Jan 17 12:44:30 PST 2010


On Jan 17, 2010, at 14:03, Thomas Leonard wrote:
> 2010/1/17 Kevin Reid <kpreid at mac.com>:
>> I'm currently implementing it by wrapping DOM trees with an immutable
>> interface, as this seems like both the simplest path and one which
>> minimizes the amount of (currently slow) E code executed in the high-
>> repeat-count paths.
>
> Makes sense. We'd probably want methods to get the underlying Document
> (or a copy, more likely) so we can pass it to existing Java code and
> wrap the results again.

Perhaps this should be part of a generic input-output scheme, like the  
Source/Result interfaces that javax.xml.transform uses. Or not.

> Sounds good. How do namespaces combine? Sometimes you have a lot of
> child elements using the same namespace and it's handy if it ends up
> as a single namespace declaration on the root element. Are you
> planning to preserve prefixes in any way (or just auto-number them as
> xmlns:n0, xmlns:n1, etc)? In the past I've used a scheme where we keep
> prefix mappings as a hint when parsing the XML and use them if
> possible when serialising, but combine multiple prefixes into one
> where possible or create new prefixes where there are conflicts so
> that we end up with all the mappings defined on the root element.

In the long run, this should be an option; currently, it will be  
"whatever Java does" if Java does in fact ensure namespace  
consistency. In general, I will preserve prefixes, since XML Infoset  
says they are significant. Note that e.g. XSLT documents make  
references to prefixes inside of attribute values (XPath expressions)  
which makes renaming hairy.
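
To illustrate the renaming hazard, here is a minimal sketch in Python (chosen only for brevity; it is not the Java DOM code discussed in this thread). It assumes Python's standard xml.etree.ElementTree, which renumbers prefixes on output, and a made-up namespace URI urn:example:items. The prefix inside the match attribute survives as plain text while its declaration does not:

```python
import xml.etree.ElementTree as ET

# "urn:example:items" is a hypothetical namespace used only for illustration.
source = (
    '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
    'xmlns:x="urn:example:items" version="1.0">'
    '<xsl:template match="x:item"/>'
    '</xsl:stylesheet>'
)

# ElementTree does not keep the original prefixes; it assigns its own
# (ns0, ns1, ...) and only declares namespaces used in element or
# attribute names. The XPath in match="..." is just an opaque string to it.
serialized = ET.tostring(ET.fromstring(source), encoding="unicode")
print(serialized)
```

After the round trip the match attribute still refers to the x prefix, which is no longer declared anywhere, so the stylesheet has silently changed meaning.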

>> * It is possible to construct a customized XML quasiparser with a
>> given set of namespace declarations.
>>
>> * There are means to traverse and pattern-match XML trees. Currently I
>> have two plans in mind for this:
>>   1. XPath expressions can be used as subscripts. Example:
>>     ? xml`<a>xyz</a><b>foo</b><a>bar</a>`[xpath`a/text()`]
>>     # value: [xml`xyz`, xml`bar`]
>>   (This is a working-right-now example.)
>>
>>   2. Pattern matching, as offered currently by term-trees:
>>     def xml`<input type="text" name="@name" value="@value">` := elem
>
> Both this and XPath would be very useful. Would this work?
>
> def xml`<@elementName @attrs*>@content</@elementName>` := ...
>
> When matching on XML I mostly want to ignore unmatched attributes, but
> I guess adding @_* would be OK, to stay consistent with the rest of E.

This is the general sort of syntax I want to support. You'd have to  
write </> or </...> rather than repeating @elementName; E doesn't  
support that style of pattern matching.

The tricky part is supporting the quasi-holes while still using a  
standard XML parser (for ease of implementation, efficiency, and well- 
tested correctness). This means I substitute into each hole in the  
template a unique value which I can find in the result tree afterward.  
However, these values need to be valid XML syntax in the relevant  
context.

The contexts I intend to support are:
   * General XML content: <foo>@bar</foo>
   * Element name: <@bar/>
   * Attributes: <foo @bar>
   * Attribute values: <foo baz="@bar">
   * Processing instruction content: <?@foo bla bla @bar?>

The basic strategy I plan to use for the values is to generate a  
pseudorandom string, check that it isn't in the input, then substitute  
it in and look for it in the output. The tricky part is constructing a  
string which is valid in the context. I don't think it's possible to  
generate one from purely local context, because there is no string  
which is well-formed both as an attribute and as an attribute value.
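
The strategy above can be sketched roughly as follows. This is Python rather than E or Java, purely for illustration, and it handles only the content and attribute-value contexts. The names fresh_marker, parse_with_holes and is_well_formed, and the "hole_" token format, are hypothetical and not part of any existing library:

```python
import secrets
import xml.etree.ElementTree as ET

def fresh_marker(template: str) -> str:
    """Return a pseudorandom, XML-name-safe token that does not occur in template."""
    while True:
        marker = "hole_" + secrets.token_hex(8)
        if marker not in template:
            return marker

def parse_with_holes(template: str, hole: str = "@bar"):
    """Swap the hole for a marker, parse with a stock parser, then find the marker."""
    marker = fresh_marker(template)
    root = ET.fromstring(template.replace(hole, marker))
    found = []
    for elem in root.iter():
        if elem.text == marker:
            found.append(("content", elem.tag))
        for name, value in elem.attrib.items():
            if value == marker:
                found.append(("attribute value", name))
    return found

def is_well_formed(text: str) -> bool:
    """Report whether the stock parser accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

content_hits = parse_with_holes("<foo>@bar</foo>")
value_hits = parse_with_holes('<foo baz="@bar"/>')

# A bare token is fine as an attribute value but not as a whole attribute;
# a name="value" pair is fine as an attribute but not inside a value.
bare_as_attr = is_well_formed("<foo hole_x/>")
pair_as_attr = is_well_formed('<foo hole_x=""/>')
pair_as_value = is_well_formed('<foo baz="hole_x=""/>')
```

The last three checks show the conflict: the substituted text has to differ between the attribute and attribute-value contexts, so something must first work out which context each hole is in.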

Therefore, the hole support will need a custom lexer, which I will  
write in Java for practical efficiency. ...On the other hand, maybe I  
should just take an existing XML parser and extend it to support holes?

For now, XPath should be sufficient to extract data values from XML.
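
For comparison, a rough Python analogue of the xpath`a/text()` example quoted above. It uses the limited path support in xml.etree.ElementTree, which has no text() step, so the text is read from each matched element instead. The wrapping root element is an assumption added because this parser needs a single document element:

```python
import xml.etree.ElementTree as ET

fragment = "<a>xyz</a><b>foo</b><a>bar</a>"

# Wrap the fragment so that it parses as one document.
root = ET.fromstring("<root>" + fragment + "</root>")

# Select the <a> children, then take their text content.
texts = [a.text for a in root.findall("a")]
print(texts)
```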

-- 
Kevin Reid                                  <http://switchb.org/kpreid/>


