[e-lang] XML library: requirements gathering
Kevin Reid
kpreid at mac.com
Sun Jan 17 12:44:30 PST 2010
On Jan 17, 2010, at 14:03, Thomas Leonard wrote:
> 2010/1/17 Kevin Reid <kpreid at mac.com>:
>> I'm currently implementing it by wrapping DOM trees with an immutable
>> interface, as this seems like both the simplest path and one which
>> minimizes the amount of (currently slow) E code executed in the high-
>> repeat-count paths.
>
> Makes sense. We'd probably want methods to get the underlying Document
> (or a copy, more likely) so we can pass it to existing Java code and
> wrap the results again.
Perhaps this should be part of a generic input-output scheme, like
javax.xml.transform uses. Or not.
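
To make the wrapping idea concrete, here is a minimal sketch, written in Python with xml.dom.minidom rather than E over the Java DOM, purely for illustration. The class name FrozenNode and its methods are hypothetical and are not the API of the library under discussion. It assumes the wrapper copies the tree on the way in and hands out only copies, as suggested above.

```python
from xml.dom import minidom


class FrozenNode:
    """Read-only view over a DOM node (hypothetical name, not the E API)."""

    __slots__ = ("_node",)

    def __init__(self, node, _copy=True):
        # Copy on the way in so later changes to the caller's tree cannot leak in.
        object.__setattr__(self, "_node", node.cloneNode(True) if _copy else node)

    def __setattr__(self, name, value):
        raise AttributeError("FrozenNode is immutable")

    @property
    def name(self):
        return self._node.nodeName

    def children(self):
        # Children share the private copy, so no further cloning is needed.
        return tuple(FrozenNode(c, _copy=False) for c in self._node.childNodes)

    def to_xml(self):
        return self._node.toxml()

    def to_dom(self):
        # Hand out a fresh mutable copy, never the wrapped node itself.
        return self._node.cloneNode(True)


doc = minidom.parseString("<a><b>foo</b></a>")
frozen = FrozenNode(doc.documentElement)
doc.documentElement.setAttribute("x", "1")  # mutating the original ...
copy = frozen.to_dom()
copy.setAttribute("y", "2")                 # ... or a handed-out copy
print(frozen.to_xml())                      # the wrapped tree is unaffected
```

Existing DOM-based code would receive the result of to_dom(), and whatever it returns would be wrapped again.
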
> Sounds good. How do namespaces combine? Sometimes you have a lot of
> child elements using the same namespace and it's handy if it ends up
> as a single namespace declaration on the root element. Are you
> planning to preserve prefixes in any way (or just auto-number them as
> xmlns:n0, xmlns:n1, etc)? In the past I've used a scheme where we keep
> prefix mappings as a hint when parsing the XML and use them if
> possible when serialising, but combine multiple prefixes into one
> where possible or create new prefixes where there are conflicts so
> that we end up with all the mappings defined on the root element.
In the long run, this should be an option; currently, it will be
"whatever Java does" if Java does in fact ensure namespace
consistency. In general, I will preserve prefixes, since XML Infoset
says they are significant. Note that e.g. XSLT documents make
references to prefixes inside of attribute values (XPath expressions)
which makes renaming hairy.
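
As an illustration of the renaming hazard, here is a small Python sketch using xml.etree.ElementTree, which is known to regenerate prefixes when serialising. The namespace URI and the element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The attribute value refers to the prefix "p", in the way an XPath
# expression inside an XSLT attribute would.
source = '<p:sheet xmlns:p="urn:example:p" select="p:row/p:cell"/>'

# ElementTree keeps only namespace URIs, so it invents a new prefix on output.
rewritten = ET.tostring(ET.fromstring(source), encoding="unicode")
print(rewritten)

# The declaration for "p" is gone, but the attribute text is copied verbatim,
# so it now refers to a prefix that is no longer declared.
print("xmlns:p=" in rewritten, 'select="p:row/p:cell"' in rewritten)
```

A serialiser that renames prefixes would also have to understand and rewrite such attribute values, which is the hairy part.
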
>> * It is possible to construct a customized XML quasiparser with a
>> given set of namespace declarations.
>>
>> * There are means to traverse and pattern-match XML trees. Currently I
>> have two plans in mind for this:
>> 1. XPath expressions can be used as subscripts. Example:
>> ? xml`<a>xyz</a><b>foo</b><a>bar</a>`[xpath`a/text()`]
>> # value: [xml`xyz`, xml`bar`]
>> (This is a working-right-now example.)
>>
>> 2. Pattern matching, as offered currently by term-trees:
>> def xml`<input type="text" name="@name" value="@value">` := elem
>
> Both this and XPath would be very useful. Would this work?
>
> def xml`<@elementName @attrs*>@content</@elementName>` := ...
>
> When matching on XML I mostly want to ignore unmatched attributes, but
> I guess adding @_* would be OK, to stay consistent with the rest of E.
This is the general sort of syntax I want to support.
You'd have to write </> or </...>, not repeat @elementName; E doesn't
support that style of pattern matching.
The tricky part is supporting the quasi-holes while still using a
standard XML parser (for ease of implementation, efficiency, and well-
tested correctness). This means I substitute into each hole in the
template a unique value which I can find in the result tree afterward.
However, these values need to be valid XML syntax in the relevant
context.
The contexts I intend to support are:
* General XML content: <foo>@bar</foo>
* Element name: <@bar/>
* Attributes: <foo @bar>
* Attribute values: <foo baz="@bar">
* Processing instruction content: <?@foo bla bla @bar?>
The basic strategy I plan to use for the values is to generate a
pseudorandom string, check that it isn't in the input, then substitute
it in and look for it in the output. The hard part is constructing a
string which is valid in the context. I don't think it's possible to
generate one from purely local context, because there is no string which
is well-formed both as an attribute and as an attribute value.
Therefore, the hole support will need a custom lexer, which I will
write in Java for practical efficiency. ...On the other hand, maybe I
should just take an existing XML parser and extend it to support holes?
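
The substitution strategy described above can be sketched roughly as follows, in Python with xml.dom.minidom standing in for the Java parser. The helper names make_marker and find_marker are hypothetical, and plain "@bar" text replacement is an assumed simplification of real quasi-hole handling. The sketch covers only the content and attribute-value contexts, and then shows the same bare token being rejected in attribute position.

```python
import secrets
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def make_marker(template):
    """Return a random, XML-name-safe token that does not occur in template."""
    while True:
        marker = "hole" + secrets.token_hex(8)
        if marker not in template:
            return marker


def find_marker(node, marker, path=()):
    """Yield (context, path) for each place the marker appears in the tree."""
    if node.nodeType == node.TEXT_NODE and marker in node.data:
        yield ("content", path)
    elif node.nodeType == node.ELEMENT_NODE:
        here = path + (node.tagName,)
        for name, value in node.attributes.items():
            if marker in value:
                yield ("attribute value", here + ("@" + name,))
        for child in node.childNodes:
            yield from find_marker(child, marker, here)


template = '<foo baz="@bar">@bar</foo>'
marker = make_marker(template)

# Substitute the marker for the hole, parse with a stock parser, then
# search the resulting tree for the marker.
doc = minidom.parseString(template.replace("@bar", marker))
hits = list(find_marker(doc.documentElement, marker))
print(hits)

# The same bare token is not well-formed in attribute position (<foo @bar/>).
try:
    minidom.parseString("<foo %s/>" % marker)
    attribute_position_ok = True
except ExpatError:
    attribute_position_ok = False
print(attribute_position_ok)
```

A token of the form name="value" would parse in attribute position, but it could not in general be dropped inside a quoted attribute value, so the substitution has to know which context each hole is in.
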
For now, XPath should be sufficient to extract data values from XML.
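
For comparison, a rough Python analogue of the XPath subscript example quoted earlier, using xml.etree.ElementTree. Its XPath subset has no text() step, so the text is read from the matched elements instead, and the fragment is wrapped in an invented <root> element because the parser requires a single root.

```python
import xml.etree.ElementTree as ET

# Same content as the quasi-literal in the example, wrapped in a single root.
root = ET.fromstring("<root><a>xyz</a><b>foo</b><a>bar</a></root>")

# Select the <a> children and collect their text, roughly what a/text() does.
values = [elem.text for elem in root.findall("a")]
print(values)
```
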
--
Kevin Reid <http://switchb.org/kpreid/>