html2txt in E: Sanity Check
Mark S. Miller
markm@caplet.com
Thu, 15 Apr 1999 18:03:47 -0700
The following code won't work for you until I release 0.8.2 (soon, soon),
but I think its meaning should be clear. Since many of you know html (both
what's standard and what's practiced) vastly better than I do, I was
wondering if y'all could give me a sanity check on this. It does seem to work
on the files I've tried it on, but I was by no means exhaustive. Btw, how
would you write it in Perl or Python, and how does this compare?
I hope the only thing non-obvious in the following code is the use of
quasi-literal patterns and expressions. All text between quasi-quotes (`)
is fed to a quasi-parser, except embedded $<identifier>, ${<expression>},
@<identifier>, and @{<pattern>}. Instead, the embedded expression following
the $ is evaluated and substituted in, and the parts extracted from the
specimen at positions corresponding to @s are matched against the embedded
patterns. When such a pattern is simply a name, this defines that name to
hold that part of the specimen.
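
For readers who think in Python, the matching half of the "simple" quasi-parser can be roughly imitated as below. This is only a sketch under assumptions: the name simple_match is made up for illustration, only the @<identifier> form is handled, and each hole is assumed to take the shortest text that lets the rest of the pattern match, which may not be exactly what E does.

```python
import re

def simple_match(pattern, specimen):
    """Match specimen against a literal pattern containing @name holes.

    Returns a dict mapping each hole name to the text it matched,
    or None if the specimen does not fit the pattern.
    """
    # With a capturing group, re.split alternates literal text and hole names.
    parts = re.split(r"@(\w+)", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text, matched as-is
        else:
            regex += "(?P<%s>.*?)" % part     # hole: shortest match (assumption)
    m = re.fullmatch(regex, specimen, re.DOTALL)
    return m.groupdict() if m else None
```

With this sketch, simple_match("&@entity;", "&amp;") binds entity to "amp", much as the pattern `&@entity;` does in the code at the end of this message. The $ direction is plain string substitution, which Python would spell "&%s;" % entity.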
The quasi-parser is designated by the optional name to the left of the first
quasi-quote. If left out, it defaults to "simple", the simplest
quasi-parser, whose string substitution and matching rules are as literal as
possible within the above framework. The other quasi-parser used below is
"rx", a full Perl-5 regular expression engine from "http://www.oroinc.com",
wrapped (by E code) to turn it into a proper quasi-parser.
http://www.oroinc.com/developers/docs/OROMatcher/Syntax.html will tell you
everything I needed to know about Perl 5 regular expression syntax (and has
a link to the official explanation).
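
To give a feel for what the "rx" wrapping adds on top of a plain regular expression, here is a rough Python analogue. It is a sketch under assumptions, not a description of the real wrapper: rx_match is a hypothetical name, a hole written (@name...) is simply rewritten into a Python named group, $$ is taken to stand for a literal regex $, and the pattern is searched for rather than required to cover the whole specimen.

```python
import re

def rx_match(pattern, specimen):
    """Imitate an rx quasi-pattern using Python's re module.

    Returns a dict of the named parts, or None when there is no match.
    """
    # (@name ...) becomes the named group (?P<name> ...)
    regex = re.sub(r"\(@(\w+)", r"(?P<\1>", pattern)
    # $$ is the quasi-literal escape for a single $ (assumption)
    regex = regex.replace("$$", "$")
    m = re.search(regex, specimen)
    return m.groupdict() if m else None
```

For example, rx_match(r"^#(@digits[0-9]+)$$", "#65") binds digits to "65", and the three-part pattern used in html2txt below splits "a<b>c" into text, thing, and rest.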
I'm doing this now, because I need to start treating my html pages as updoc
source, and regression-testing them. It's the only way I can keep the
documentation and the implementation in synch. And who knows, with enough
testing, I may even discover a new bug.
I'm also gearing up to write the chapter on Quasi-Literals, and thought
this might be a nice advanced example.
Cheers,
--MarkM
define parseInt(str) {
    import:java.lang.Integer parseInt(str)
}
define entity2txt(entity) {
    switch (entity) {
        match =="lt" { "<" }
        match =="gt" { ">" }
        match =="quot" { "\"" }
        match =="amp" { "&" }
        # there is a unicode non-breaking space character,
        # but we use ascii space instead. This means our
        # transformation isn't reversible
        match =="nbsp" { " " }
        # does html accept other radices?
        match rx`^#(@digits[0-9]+)$$` {
            "" + parseInt(digits) asChar
        }
        match _ { `&$entity;` }
    }
}
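
On the radix question in the comment above: HTML 4 does also allow hexadecimal numeric references, written with an x after the # (for example &#x41;), which the decimal-only pattern above would pass through untouched. A minimal Python sketch of decoding both forms follows; numeric_ref2txt is a made-up name, and the argument is assumed to be the text between the & and the ;, already known to start with #.

```python
def numeric_ref2txt(ref):
    """Decode the body of a numeric character reference.

    ref is the text between '&' and ';', e.g. '#65' or '#x41'.
    """
    if ref[:2] in ("#x", "#X"):
        return chr(int(ref[2:], 16))   # hexadecimal form
    return chr(int(ref[1:], 10))       # decimal form
```

Both "#65" and "#x41" come back as "A".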
define html2txt(html) {
    define result := ""
    while (html != "") {
        switch (html) {
            match rx`(?ms)(@text.*?)(@thing(<.*?>|&.*?;))(@rest.*)` {
                result += text
                switch (thing) {
                    match `<@_>` {}
                    match `&@entity;` {
                        result += entity2txt(entity)
                    }
                }
                html := rest
            }
            match _ {
                result += html
                html := ""
            }
        }
    }
    result
}
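
As one possible starting point for the Python side of the comparison, here is a rough transliteration. It is a sketch, not a considered answer: it assumes Python 3 and its standard re module, keeps the same five named entities and decimal-only numeric references as the E code, and replaces the explicit loop with a single substitution pass, which should visit the same tags and entities in the same order.

```python
import re

# Same named entities as entity2txt above; nbsp becomes a plain space.
NAMED = {"lt": "<", "gt": ">", "quot": '"', "amp": "&", "nbsp": " "}

def entity2txt(entity):
    """Translate the text between '&' and ';' into plain text."""
    if entity in NAMED:
        return NAMED[entity]
    m = re.fullmatch(r"#([0-9]+)", entity)
    if m:
        return chr(int(m.group(1)))
    return "&" + entity + ";"          # unknown entity: leave it as written

# A "thing" is either a tag or an entity, as in the rx pattern above.
THING = re.compile(r"<.*?>|&(.*?);", re.DOTALL)

def html2txt(html):
    """Drop tags and decode entities, leaving all other text alone."""
    def replace(m):
        if m.group(0).startswith("<"):
            return ""                  # a tag: discard it
        return entity2txt(m.group(1))  # an entity: decode it
    return THING.sub(replace, html)
```

Here html2txt("<p>a &lt; b &amp;&#65;</p>") gives "a < b &A". The main difference in flavor is that Python spreads the work between a regular expression and ordinary if statements, where the E version expresses both the splitting and the dispatch as pattern matches.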