html2txt in E: Sanity Check

Mark S. Miller markm@caplet.com
Thu, 15 Apr 1999 18:03:47 -0700


The following code won't work for you until I release 0.8.2 (soon, soon), 
but I think its meaning should be clear.  Since many of you know html (both 
what's standard and what's practiced) vastly better than I do, I was 
wondering if y'all could give me a sanity check on this.  It does seem to work 
on the files I've tried it on, but I was by no means exhaustive.  Btw, how 
would you write it in Perl or Python, and how does this compare?

I hope the only non-obvious thing in the following code is the use of 
quasi-literal patterns and expressions.  All text between quasi-quotes (`) 
is fed to a quasi-parser, except embedded $<identifier>, ${<expression>}, 
@<identifier>, and @{<pattern>}.  Instead, the embedded expression following 
the $ is evaluated and substituted in, and the parts extracted from the 
specimen at positions corresponding to @s are matched against the embedded 
patterns.  When such a pattern is simply a name, this defines that name to 
hold that part of the specimen.

The quasi-parser is designated by the optional name to the left of the first 
quasi-quote.  If left out, it defaults to "simple", the simplest 
quasi-parser, whose string substitution and matching rules are as literal as 
possible within the above framework.  The other quasi-parser used below is 
"rx", a full Perl-5 regular expression engine from "http://www.oroinc.com", 
wrapped (by E code) to turn it into a proper quasi-parser. 
http://www.oroinc.com/developers/docs/OROMatcher/Syntax.html will tell you 
everything I needed to know about Perl 5 regular expression syntax (and has 
a link to the official explanation).
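
For readers more at home with regular expressions, the matching side of the 
"simple" quasi-parser can be roughly approximated in Python.  This is only a 
sketch under stated assumptions: simple_match is a hypothetical helper, not 
part of E, and matching each @ hole lazily is a guess rather than E's 
documented rule.  It handles only @<identifier> holes and ignores $ 
substitution entirely.

```python
import re

def simple_match(pattern, specimen):
    """Match specimen against a pattern containing @name holes.

    Returns a dict of the captured parts, or None if there is no match.
    Hypothetical helper; lazy matching of holes is an assumption.
    """
    regex = ""
    pos = 0
    for hole in re.finditer(r"@([A-Za-z_]\w*)", pattern):
        # Text between holes must match literally.
        regex += re.escape(pattern[pos:hole.start()])
        name = hole.group(1)
        # "_" is treated as a wildcard that binds nothing.
        regex += "(?:.*?)" if name == "_" else "(?P<%s>.*?)" % name
        pos = hole.end()
    regex += re.escape(pattern[pos:])
    m = re.fullmatch(regex, specimen, re.DOTALL)
    return None if m is None else m.groupdict()
```

With this, simple_match("&@entity;", "&amp;") binds entity to "amp", while 
simple_match("<@_>", "&amp;") fails and returns None.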

I'm doing this now, because I need to start treating my html pages as updoc 
source, and regression-testing them.  It's the only way I can keep the 
documentation and the implementation in synch.  And who knows, with enough 
testing, I may even discover a new bug.

I'm also gearing up to write the chapter on Quasi-Literals, and thought 
this might be a nice advanced example.

	Cheers,
	--MarkM



define parseInt(str) {
    import:java.lang.Integer parseInt(str)
}


define entity2txt(entity) {
    switch (entity) {
        match =="lt"    { "<" }
        match =="gt"    { ">" }
        match =="quot"  { "\"" }
        match =="amp"   { "&" }
        
        # there is a Unicode non-breaking space character,
        # but we use an ASCII space instead.  This means our
        # transformation isn't reversible.
        match =="nbsp"  { " " }
        
        # does html accept other radices?
        match rx`^#(@digits[0-9]+)$$` {
            "" + parseInt(digits) asChar
        }
        match _         { `&$entity;` }
    }
}

define html2txt(html) {
    define result := ""
    while (html != "") {
        switch (html) {
            match rx`(?ms)(@text.*?)(@thing(<.*?>|&.*?;))(@rest.*)` {
                result += text
                switch (thing) {
                    match `<@_>` {}
                    match `&@entity;` {
                        result += entity2txt(entity)
                    }
                }
                html := rest
            }
            match _ {
                result += html
                html := ""
            }
        }
    }
    result
}
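
On the Perl-or-Python question, one possible Python rendering is sketched 
below for comparison.  It is not a definitive translation: the names 
_NAMED and _THING are invented here, and the explicit loop is replaced by 
re.sub with a callback, which likewise scans left to right and never 
rescans text it has already emitted.

```python
import re

# Same five named entities as the E version; nbsp becomes a plain space.
_NAMED = {"lt": "<", "gt": ">", "quot": '"', "amp": "&", "nbsp": " "}

# A tag, or an entity whose name is captured in group 1.
_THING = re.compile(r"<.*?>|&(.*?);", re.DOTALL)

def entity2txt(entity):
    if entity in _NAMED:
        return _NAMED[entity]
    m = re.fullmatch(r"#([0-9]+)", entity)
    if m:
        # Decimal character reference, e.g. "#65" -> "A".
        return chr(int(m.group(1)))
    # Unknown entity: pass it through unchanged.
    return "&%s;" % entity

def html2txt(html):
    def replace(m):
        if m.group(0).startswith("<"):
            return ""                   # drop tags
        return entity2txt(m.group(1))   # translate entities
    return _THING.sub(replace, html)
```

For example, html2txt("<p>a &lt; b &amp;&#65;&foo;</p>") gives 
"a < b &A&foo;".  Two points for the sanity check, which appear to apply to 
the E version as well: HTML 4 also permits hexadecimal references of the 
form &#xHH;, which neither version decodes, and a stray "&" can swallow 
everything up to the next ";", so html2txt("AT&T <b>x</b>;") comes back 
with its tags intact.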