Antlr and E (was: More Perl regex stuff in E)
Mark S. Miller
markm@erights.org
Sat, 17 Apr 1999 15:15:12 -0700
At 08:43 AM 4/16/99 , mzukowski@bco.com wrote:
>Do you have a need for real language transformation tools? I'm quite
>involved with the ANTLR community and have done some serious parsers and
>translators using that tool (Pick to VB, GCC to GCC). It has a hook so that
>it can generate to different languages by adding your own code generator.
>Currently it supports Java and C++. Since E is similar to Java it shouldn't
>be too hard to generate E directly. If it generated E directly, that would
>imply that the generated code would be secure, yes?
>
>What sort of need would the E community have for an LL lexer, parser, and
>tree parser generation tool?
Yes, yes, and yes. As I mentioned, I browsed around starting at
"http://www.antlr.org". I'm impressed. As you've probably gathered, though
we provide a framework for pluggable parsers, we provide no help for
creating such parsers. You are correct that code using those parsers would
be known to be safe only if written in E. For the same reason, E's
mobile code plans involve the automatic replication only of code written in
E. (These plans are not yet implemented.) E's Kernel language -- the
compilation target -- is very simple and stable. The constructs listed in
http://www.erights.org/elang/elangmanual.pdf have changed very little since
this was written (caution: this hasn't been updated in a long time, so it
is somewhat stale), and the core E libraries are settling rapidly. Would
you be interested in writing this backend, and packaging Antlr as a
quasi-parser generation tool?
Another place we might be able to benefit from Antlr: replacing the current
E lexer and/or parser. We are currently using a lexer I wrote by hand in
Java, and a parser generated from a yacc grammar by BYacc/Java
"http://www.lincom-asg.com/~rjamison/byacc/". I remain impressed with how
readable a grammar is when expressed in Yacc's notation. It's much
better than other parser generators I've seen so far, but I don't get Antlr
well enough yet to have an opinion.
In any case, our current lexer works rather well (no known bugs), but it's
not the most maintainable piece of code in the world. Our parser, on the
other hand, has bigger problems.
1) Although BYacc/Java generates a parser in Java, BYacc/Java itself is
in C (it's a modified Berkeley yacc), making a pure-java-based E less self
sufficient. This isn't a big deal, but it would be nice to fix.
2) The lexer properly reports source position information to the parser.
However, I couldn't figure out any way for the parser to gather this into
the parse trees it generates without replicating code in each of the
semantic actions. Until this is fixed, we can't build a reasonable E
debugger. Does Antlr help here?
3) I like the fact that Antlr let's you declare semantic types per
production. BYacc/Java doesn't, so all my semantic values are declared
"Object" and cast at runtime.
4) While the error reporting by our lexer is adequate, the error reporting
from our parser sucks. This is mostly my fault, as I never learned how to
abuse yacc into reporting decent errors. However, I'm intrigued by the Antlr
approach: since you generate a stack-based recursive descent parser, you can
throw, catch, and handle exceptions according to the call stack. Your
ability to abstract an exception is determined by the nesting structure of
the grammar. Cool! How well does this work in practice? Btw, we have no
need for error recovery. I just want to give the programmer a decent
diagnostic for the first error, and stop.
5) Currently E runs by interpreting parse trees. This is great when you're
evolving a specification, but as mentioned above, the evolution of the
Kernel E parseNodes and their meaning is essentially done, and the cost of
interpretation is starting to hurt. It's time to write a compiler. I'm
intrigued by Antlr's ability to write grammars for parsing tree structures
rather than sequences. Does this actually help one write code generators?
Our two plausible compilation targets would be C and Java bytecodes. Once
#2 is fixed, Java source will not be a candidate, since Java has no #line
directive.
6) A non-issue but amusing: Antlr's is the only license I've seen that's
even less restrictive than the short Berkeley licence (which cover BYacc/Java).
Some concerns:
a) E's grammar is straight LALR(1). Our grammar is accepted by yacc with no
warnings, and with no ambiguity-resolving directives. With one exception,
there is no feedback from the parser to the lexer. (If necessary, this
exception would be easy to fix.) Antlr is LL(k). What's the relationship?
If I end up maintaining my grammar in Antlr, is there an easy way to
ensure I also remain LALR(1) so yacc hackers can build tools for E? I'm not
a parsing theory guy.
b) On current jvms, how much of a speed penalty do you pay for recursive
descent vs yacc's bottom-up table driven approach? My impression is that
current JITs do well manipulating scalar data and arrays of same, and on
loops and other within-method control flow, but they almost always generate
the equivalent of a virtual function call for every Java call, even when
they should know better. Most processors can't prefetch across a virtual
call. My impression is also that lexing & parsing time often dominate
compilation time.
c) How big is it? I downloaded it and there seems to be a lot of stuff, but
I'm not clear how much was necessary to generate a parser vs to run a parser
vs how much was example. Our current yacc-generated parser is 272KB in
three class files. (Ten times larger that all of OROMatcher, yow!) Our
lexer (ELexer.java) compiles to a 15KB class file.
d) As mention above, how readable would the grammar be?
e) If Antlr is LL(k), why'd y'all choose an acronym ending in LR?
Would you like to take a shot at writing the current E grammar in Antlr?
It's "e.y" in the E source distribution. However it turns out, I'm sure
we'd all love to hear about it!
Cheers,
--MarkM