Antlr and E (was: More Perl regex stuff in E)
mzukowski@bco.com
mzukowski@bco.com
Mon, 19 Apr 1999 09:26:28 -0700
-----Original Message-----
>From: Mark S. Miller [SMTP:markm@erights.org]
>Sent: Saturday, April 17, 1999 3:15 PM
>To: mzukowski@bco.com
>Cc: markm@caplet.com; e-lang@eros.cis.upenn.edu
>Subject: Antlr and E (was: More Perl regex stuff in E)
>
>At 08:43 AM 4/16/99 , mzukowski@bco.com wrote:
>>Do you have a need for real language transformation tools? I'm quite
>>involved with the ANTLR community and have done some serious parsers and
>>translators using that tool (Pick to VB, GCC to GCC). It has a hook so
that
>>it can generate to different languages by adding your own code generator.
>>Currently it supports Java and C++. Since E is similar to Java it
shouldn't
>>be too hard to generate E directly. If it generated E directly, that
would
>>imply that the generated code would be secure, yes?
>>
>>What sort of need would the E community have for an LL lexer, parser, and
>>tree parser generation tool?
>
>Yes, yes, and yes. As I mentioned, I browsed around starting at
>"http://www.antlr.org". I'm impressed. As you've probably gathered,
though
>we provide a framework for pluggable parsers, we provide no help for
>creating such parsers. You are correct that code using those parsers would
>be known to be safe only if written in E. For the same reason, E's
>mobile code plans involve the automatic replication only of code written in
>E. (These plans are not yet implemented.) E's Kernel language -- the
>compilation target -- is very simple and stable. The constructs listed in
>http://www.erights.org/elang/elangmanual.pdf have changed very little since
>this was written (caution: this hasn't been updated in a long time, so it
>is somewhat stale), and the core E libraries are settling rapidly. Would
>you be interested in writing this backend, and packaging Antlr as a
>quasi-parser generation tool?
>
Yes I'm interested. I don't have a whole lot of free time though.
>
>
>Another place we might be able to benefit from Antlr: replacing the current
>E lexer and/or parser. We are currently using a lexer I wrote by hand in
>Java, and a parser generated from a yacc grammar by BYacc/Java
>"http://www.lincom-asg.com/~rjamison/byacc/". I remain impressed with how
>readable a grammar is when expressed in Yacc's notation. It's much
>better than other parser generators I've seen so far, but I don't get Antlr
>well enough yet to have an opinion.
I find ANTLR much more readable because there is a one to one relationship
between rules and methods. The LL, top-down nature of it makes it work the
way you would expect--it was designed to generate the kind of code you would
have written by hand.
>
>In any case, our current lexer works rather well (no known bugs), but it's
>not the most maintainable piece of code in the world. Our parser, on the
>other hand, has bigger problems.
>
>1) Although BYacc/Java generates a parser in Java, BYacc/Java itself is
>in C (it's a modified Berkeley yacc), making a pure-java-based E less self
>sufficient. This isn't a big deal, but it would be nice to fix.
>
>2) The lexer properly reports source position information to the parser.
>However, I couldn't figure out any way for the parser to gather this into
>the parse trees it generates without replicating code in each of the
>semantic actions. Until this is fixed, we can't build a reasonable E
>debugger. Does Antlr help here?
Yes, I've done this before. Tree nodes are created from Token objects and
can replicate any information they need from the Token.
>3) I like the fact that Antlr let's you declare semantic types per
>production. BYacc/Java doesn't, so all my semantic values are declared
>"Object" and cast at runtime.
Not sure if you noticed that ANTLR lets you specify arguments to rules as
well.
>4) While the error reporting by our lexer is adequate, the error reporting
>from our parser sucks. This is mostly my fault, as I never learned how to
>abuse yacc into reporting decent errors. However, I'm intrigued by the
Antlr
>approach: since you generate a stack-based recursive descent parser, you
can
>throw, catch, and handle exceptions according to the call stack. Your
>ability to abstract an exception is determined by the nesting structure of
>the grammar. Cool! How well does this work in practice? Btw, we have no
>need for error recovery. I just want to give the programmer a decent
>diagnostic for the first error, and stop.
It works pretty well and can be fine tuned. The default message is just
stating something like what the parser was expecting and what rule it was
in. This can be fine-tuned at the rule level to say more meaningful things
like a declaration was expected here. The nice thing about error recovery
is that you can give lots of errors and then stop.
>5) Currently E runs by interpreting parse trees. This is great when you're
>evolving a specification, but as mentioned above, the evolution of the
>Kernel E parseNodes and their meaning is essentially done, and the cost of
>interpretation is starting to hurt. It's time to write a compiler. I'm
>intrigued by Antlr's ability to write grammars for parsing tree structures
>rather than sequences. Does this actually help one write code generators?
>Our two plausible compilation targets would be C and Java bytecodes. Once
>#2 is fixed, Java source will not be a candidate, since Java has no #line
>directive.
It certainly does help one write code generators. I've done two source to
source translation projects and you would be amazed at how difficult it is
to do right to output human-readable source. Antlr gives you the
flexibility needed for that though.
The big win is the analysis and transformation you can do with the tree
parsers. It's so clean once you're manipulating trees.
>6) A non-issue but amusing: Antlr's is the only license I've seen that's
>even less restrictive than the short Berkeley licence (which cover
BYacc/Java).
>
>
>Some concerns:
>
>
>a) E's grammar is straight LALR(1). Our grammar is accepted by yacc with
no
>warnings, and with no ambiguity-resolving directives. With one exception,
>there is no feedback from the parser to the lexer. (If necessary, this
>exception would be easy to fix.) Antlr is LL(k). What's the relationship?
>If I end up maintaining my grammar in Antlr, is there an easy way to
>ensure I also remain LALR(1) so yacc hackers can build tools for E? I'm
not
>a parsing theory guy.
You won't be able to automate a transformation of a grammar to/from LL/LALR.
But if you really don't have any ambiguities then it won't be difficult.
Not many people really like writing parsers--they usually write a parser to
do something. Having the parser available will let the hackers build tools.
If you want to maintain a LALR parser, go ahead. I don't expect you will be
changing your language specification to suit the parsing tool!
Parser to lexer feedback is a no-no. That will have to be fixed.
>b) On current jvms, how much of a speed penalty do you pay for recursive
>descent vs yacc's bottom-up table driven approach? My impression is that
>current JITs do well manipulating scalar data and arrays of same, and on
>loops and other within-method control flow, but they almost always generate
>the equivalent of a virtual function call for every Java call, even when
>they should know better. Most processors can't prefetch across a virtual
>call. My impression is also that lexing & parsing time often dominate
>compilation time.
I haven't done any tests, so I can't really give any meaningful answers to
these questions.
>c) How big is it? I downloaded it and there seems to be a lot of stuff,
but
>I'm not clear how much was necessary to generate a parser vs to run a
parser
>vs how much was example. Our current yacc-generated parser is 272KB in
>three class files. (Ten times larger that all of OROMatcher, yow!) Our
>lexer (ELexer.java) compiles to a 15KB class file.
Um, my ANSI C parser is 160K java source 65K class file, lexer is 70K
source, 32K class file.
>d) As mention above, how readable would the grammar be?
>
Extremely readable. ANTLR's specification is really an extended EBNF.
>e) If Antlr is LL(k), why'd y'all choose an acronym ending in LR?
>
ANoTher Language Recognizer. Pun in the same vein as yak, bison...
>Would you like to take a shot at writing the current E grammar in Antlr?
>It's "e.y" in the E source distribution. However it turns out, I'm sure
>we'd all love to hear about it!
>
Yes, this would be an easier first project than writing a parser code
generator, and probably more valuable in the short term. My time is
limitied but it shouldn't take long once I get started. Feel free to
motivate me by letting me know your schedule ;-)
Monty