Antlr and E (was: More Perl regex stuff in E)

Mark S. Miller markm@erights.org
Sat, 17 Apr 1999 15:15:12 -0700


At 08:43 AM 4/16/99 , mzukowski@bco.com wrote:
>Do you have a need for real language transformation tools?  I'm quite
>involved with the ANTLR community and have done some serious parsers and
>translators using that tool (Pick to VB, GCC to GCC).  It has a hook so that
>it can generate to different languages by adding your own code generator.
>Currently it supports Java and C++.  Since E is similar to Java it shouldn't
>be too hard to generate E directly.  If it generated E directly, that would
>imply that the generated code would be secure, yes?  
>
>What sort of need would the E community have for an LL lexer, parser, and
>tree parser generation tool?

Yes, yes, and yes.  As I mentioned, I browsed around starting at 
"http://www.antlr.org".  I'm impressed.  As you've probably gathered, though 
we provide a framework for pluggable parsers, we provide no help for 
creating such parsers.  You are correct that code using those parsers would 
be known to be safe only if written in E.  For the same reason, E's 
mobile code plans involve the automatic replication only of code written in 
E.  (These plans are not yet implemented.)  E's Kernel language -- the 
compilation target -- is very simple and stable.  The constructs listed in 
http://www.erights.org/elang/elangmanual.pdf have changed very little since 
this was written (caution: this hasn't been updated in a long time, so it 
is somewhat stale), and the core E libraries are settling rapidly.  Would 
you be interested in writing this backend, and packaging Antlr as a 
quasi-parser generation tool?



Another place we might be able to benefit from Antlr: replacing the current 
E lexer and/or parser.  We are currently using a lexer I wrote by hand in 
Java, and a parser generated from a yacc grammar by BYacc/Java 
"http://www.lincom-asg.com/~rjamison/byacc/".  I remain impressed with how 
readable a grammar is when expressed in Yacc's notation.  It's much 
better than other parser generators I've seen so far, but I don't get Antlr 
well enough yet to have an opinion.

In any case, our current lexer works rather well (no known bugs), but it's 
not the most maintainable piece of code in the world.  Our parser, on the 
other hand, has bigger problems.  

1) Although BYacc/Java generates a parser in Java, BYacc/Java itself is 
in C (it's a modified Berkeley yacc), making a pure-java-based E less self 
sufficient.  This isn't a big deal, but it would be nice to fix.

2) The lexer properly reports source position information to the parser.  
However, I couldn't figure out any way for the parser to gather this into 
the parse trees it generates without replicating code in each of the 
semantic actions.  Until this is fixed, we can't build a reasonable E 
debugger.  Does Antlr help here?

3) I like the fact that Antlr let's you declare semantic types per 
production.  BYacc/Java doesn't, so all my semantic values are declared 
"Object" and cast at runtime.

4) While the error reporting by our lexer is adequate, the error reporting 
from our parser sucks.  This is mostly my fault, as I never learned how to 
abuse yacc into reporting decent errors.  However, I'm intrigued by the Antlr 
approach: since you generate a stack-based recursive descent parser, you can 
throw, catch, and handle exceptions according to the call stack.  Your 
ability to abstract an exception is determined by the nesting structure of 
the grammar.  Cool!  How well does this work in practice?  Btw, we have no 
need for error recovery.  I just want to give the programmer a decent 
diagnostic for the first error, and stop.

5) Currently E runs by interpreting parse trees.  This is great when you're 
evolving a specification, but as mentioned above, the evolution of the 
Kernel E parseNodes and their meaning is essentially done, and the cost of 
interpretation is starting to hurt.  It's time to write a compiler.  I'm 
intrigued by Antlr's ability to write grammars for parsing tree structures 
rather than sequences.  Does this actually help one write code generators?  
Our two plausible compilation targets would be C and Java bytecodes.  Once 
#2 is fixed, Java source will not be a candidate, since Java has no #line 
directive.

6) A non-issue but amusing: Antlr's is the only license I've seen that's 
even less restrictive than the short Berkeley licence (which cover BYacc/Java).


Some concerns:


a) E's grammar is straight LALR(1).  Our grammar is accepted by yacc with no 
warnings, and with no ambiguity-resolving directives.  With one exception, 
there is no feedback from the parser to the lexer.  (If necessary, this 
exception would be easy to fix.)  Antlr is LL(k).  What's the relationship?  
If I end up maintaining my grammar in Antlr, is there an easy way to 
ensure I also remain LALR(1) so yacc hackers can build tools for E?  I'm not 
a parsing theory guy.

b) On current jvms, how much of a speed penalty do you pay for recursive 
descent vs yacc's bottom-up table driven approach?  My impression is that 
current JITs do well manipulating scalar data and arrays of same, and on 
loops and other within-method control flow, but they almost always generate 
the equivalent of a virtual function call for every Java call, even when 
they should know better.  Most processors can't prefetch across a virtual 
call.  My impression is also that lexing & parsing time often dominate 
compilation time.

c) How big is it?  I downloaded it and there seems to be a lot of stuff, but 
I'm not clear how much was necessary to generate a parser vs to run a parser 
vs how much was example.  Our current yacc-generated parser is 272KB in 
three class files.  (Ten times larger that all of OROMatcher, yow!)  Our 
lexer (ELexer.java) compiles to a 15KB class file.

d) As mention above, how readable would the grammar be?

e) If Antlr is LL(k), why'd y'all choose an acronym ending in LR?


Would you like to take a shot at writing the current E grammar in Antlr?  
It's "e.y" in the E source distribution.  However it turns out, I'm sure 
we'd all love to hear about it!


	Cheers,
	--MarkM