Parse trees v. ASTs
by way of "Mark S. Miller" <markm@caplet.com>
mzukowski@bco.com
Wed, 08 Sep 1999 11:06:30 -0700
(Mark, please cross post to e-lang along with my previous message. Even
using e-lang@mail.eros-os.org as jshap suggested fails. I can't get to the
web site either, must be something with our DNS I suppose.)
A parse tree is simply a tree where each non-terminal is a node and only
terminals are leaves. The parse tree faithfully records which rules were
called in what order, each rule reference becoming a new child of the
enclosing rule's node. So if we have an extremely simple expression
grammar:
expr: addExpr;
addExpr: multExpr ( ('+'^ | '-'^) multExpr )* ;
multExpr: constant ( ('*'^ | '/'^) constant)*;
constant: Number | ID | LPAREN^ expr RPAREN!;
Don't worry about the '^' or '!' yet, they are tree building operators. In
LL grammars such as ANTLR generates, precedence is implied by recursion.
You can see that if the expression to parse is just a Number, then the
parser does a chain of calls expr->addExpr->multExpr->constant to parse it.
In fact the parse tree for a Number like 3 would be
expr->addExpr->multExpr->constant->Number(3), where Number is the type of
the terminal and 3 is it's value. So the first win for ASTs is to eliminate
the redundant parts of the parse tree.
Precedence can be implied by your tree structure. The '^' operator shown
above literally says "promote this node to be the current root, and make the
current root my child." Antlr trees are child-sibling trees, so if no '^'
is present, then the default tree-building action is to create a node as the
previous node's sibling. A rule that matches a bunch of stuff but never has
a '^' will simply produce a flat list of sibling nodes. The '!' says don't
attach it to the tree, just discard it. There is no need to put the RPAREN
into the tree because it is implied by the LPAREN which will be the root of
the subtree. Likewise there is no need to keep SEMICOLONs as found in C or
Java. You would build a subtree with a STATEMENT root and know that the
whole subtree makes up just one statement.
parsing a+b creates a tree like this:
#( '+' a b )
parsing a+b*c above gives us a tree like:
#( '+' a #('*' b c) )
a*b+c:
#( '+' #('*' a b) c )
a * ( b + c ):
#( '*' a #( LPAREN #('+' b c) ) )
You should be able to follow the above rules by hand to see why the trees
are built that way. For more info check out www.antlr.org.
Now consider a C "for" loop, without tree building operators,
forstmt: "for"
LPAREN ( e1:expr )? SEMI ( e2:expr )? SEMI ( e3:expr )?
RPAREN
s:statement
;
Each of the expressions for loop initialization, loop test, loop increment
can be empty. To make a compact and regular representation of a for loop,
we can make it's tree structure be (in tree grammar notation):
forstmt: #("for" expr expr expr statement) ;
Now if we have empty expressions we have to put in a node that represents an
empty expression. The alternative is to keep the SEMIs in the tree and
check for expressions in between, but then you are doing as much
decision-making as the parser did (to see if indeed the optional expressions
are there). At the expense of introducing a dummy node that represents an
emtpy expression we can allow the tree parser to not have any decision
making to do when tree-parsing the forstmt. This speeds things up and
really simplifies things if you are doing tree-rewriting such as in a
multi-pass analysis and compilation program.
So my point is that ASTs are mainly a way to simplify a parse-tree's
structure. Tree parsers then give you quite a bit of flexibility. Consider
one of my favorite tricks:
program: (statement)* ;
statement: #("print" topexpr)
| #("while" topexpr (statments)*)
| stmtexpr
;
stmtexpr: topexpr ;
topexpr: expr ;
expr: plusexpr | multexpr | constant;
plusexpr:
#("+" expr expr)
| #("-" expr expr)
;
multexpr:
| #("*" expr expr)
| #("/" expr expr)
;
OK, now if you want to analyze every plusexpr you can override that rule and
do your analysis and transformation. If you want to analyze every expression
subtree, you can
do it in expr. If you want to analyze only complete expressions, analyze
topexpr. If you want to analyze only expressions which are standalone
statements, override
stmtexpr. In all cases "1+2" will be plusexpr and an expr, the tree being
#("plus" expr expr). Only in some cases will it be a topexpr or a stmtexpr,
depending on the
context in which it is found in the tree.
I used this in my AREV to VB translator because "IF A" in AREV had to be
translated to "IF Exists(A)" in VB where Exists was a function we wrote to
handle the case where A is uninitialized. In VB "IF A" would die if A is
not initialized, but it was OK for AREV. But of course I didn't want to put
the Exists() call around every ID in an expression, only if the entire
expression was just an ID.
For some other analysis I needed to know things like whether a "return" was
a nested in an "IF" or whether it was at the top level of code so I could
deduce that code following it was unreachable except by GOTO.
When manipulating ASTs it really is handy to build rules from an existing
grammar. My modus operandi is to build a vanilla tree parser that does
nothing, and then use grammar inheritence to override whatever methods I
need to for that particular pass. One pass might be expression
manipulation, one might translate case statements to if statements (AREV
case stmts are really if-else chains), one might name mangle, one analyzes
for dead code. My AREV translator was 8 passes.
What was really nice was that I could represent both the AREV grammar and VB
grammar with one tree-grammar by introducing new VB specific node types and
VB tree grammar rules. Hence the new VB rules were orthogonal (no
ambiguities) to the AREV rules because of the node types, but they could
still use AREV node types for things in common like IDs. By combining the
tree grammars this way I could translate in stages but still have the
intermediate representations be tree parsable. This is something you could
not do with parse trees or text representations because you need to know
which parts are in which language...
Monty