Parse trees v. ASTs
Chip Morningstar
chip@communities.com
Wed, 8 Sep 1999 12:21:06 -0700 (PDT)
I'm a little confused about this. Maybe there's some context here that
I'm missing since I'm interested in hearing about how ANTLR is
distinguished from, say, YACC, as a potential parse tool for E. If
that's not what we're talking about, please forgive much of the
following.
mzukowski@bco.com sez:
>
>A parse tree is simply a tree where each non-terminal is a node and only
>terminals are leaves. The parse tree faithfully records which rules were
>called in what order, each rule reference becoming a new child of the
>enclosing rule's node. So if we have an extremely simple expression
>grammar:
>
>expr: addExpr;
>addExpr: multExpr ( ('+'^ | '-'^) multExpr )* ;
>multExpr: constant ( ('*'^ | '/'^) constant)*;
>constant: Number | ID | LPAREN^ expr RPAREN!;
>
>Don't worry about the '^' or '!' yet, they are tree building operators. In
>LL grammars such as ANTLR generates, precedence is implied by recursion.
Isn't that how everybody does it? Although there *are* explicit
operator-precedence features in YACC (I don't know about other LALR(1)
parser generators as I haven't used them), these are really just a
syntactic sugar (in any case I never use them because I always get
confused by them. Defining the precedence hierarchy recursively is the
only way I can keep my head straight.)
>You can see that if the expression to parse is just a Number, then the
>parser does a chain of calls expr->addExpr->multExpr->constant to parse it.
>In fact the parse tree for a Number like 3 would be
>expr->addExpr->multExpr->constant->Number(3), where Number is the type of
>the terminal and 3 is it's value. So the first win for ASTs is to eliminate
>the redundant parts of the parse tree.
What redudancy is eliminated? As far as I can see that's the parse
tree anybody would generate.
>Precedence can be implied by your tree structure. The '^' operator shown
>above literally says "promote this node to be the current root, and make the
>current root my child." Antlr trees are child-sibling trees, so if no '^'
>is present, then the default tree-building action is to create a node as the
>previous node's sibling. A rule that matches a bunch of stuff but never has
>a '^' will simply produce a flat list of sibling nodes. The '!' says don't
>attach it to the tree, just discard it.
>There is no need to put the RPAREN
>into the tree because it is implied by the LPAREN which will be the root of
>the subtree. Likewise there is no need to keep SEMICOLONs as found in C or
>Java.
Once again, doesn't everybody do that? Generally speaking, nobody
would keep the parens and semicolons (actually, sometimes you do when
you need to keep enough information to reconstruct the source text
verbatim, including whitespace and comments, which I've done on
occasion, but if that's what you were doing you'd have to do the same
tricks here).
>a * ( b + c ):
>
>#( '*' a #( LPAREN #('+' b c) ) )
Why keep the LPAREN, for that matter? The hierarchy keeps track of the
precedence anyway.
>Now consider a C "for" loop, without tree building operators,
>
>forstmt: "for"
> LPAREN ( e1:expr )? SEMI ( e2:expr )? SEMI ( e3:expr )?
>RPAREN
> s:statement
> ;
>
>Each of the expressions for loop initialization, loop test, loop increment
>can be empty. To make a compact and regular representation of a for loop,
>we can make it's tree structure be (in tree grammar notation):
>
>forstmt: #("for" expr expr expr statement) ;
>
>Now if we have empty expressions we have to put in a node that represents an
>empty expression. The alternative is to keep the SEMIs in the tree and
>check for expressions in between, but then you are doing as much
>decision-making as the parser did (to see if indeed the optional expressions
>are there). At the expense of introducing a dummy node that represents an
>emtpy expression we can allow the tree parser to not have any decision
>making to do when tree-parsing the forstmt. This speeds things up and
>really simplifies things if you are doing tree-rewriting such as in a
>multi-pass analysis and compilation program.
Again, am I missing something? Isn't this how everybody does it?
What's different about the ANTLR approach?
>So my point is that ASTs are mainly a way to simplify a parse-tree's
>structure.
Where's the simplification? Or have we actually been using ASTs all along
and just calling them "parse-trees" out of sloppiness?
Chip