compiling E: Phases of Transformations
Mark S. Miller
markm@caplet.com
Wed, 09 Aug 2000 11:53:36 -0700
[I think perhaps this message was dropped as well. --MarkM]
At 08:58 AM 8/8/00 , Monty Zukowski wrote:
>>I don't have anything like textual file inclusion, so I don't think I have
>>*this kind* of nesting.
>
>What about when you call antlr for quasi-parsers, or what about being able to
>debug each stage of your transformations? You know you will want to look at
>the output of a particular pass to debug it, and you will want a way to
>identify the line in that particular pass based on the kernel-E.
>
>It could be external to the source, but needs to be thought about.
Well, these certainly need to be thought through. You raise two very
different cases. Each deserves its own reply.
> What about when you call antlr for quasi-parsers
I assume your question is independent of the particular quasi-parser used,
but is instead about E's embedded quasi-parser syntax. For example, what
source positions should we assign to the nodes resulting from the expansion of:
e`($a + $b) * $c`
as they appear somewhere in some source file? And how can we bring about
our answer? Remember that the E compiler must not make a special case for
(or have any special knowledge of) the "e" quasi-parser as opposed to any
other quasi-parser. We should start with the token stream the E Lexer
outputs for the above expression. I place between double-quotes below the
source text corresponding to a token, rather than some value of the token
that the lexer may calculate. Assume that the above source text starts at
position 100 in a source file named "foo.e". "..!" is E's operator for a
closed-open ascending interval.
"e" -- an Identifier @ foo.e 100..!101
"`(" -- a QuasiOpen @ foo.e 101..!103
"$" -- a '$' @ foo.e 103..!104
"a" -- an Identifier @ foo.e 104..!105
" + " -- a QuasiOpen @ foo.e 105..!108
"$" -- a '$' @ foo.e 108..!109
"b" -- an Identifier @ foo.e 109..!110
") * " -- a QuasiOpen @ foo.e 110..!114
"$" -- a '$' @ foo.e 114..!115
"c" -- an Identifier @ foo.e 115..!116
"`" -- a QuasiClose @ foo.e 116..!117
Right now, the parser, when fed this token stream in expression context,
expands it to the equivalent of the following E code:
e__quasiParser valueMaker("(${0} + ${1}) * ${2}") substitute([a, b, c])
Unfortunately this expansion does not give the quasi-parser being called any
way to know about the source positions the pieces of this string originally
came from. The only way for it to know is for it to be told. Even to solve
this problem, we must now make a choice among three means of telling a
parser about original source positions:
1) by #line-like directives placed in the sources themselves,
2) parse-tree annotations, or,
3) as Monty suggests in a different email, a text-position-to-text-position
mapping.
We can immediately rule out my favorite, #2, on the grounds that the E
parser doesn't even know what language the quasi-parser will parse, so it is
completely unable to feed it a parse tree rather than text. Much of the
modularity of the quasi-parser approach to syntax embedding is that
knowledge of a quasi-syntax need be shared only between the E programmer
writing the quasi-expression and the quasi-parser she is invoking. The only
constraint placed on these quasi-syntaxes is the special treatment of
${<integer>}
@{<integer>}
And that it accept
@@ as meaning what @ would have meant, even before a {
$$ as meaning what $ would have meant, even before a {
By "would have meant", I mean that a quasi-syntax often corresponds to some
non-quasi syntax, as in our proposed quasi-parser for XML
http://www.erights.org/elang/grammar/quasi-xml.html . "$${0}" in the input
to the XML quasi-parser should mean the literal XML string "${0}" rather
than being interpreted specially. In support of this special treatment, the
E expansion of a quasi-literal expression does not translate "$$a" or
"$${a}", but rather passes them on to the quasi-parser unmolested.
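To make the current expansion concrete, here is a rough sketch in Python
(not E, and not the actual E parser) of turning the text between the
backquotes into the template string and the list of substituted names. The
function name expand_quasi_body and the regular expression are assumptions
made up for this illustration, and only bare "$name" holes are handled. A
"$$" is passed through untouched, as described above.

```python
import re

# Matches either an escaped "$$" or a bare hole such as "$a".
# Hypothetical pattern for illustration; the real E lexer is not regex-based.
_DOLLAR = re.compile(r"\$\$|\$([A-Za-z_]\w*)")


def expand_quasi_body(body):
    """Return (template, names) for the text between the backquotes."""
    names = []

    def replace(match):
        name = match.group(1)
        if name is None:
            # "$$" is left as-is for the quasi-parser to interpret.
            return match.group(0)
        names.append(name)
        # Each hole becomes ${<integer>}, numbered in order of appearance.
        return "${%d}" % (len(names) - 1)

    return _DOLLAR.sub(replace, body), names


template, names = expand_quasi_body("($a + $b) * $c")
print(template)  # (${0} + ${1}) * ${2}
print(names)     # ['a', 'b', 'c']
```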
I went through the above machinations in order to argue against option #1 --
a #line-like directive. Were we to define such a directive to be recognized
by all quasi-parsers, we could get a similar kind of interference with all
the pre-existing grammars that a quasi-grammar might derive from. Instead,
let's explore the text-position-mapping idea. The original quasi-expression
might now expand to:
e__quasiParser valueMaker("(${0} + ${1}) * ${2}",
                          "foo.e",
                          [0 => 102,   # "("
                           5 => 105,   # " + "
                           12 => 110   # ") * "
                          ],
                         ) substitute([a, b, c])
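As a rough sketch of how such a mapping could be computed and then used,
here is some Python (not E). The names expand_with_positions and
source_position are hypothetical, the hole pattern is an assumption, and
"$$" escapes are ignored for brevity. Each run of literal text records the
offset where it starts in the template and the offset where it starts in
the source file. A quasi-parser reporting an error at some template offset
can then translate it back.

```python
import re
from bisect import bisect_right

# Bare holes such as "$a"; an assumed pattern, for illustration only.
_HOLE = re.compile(r"\$([A-Za-z_]\w*)")


def expand_with_positions(body, start):
    """body: text between the backquotes; start: source offset of body[0].

    Returns (template, names, mapping), where mapping takes the template
    offset of each literal run to its offset in the source file.
    """
    parts, names, mapping = [], [], {}
    out_len = 0   # length of the template built so far
    last = 0      # index in body just past the previous hole
    for m in _HOLE.finditer(body):
        if m.start() > last:
            literal = body[last:m.start()]
            mapping[out_len] = start + last
            parts.append(literal)
            out_len += len(literal)
        hole = "${%d}" % len(names)
        names.append(m.group(1))
        parts.append(hole)
        out_len += len(hole)
        last = m.end()
    if last < len(body):
        mapping[out_len] = start + last
        parts.append(body[last:])
    return "".join(parts), names, mapping


def source_position(mapping, template_offset):
    """Translate a template offset that lies inside a literal run.

    Offsets that fall inside a ${n} hole are not meaningful here.
    """
    keys = sorted(mapping)
    i = bisect_right(keys, template_offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first literal run")
    return mapping[keys[i]] + (template_offset - keys[i])


template, names, mapping = expand_with_positions("($a + $b) * $c", 102)
print(mapping)                       # {0: 102, 5: 105, 12: 110}
print(source_position(mapping, 14))  # 112, the "*" in foo.e
```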
Passing a mapping this way suggests that compilers, in general, should
accept such a mapping as an optional input. To use another of our examples, if the
Java compiler had this option, although Java itself continued not to have a
#line directive, then ANTLR could generate both Java source code & such a
mapping for the Java compiler to then chew on. (Btw, if ANTLR were
augmented to generate such a mapping, it would be able to, instead,
post-process the *.class files produced by Java. When E compiles to the
Java platform, conceivably it should do likewise, rather than generating JVM
bytecodes.)
Within a compiler implementation, however, it may still be best to represent
source position information, and pass it from phase to phase, by attaching
it to parse-nodes. But it would be interesting to see an alternate concrete
proposal.
And just what is the plural of "syntax" anyway?
Cheers,
--MarkM