Compiling E: Phases of Transformations
Dean Tribble
tribble@netcom.com
Wed, 09 Aug 2000 10:25:44 -0700
Apparently this was not delivered to the list because it had HTML in it (I
suspect other postings have gone awry for the same reason).
----------------------------
At 02:22 PM 8/6/00 -0700, Mark S. Miller wrote:
>As shown by Scheme compilers and their descendents (notably Rabbit, Orbit,
>and the unreleased Joule compiler), an elegant compilation technique is to
BTW the Joule compiler was released years ago, just not
open-source. Additional references: the single best paper on the subject
was in 1990 or thereabouts by Will Clinger. It was concise, and had in a
much more obviously simple framework many fo the transformations that
looked hard in Rabbit, etc. Also, Appel's ML books apply the same
techniques and explain them well.
>At the end of this part of compilation, the original program is expressed in
>a very constrained subset of the language. This subset should be designed
Note that the constrained subset may have *additional* semantics over the
language (for example, in Joule, order of statements became significant
after some transformation stage).
>to make code generation from there a pleasant and effective task. Ideally,
>all interesting platform independent aspects of compilation can be embodied
>in the source-to-source transformation phases.
As should some aspects of the platform-dependent compilation (for example,
some of the named compilers go all the way to register assignment!).
>Each such further constrained subset can be given a name. The following are
>the language subsets and transformations I am currently thinking about for E.
>In further email, I'll then explain my rationale for these choices.
>
>(Note: When I say "parse tree" or "parse node" I mean "AST" or "AST node".
>Sorry, but it's what I'm used to.)
I would not assume that you will name all the stages. Once you have a
compiler of this architecture, it is extremely useful to occasionally
insert phases that do some really trivial thing (transform the code so that
the number of references to a variable are stored in a constant expression
(that will eventually be eliminated as dead code) right before the variable
definition. Which language would that be? Often, the phases are the same
sub language, you just know more.
There's also a fundamental decision to make: Should everything be
represented as syntactic expressions with the language (in which case you
end up with the above bizarre extra expressions that nonetheless would
compile to the same program? Or, should you use richer parse nodes that
can represent the meta-information (such as number of references)? In the
latter case, the parse nodes would "print" as a legal program, but any
meta-information would be lost (or in encoded in comments, or in explicit
meta-constructs such as type declarations).
>Language: User-E -- the language users write in
>
>Transformation: Local Expansion -- currently done in the parser
> If you think of "while" as a macro, think of this as
> macro-expansion.
>
>Language: Expanded-E -- uses very few parser node types
>
>Transformation: Name Capture -- make explicit all significance of
> user-written names, such as for upgrade support.
>
>Language: Renamable-E -- meaning would be preserved under renaming
>
>Transformation: Alpha Renaming -- non-outer-variables given unique generated
> names
>
>Language: Illuminated-E -- no shadowing
Something to consider here: only rename variables that would shadow other
variables. Then, *most* variables don't have gobbledygook on them. Only
shadowing variables need renaming. The downside would be that later
transformations could result in shadowing. Hmm. I guess the best thing to
do is rename with a special character such that it's easy to strip the
renaming (Joule uses another technique, but it is loosely equivalent to
renaming by appending to every variable name the ID (and a separator) of
the syntactic scope).
>Transformation: De-cycling & "declare"-expansion -- forward refs become
> Promises. (Dean plausibly suggests we leave this
> till last.)
>
>Language: Auditable-E -- portable interface to virtual machine
> What the virtual machine is asked to load.
>
>Transformation: Auditing -- Ask auditors to certify parse trees
>
>Language: Kernel-E -- portable interface from virtual machine
> The subset of Auditable-E that passes audit. Seen
> by debuggers and other meta-level tools.
>
>-------------
>The transformation from User-E to Kernel-E is standard, portable, and
>canonical. Modulo choice of generated names, two E implementations must
This seems like a bad idea (and is the real reason I wrote this
message). Any legal optimization (or indeed any optimization at all that
the programmer is willing to risk) should be allowed prior to this. The
compiler works on behalf of the programmer; why should you limit his
tools. Think of the compiler as a tool that the programmer uses to produce
kernel E.
>transform the same User-E program into the same Kernel-E program. This is
>the form of E normally seen by programs that manipulate programs.
>
>Below this point, the transformations are expected to be useful across
>compilation targets, but with possible differences for different targets.
>In any case, the following transformations are invisible to unprivileged
>normal users of E. The virtual machine must appear to be interpreting
>Kernel-E.
>-------------
>
>Transformation: Deslotifying -- de-reify variables-as-Slots. See
>
>http://www.erights.org/enative/varcases.html#deslotify
>
>Language: Memory-E -- variables are no more than memory locations
>
>Transformation: Simplification -- constant folding, dead code elimination, ...
> Possibly, some simplifications should be mixed into
> earlier phases as well.
>
>Language: Memory-E but not as stupid
Definitely many of these should happen early, and this is an area where you
*want* people to extend the compilation process!
>Transformation: Early Declaration -- variables declared at their
> allocation-contour. See
> http://www.erights.org/enative/varcases.html#scoping
>
>Language: E-gen -- the last stage normally represented in E parse
> trees.
Same as the above. I assumed that this happened long before "kernel" E.
BTW a comment on platform dependent optimizations: with syntactic patterns
representing register assignment, the transformed code can still be a
general, portable E program that just happens to run directly and best on a
particular architecture. Thus, a program requiring 8 temporaries on a 4
register machine might be rewritten to use 3 temporary variables (declared
in order of most frequent use) with the last temporary pointing at an array
of 5 cells. The references to the cells that were no longer temporaries
would be changed to array fetches. The programs would operate correctly on
any machine, but would directly represent the structure necessary for the 4
register machine (and be a poor layout for an 8 register machine). Appel's
book goes into transformations to represent register-saves calling
conventions, etc.