Compiling E: Phases of Transformations

Dean Tribble tribble@netcom.com
Wed, 09 Aug 2000 10:25:44 -0700


Apparently this was not delivered to the list because it had HTML in it (I 
suspect other postings have gone awry for the same reason).
----------------------------

At 02:22 PM 8/6/00 -0700, Mark S. Miller wrote:
>As shown by Scheme compilers and their descendents (notably Rabbit, Orbit,
>and the unreleased Joule compiler), an elegant compilation technique is to

BTW the Joule compiler was released years ago, just not 
open-source.  Additional references: the single best paper on the subject 
was in 1990 or thereabouts by Will Clinger.  It was concise, and had in a 
much more obviously simple framework many fo the transformations that 
looked hard in Rabbit, etc.  Also, Appel's ML books apply the same 
techniques and explain them well.

>At the end of this part of compilation, the original program is expressed in
>a very constrained subset of the language.  This subset should be designed

Note that the constrained subset may have *additional* semantics over the 
language (for example, in Joule, order of statements became significant 
after some transformation stage).

>to make code generation from there a pleasant and effective task.  Ideally,
>all interesting platform independent aspects of compilation can be embodied
>in the source-to-source transformation phases.

As should some aspects of the platform-dependent compilation (for example, 
some of the named compilers go all the way to register assignment!).

>Each such further constrained subset can be given a name.  The following are
>the language subsets and transformations I am currently thinking about for E.
>In further email, I'll then explain my rationale for these choices.
>
>(Note: When I say "parse tree" or "parse node" I mean "AST" or "AST node".
>Sorry, but it's what I'm used to.)

I would not assume that you will name all the stages.  Once you have a 
compiler of this architecture, it is extremely useful to occasionally 
insert phases that do some really trivial thing (transform the code so that 
the number of references to a variable are stored in a constant expression 
(that will eventually be eliminated as dead code) right before the variable 
definition.  Which language would that be?  Often, the phases are the same 
sub language, you just know more.

There's also a fundamental decision to make:  Should everything be 
represented as syntactic expressions with the language (in which case you 
end up with the above bizarre extra expressions that nonetheless would 
compile to the same program?  Or, should you use richer parse nodes that 
can represent the meta-information (such as number of references)?  In the 
latter case, the parse nodes would "print" as a legal program, but any 
meta-information would be lost (or in encoded in comments, or in explicit 
meta-constructs such as type declarations).

>Language:        User-E -- the language users write in
>
>Transformation: Local Expansion -- currently done in the parser
>                         If you think of "while" as a macro, think of this as
>                         macro-expansion.
>
>Language:        Expanded-E -- uses very few parser node types
>
>Transformation: Name Capture -- make explicit all significance of
>                          user-written names, such as for upgrade support.
>
>Language:        Renamable-E -- meaning would be preserved under renaming
>
>Transformation: Alpha Renaming -- non-outer-variables given unique generated
>                         names
>
>Language:        Illuminated-E -- no shadowing

Something to consider here: only rename variables that would shadow other 
variables.  Then, *most* variables don't have gobbledygook on them.  Only 
shadowing variables need renaming.  The downside would be that later 
transformations could result in shadowing.  Hmm.  I guess the best thing to 
do is rename with a special character such that it's easy to strip the 
renaming (Joule uses another technique, but it is loosely equivalent to 
renaming by appending to every variable name the ID (and a separator) of 
the syntactic scope).

>Transformation: De-cycling & "declare"-expansion -- forward refs become
>                         Promises.  (Dean plausibly suggests we leave this
>                         till last.)
>
>Language:        Auditable-E -- portable interface to virtual machine
>                         What the virtual machine is asked to load.
>
>Transformation: Auditing -- Ask auditors to certify parse trees
>
>Language:        Kernel-E -- portable interface from virtual machine
>                         The subset of Auditable-E that passes audit.  Seen
>                         by debuggers and other meta-level tools.
>
>-------------
>The transformation from User-E to Kernel-E is standard, portable, and
>canonical.  Modulo choice of generated names, two E implementations must

This seems like a bad idea (and is the real reason I wrote this 
message).  Any legal optimization (or indeed any optimization at all that 
the programmer is willing to risk) should be allowed prior to this.  The 
compiler works on behalf of the programmer; why should you limit his 
tools.  Think of the compiler as a tool that the programmer uses to produce 
kernel E.

>transform the same User-E program into the same Kernel-E program.  This is
>the form of E normally seen by programs that manipulate programs.
>
>Below this point, the transformations are expected to be useful across
>compilation targets, but with possible differences for different targets.
>In any case, the following transformations are invisible to unprivileged
>normal users of E.  The virtual machine must appear to be interpreting
>Kernel-E.
>-------------
>
>Transformation: Deslotifying -- de-reify variables-as-Slots.  See
> 
>http://www.erights.org/enative/varcases.html#deslotify
>
>Language:        Memory-E -- variables are no more than memory locations
>
>Transformation: Simplification -- constant folding, dead code elimination, ...
>                          Possibly, some simplifications should be mixed into
>                          earlier phases as well.
>
>Language:         Memory-E but not as stupid

Definitely many of these should happen early, and this is an area where you 
*want* people to extend the compilation process!


>Transformation:  Early Declaration -- variables declared at their
>                          allocation-contour.  See
>                          http://www.erights.org/enative/varcases.html#scoping
>
>Language:         E-gen -- the last stage normally represented in E parse
>                          trees.

Same as the above.  I assumed that this happened long before "kernel" E.

BTW a comment on platform dependent optimizations:  with syntactic patterns 
representing register assignment, the transformed code can still be a 
general, portable E program that just happens to run directly and best on a 
particular architecture.  Thus, a program requiring 8 temporaries on a 4 
register machine might be rewritten to use 3 temporary variables (declared 
in order of most frequent use) with the last temporary pointing at an array 
of 5 cells.  The references to the cells that were no longer temporaries 
would be changed to array fetches.  The programs would operate correctly on 
any machine, but would directly represent the structure necessary for the 4 
register machine (and be a poor layout for an 8 register machine).  Appel's 
book goes into transformations to represent register-saves calling 
conventions, etc.