[e-lang] The Power of Irrelevance (was: Method vs. matcher semantics)

Mark S. Miller e-lang@mail.eros-os.org
Sun, 01 Feb 2004 03:00:14 -0800


I'd like to thank Dave Liebs for prodding me into writing this.


At 11:08 PM 1/31/2004  Saturday, Dean Tribble wrote:
>>Kevin wrote:
>>>>Also, shouldn't a result guard [on match] be allowed?
>>
>>At 03:09 PM 1/22/2004  Thursday, Dean Tribble wrote:
>>>I think so.
>>
>>If this worked syntactically, it would be fine. It would not increase the ...
>
>I presume that requiring parens on the argument to 'match' would introduce 
>other problems (in addition to actually violating backwards compatibility).

Yes. The general style followed by most of E, after a clause-introducing 
keyword, is that parens surround expressions, not patterns:

    escape pattern {
    } catch pattern {
    while (expr) {
    if (expr) {

The other major style is function / method declaration, in which we have an 
identifier followed by a list of patterns within parens:

    to ident(pattern,...) ... {
    def ident(pattern,...) ... {
    when (expr,...) -> ident(pattern) ... {

Our glaring inconsistency is that we don't require parens around the
collection expression in a for-loop. We accept:

    for pattern => pattern in expr {
and
    for pattern in expr {

whereas, if we were consistent, we'd require

    for pattern => pattern in (expr) {
and
    for pattern in (expr) {

What's cool about this convention is that it enables a human reader to 
perform scope analysis on the pre-expansion text by following a few simple 
rules. Why is this important?


    The Power of Irrelevance:
    On the design of notation to support the review process


One of the lessons driven home by our Xanadu experience: any data structure
used to answer queries over a huge amount of data must first of all provide
the property of quick reject: it must quickly disqualify most of the data it
holds from being relevant to the query, so that it can afford to do more
subtle calculations on what remains. A good example is rectangular bounding
boxes around complex shapes in a graphics app: we can quickly reject all
shapes whose bounding boxes don't even intersect a ray, and then do a more
subtle calculation on the remaining candidates to see whether they actually
do intersect.
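
To make quick reject concrete, here's a minimal C++ sketch of the
bounding-box version (all names are invented for illustration, the ray is
simplified to a query point, and the subtle exact test is elided):

    #include <vector>

    struct Box { double minX, minY, maxX, maxY; };

    struct Shape {
        Box bounds;                               // conservative bounding box
        bool exactHit(double x, double y) const;  // subtle test, elided here
    };

    // Cheap conservative test: a query point outside the bounding box
    // cannot possibly hit the shape inside it.
    static bool mightHit(const Box& b, double x, double y) {
        return x >= b.minX && x <= b.maxX && y >= b.minY && y <= b.maxY;
    }

    // Quick reject: disqualify most shapes by bounding box alone, so we
    // can afford the subtle calculation on the few that remain.
    std::vector<const Shape*> query(const std::vector<Shape>& shapes,
                                    double x, double y) {
        std::vector<const Shape*> hits;
        for (const Shape& s : shapes) {
            if (!mightHit(s.bounds, x, y)) continue;   // quickly rejected
            if (s.exactHit(x, y)) hits.push_back(&s);  // survived both tests
        }
        return hits;
    }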

In thinking about lessons from the security review you and David put us 
through ;), I came to realize that much of its success was due to E's 
support for a similar quick reject property. Given a question and a large 
body of largely unfamiliar code, the first job of a programming language 
notation is to help human readers (both with and without IDE help) do a 
quick reject: to disqualify most of the program as being irrelevant to the 
current question without needing to look at most of it.

Scope analysis is the reader's main tool for quickly determining, when 
looking at a program fragment, which things cannot influence what other 
things. Scope analysis gives a first conservative bound on possible 
influence analysis.

This explanation helps explain what's wrong with Smalltalk-72, all macro
systems I know about[*], dynamic scoping, ambient authority, Pascal's "var"
parameters, and C++'s references. When looking at a call site (or something
that locally looks like a call site), can the reader tell, without knowledge
of the thing apparently being called, what things of the caller are being
made accessible to the callee? All the above language elements disrupt that
analysis. For example, in E or C-without-macros, in the call

    foo(x)

without knowing foo, we still know that the callee is being given access to 
x's value, but isn't given the ability to assign to the x variable. If the E 
or C-without-macro programmer wishes to give foo the ability to change x's 
value, they need to write

    foo(&x)

which makes this extra access visible to readers who don't know foo. Pascal
and C++ don't have this property. Unless you know how foo declares the 
corresponding parameter, you can't tell what the caller has given the callee 
the ability to do. These tend towards the failure mode of which Smalltalk-72 
was the extreme: you can't understand anything about the program until you 
understand everything.
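
A minimal C++ sketch makes the contrast visible (the function names are
invented for illustration):

    // C-style: the grant of write access is visible at the call site.
    void foo(int* x) { *x = 42; }

    // C++ reference: the same grant is declared only at the callee.
    void bar(int& x) { x = 42; }

    int main() {
        int x = 1;
        foo(&x);   // the reader sees "&": foo may assign to x
        bar(x);    // looks like a pure use of x's value, yet bar assigns to it
        return 0;
    }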

In Smalltalk-76, Smalltalk-80, Squeak, ML, Kernel-E, or 
Scheme-without-macros, the program is written in a simple language with very 
few scoping rules, where those rules reliably allow the reader, from a 
program fragment, to use scope analysis to disqualify much of the program 
from relevance to a question, without looking up the definition of things 
mentioned at call sites. This property is essential for secure code, but is 
also important in general.

Clearly, all macro systems prior to Scheme's invention of the notion of 
hygienic macros (who invented this notion?) fatally disrupt this ability. 
Scope analysis was only done after macro expansion, so the reader would have 
to simulate the expansion in their head before they could use scope
analysis. This requires understanding which calls are actually macro calls, 
and it requires looking up the definition of all those macros and 
understanding them.
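
C's preprocessor, though not a Lisp, shows the same disruption in miniature
(a hypothetical macro, invented for illustration):

    #include <stdio.h>

    // The macro secretly introduces its own variable named "i".
    #define REPEAT(n, body) for (int i = 0; i < (n); i++) { body; }

    int main() {
        int i = 10;
        // Pre-expansion scope analysis says the "i" below is the i above.
        // Post-expansion, it is captured by the macro's loop variable:
        // this prints 0 1 2, not 10 10 10.
        REPEAT(3, printf("%d\n", i));
        return 0;
    }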

What about hygienic macros themselves? They point the way towards sense, but
don't go far enough to achieve the above goal. The reader of a
fragment of a Scheme program that might be using hygienic macros must still 
do everything listed above before scope analysis allows them to disqualify 
any of the program from relevance to their question.

In the absence of macro expansion, we can think of the normal analysis of 
code as proceeding by the following steps:

1. Reading sequences of bits to recognize characters. (With UTF-8, this step 
    isn't as trivial as it once was.)
2. Lexing: Reading sequences of characters to recognize tokens.
3. Parsing: Reading sequences of tokens to recognize BNF productions, 
    producing ASTs.
4. Scope Analysis: Walking ASTs to match use and defining occurrences of 
    identifiers, hopefully by simple lexical scoping rules.
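
As a rough type-level sketch (every name below is invented for
illustration), the sequence composes like this:

    #include <string>
    #include <vector>

    struct Token {};     // placeholder types for this sketch
    struct Ast {};
    struct Bindings {};  // matches use occurrences to defining occurrences

    std::u32string decode(const std::string& bytes);      // 1. bits -> chars
    std::vector<Token> lex(const std::u32string& chars);  // 2. chars -> tokens
    Ast parse(const std::vector<Token>& tokens);          // 3. tokens -> AST
    Bindings scopeAnalyze(const Ast& tree);               // 4. uses <-> defs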

I believe the above sequence works so well because it has some kind of 
psychological match to what kinds of recognition tasks humans perform well 
when looking at program text. I won't try to defend this claim, but I do 
assume it. From my own introspection it seems right. Yes, I know 
introspection is a terrible guide, but it's the best I've got.

In any case, we categorize macro systems according to where they intervene 
in the above sequence:

1.5. Pre-ANSI-C operated on character sequences prior to lexing.
2.5. ANSI-C operates on the token stream between lexing and parsing.
3.5. Non-hygienic Lisp macro systems operated on the S-Expression (Lisp's 
       AST) between parsing and scope analysis.
4. Scheme's hygienic macros operate after parsing, interleaved with scope
    analysis.

I believe most would agree that each step along this sequence has led to an 
increase in quality, from a software engineering standpoint. I claim that 
the reason is that each step has enabled a reader to understand more about a 
program fragment prior to simulating the expansion of the macros. The
E-to-Kernel-E expansion comes close to providing the following property;
where E fails to provide it, that should be considered a bug to be fixed.
Should E ever allow user-defined macros, they must preserve this property as
well. The productions reserved for macros in the E grammar are indeed able
to support a macro system with this property.

4.5. Expansion happens after scope analysis. Scope analysis therefore 
happens on the pre-expansion text, without reference to the definition of 
the macros being invoked. (Of course, it need not be implemented this way, 
as long as the effect is equivalent.)

I claim this would be better by direct extrapolation of the above argument.

However, we also wish the results of expansion to be a program in the 
language, so that we may use the same notation to explain the results of 
expansion. This leads to the following rule:

Imagine that we printed the expanded program (in our case, the resulting 
Kernel-E program) as the source text of a program in the source language (in 
our case, an E program). If a scope analysis of the expanded program would 
yield different bindings than the pre-expansion scope analysis, statically 
reject the program.
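
In terms of the type sketch above (again with invented names; an actual
implementation need only achieve the equivalent effect), the rule looks
like this:

    Ast expand(const Ast& source);  // macro / syntactic-sugar expansion
    bool agreeOnSourceIdents(const Bindings& pre, const Bindings& post);

    // The 4.5 rule: analyze before expanding; statically reject any program
    // whose expansion, re-read as source text, would bind differently.
    Ast checkAndExpand(const Ast& source) {
        Bindings pre = scopeAnalyze(source);     // pre-expansion analysis
        Ast expanded = expand(source);
        Bindings post = scopeAnalyze(expanded);  // re-read the expansion
        if (!agreeOnSourceIdents(pre, post)) {
            throw "expansion would change what the source identifiers mean";
        }
        return expanded;
    }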

Here's an example where the current E implementation fails to obey this 
rule, should be considered buggy, and should be fixed:

    ? def i := 3
    # value: 3

    ? for i in (0..i) { println(i) }
    0
    1
    2
    3
    ?

Surprised? You should be. By the normal E rules, which the E reader can rely 
on almost all the time, an identifier comes into scope starting at its 
defining occurrence and proceeding left-to-right until the end of its 
containing scope box, except as shadowed by definitions of the same 
identifier in contained scope boxes. A scope box starts at the keyword 
introducing a clause (here, "for"), and lasts until the corresponding close 
curly. http://www.erights.org/elang/blocks/ifExpr.html

By this rule, the reader would be justified in expecting the "i" in "(0..i)"
to refer to the "i" in "for i in". But the current E implementation's scope
analysis, like a pre-hygienic Lisp's, operates only after expansion, and the
for-loop expansion reverses the order in which these two parts occur, as
well as placing the iteration variable "i" in a nested scope box. As a
result, the "i" in "(0..i)" refers to the "def i", whereas the "i" in
"println(i)" refers to the iteration variable "i", which shadows the "i" in
"def i".

What E should do, ie, what we are here specifying a non-buggy E will do, is 
to statically reject the above program. This may well give the program's
author an unpleasant static surprise. But the inability to author this kind
of obfuscated program will prevent nasty surprises to the readers, which is 
vastly more important. Even for programs I author, I read them much more 
than I author them. Legal programs must fulfill the expectations both of 
readers who are thinking about the pre-expanded text, and of readers who are 
simulating the expansion in their head.

What does this have to do with parens around patterns? At the head of a 
scope box, ie, the part between the keyword introducing the clause and the 
open curly, when we see an identifier, in order to do scope analysis by the 
above rule, we need to be able to tell whether it's a use occurrence or a 
defining occurrence. In order to determine this, we need to know whether it's
in an expression, a pattern, or something else. Given that we can ask the
reader to memorize a small fixed set of "something else" cases (eg, 
identifier(patterns,...) after a "to"), the only remaining rule is: If it's 
surrounded with parens, it's an expression. If not, it's a pattern.

Should we ever allow user-defined macros, they would be constrained to play 
by this rule, enabling pre-expansion scope analysis to proceed using the 
left-to-right rule. As with the for-loop, rather than constrain macros only 
to produce expansions that preserve the scope analysis, we can instead 
reject just those programs that cause an expansion that would disrupt the 
pre-expansion scope analysis.

Other known occurrences of this bug in the current E implementation:

* Dean has objected that the current expansions of && and || can disrupt a 
pre-expansion scope analysis. I agree. The current expansions of these must 
be considered buggy on these grounds. Fortunately, an alternate expansion 
can fix this in all cases, never requiring a surprising static reject of a 
program.

* In the expansion of, for example, the quasi-pattern `-$i-@i-@j-$j-`, the 
relative order among the embedded dollar-hole expressions is preserved, and 
the relative order among the @-hole patterns is preserved, but all the 
extracted expressions occur in the expansion to the left of all the 
extracted patterns. As one would expect from the pre-expansion text, $i 
refers to the i already in scope, not that defined by @i. But $j also refers 
to the j already in scope, not the j defined by @j. This violates the 
left-to-right rule. E must instead statically reject the program, rather 
than allow a reader to be confused about the meaning of $j.


[*] JSE, The Java Syntactic Extender, is a possible exception. It is clearly
a fellow traveller, but I don't yet understand it well enough to know whether
it has the property we advocate here.

----------------------------------------
Text by me above is hereby placed in the public domain

        Cheers,
        --MarkM