Quasi-Literal Expressions and Patterns in E
Mark S. Miller
markm@caplet.com
Sat, 17 Apr 1999 05:32:29 -0700
At 08:43 AM 4/16/99 , mzukowski@bco.com wrote:
>I'm not familiar enough with E or perl's regex syntax ...
I browsed around starting from www.antlr.org, and I see that you're a
language type, so here's some background. I'll then answer your message in
a later email. And thanks for stimulating me to write this -- it's now
the first draft of the upcoming chapter on Quasi-Literals.
The E language *per se* has no regular expression syntax. It has a framework
for plugging in parsers. A parser that plugs into this framework is called
a quasi-parser, and an expression or pattern making use of a quasi-parser is
called, respectively, a quasi-literal expression or a quasi-literal pattern.
(Or just quasi-expression and quasi-pattern.) This makes it easy to embed
code written in specialized sublanguages within an E program.
Why Quasi-Literals?
Consider the ancient C printf and scanf statements:
printf("hello %s world\n", worldState);
as contrasted with the modern
out.print("hello " + worldState + " world\n");
They both say approximately the same thing, but the notation of the first is
optimized to make visible the nature of the resulting value: It'll be like
the format string but with the %-holes filled in. The second form is instead
optimized to make visible the nature of the computation that will produce
this value. Unfortunately for my explanation, it doesn't seem to make much
difference on this case.
Though less familiar, the point is clearer with Lisp's quasi-quote, since it
expands to a normal expression:
`(hello ,worldState world)
which means "A list whose first element is the literal symbol 'hello, whose
second element is the value of the variable worldState, and whose third
element is the literal symbol 'world." Like "%", "," marks a hole to be
filled in at runtime. This expression macro-expands to
(cons 'hello (cons worldState '(world)))
which describes the process for making such a list. In the first form, the
nature of the list being produced is clear but the means of its production
-- two cons operations -- is obscure. In the second form the reverse is
true.
I believe this observation about the respective strengths and weaknesses of
the two kinds of notations goes back to Alonzo Church, the inventor of
Lambda. For most purposes, more literal forms of expression should be
preferred when possible, for the same reasons that user interfaces should
work by "direct manipulation" where possible. Much of the popularity of
modern "scripting" languages -- Perl, Python, Tcl, etc -- is due to their
support for quasi-literal expression of text by "interpolation"
($-substitution) and for quasi-literal text pattern matching by regular
expressions.
As the Lisp language shows, there's no reason a language can't enable
literals (or quasi-literals) to express data types other than scalars and
strings, but few languages do. Lisp provides for the [quasi-]literal
expression of lists. But what about arbitrary user-defined objects? By
providing a pluggable quasi-parser framework, each quasi-parser designer can
define their own syntax for [quasi-]literally expressing whatever data
types they choose. In closing below, we propose a radical future/past
extension of this direction.
Quasi-Literal Expressions
A quasi-literal expression consists of an optional quasi-parser name
(defaulting to "simple" if left out), a quasi-quote (also known as
backquote, "`"), a string in the grammar accepted by that quasi-parser, and
a closing quasi-quote. This string may also have several embedded
"$" <identifier> or "$" "{" <E Expression> "}" indicating values to be
substituted in. The quasi-parser sees the string without the embedded
expression source. For example:
foo`some text${bar() + 3}more text$baz`
expands to
quasiParsers["foo"] valueMaker("some text${0}more text${1}") \
substitute([bar() + 3, baz])
The quasi-parser named "foo" is looked up, and is asked to created a
ValueMaker from the transformed string. The transformation extracts each
expression and leaves behind a "$" "{" <digit>+ "}", ie, a "substitution
hole". Since this applies to all quasi-expressions, all quasi-grammars must
recognize these substitution holes. The resulting ValueMaker is then
invoked with a list of the values of the extracted expressions, which the
ValueMaker should "plug into" the corresponding substitution holes,
in order to make whatever kind of value the quasi-parser took to be
described by the quasi-string. We currently have two quasi-parsers that
support quasi-expressions: "simple" and "e".
The "simple" quasi-parser is indeed the simplest useful example consistent
with these rules. Except for the embedded holes, it takes the input text to
just describe itself. Substitution consists of stringification of the value,
and string substitution. It corresponds to the "interpolation" familiar
from at least Perl, Tcl, and various shells:
? `abc${2+3}def`
# value: abc5def
The expression expands to
quasiParsers["simple"] valueMaker("abc${0}def") substitute([2 + 3])
The "e" quasi-parser is simply E's own parser used as a quasi-parser. It
accepts E source-code as its quasi-grammar, and takes it to describe ASTs
(which we call parse trees) of the post-expansion Kernel E language:
? def sum := e`x + y`
# value: e`x add(y)`
? def prod := e`a * $sum * b`
# value: e`a multiply(x add(y)) multiply(b)`
In E, "+" and "*" are just syntactic sugar for the "add" and "multiply"
messages. (Kernel E parse trees print as e`...` quasi-expressions as
well.) Since the substitution is structural, not textual, the fact that "*"
binds tighter in E than "+" does not cause confusion. "deriv.e" in the
distribution is a symbolic differentiation package written in E using the e
quasi-parser so that algebraic expressions can be represented as trees, but
expressed in familiar infix-operator notation.
Notice that each time a quasi-expression is evaluated, the valueMaker()
argument is the same, though the substitute() argument is usually different.
Modulo some security issues beyond the scope of this email, quasi-parsers
should therefore cache ValueMakers by transformed string. It can then spend
a lot of work "compiling" a ValueMaker so that substitute() can go faster.
Quasi-Literal Patterns
A quasi-literal pattern (a quasi-pattern) is also written as an optional
quasi-parser name, a quasi-quote, a quasi-string with embedded stuff, and a
closing quasi quote. This time the embedded stuff can consist of several
"$"<identifier>
| "$" "{" <E expression> "}"
| "@"<identifier>
| "@" "{" <E pattern> "}"
You can think of E's grammar as having two parsing contexts: expression and
pattern. A quasi-literal is parsed and transformed as a quasi-expression
or a quasi-pattern based on its parsing context. The quasi-pattern
transformation is more complex:
? "abcde" =~ `a@{x}d@y`
# value: true
? x
# value: bc
? y
# value: e
As in Perl, "=~" is the pattern match operator. It takes an expression on
the left and a pattern on the right. That first line expands to:
("abcde" =~ _4q ? (quasiParsers["simple"] matchMaker("a@{0}d@{1}") \
matchBind([], _4q) =~ [x, y]))
"_4q" is an E-expansion-generated temporary variable name. The pattern
constructs needed to understand the above pattern are:
<identifier> Always matches, and binds identifier
to the specimen. This is the only form
of variable definition in E.
<pattern> "?" <expression> "?" is pronounced "such that". First
matches <pattern> to the specimen, if
it succeeds, evaluate <expression> in
the resulting scope. If it's value is
true, then the such-that pattern succeeds.
[<pattern>,...] Matches a list of the same length by
matching each element against the
corresponding pattern.
So, reading from the inside out, the quasi-parser named "simple" is asked to
produce a MatchMaker from the string "a@{0}d@{1}", which is presumed to be
in the quasi-grammar understood by this quasi-parser. (Once again, it pays
for the quasi-parser to cache a compiled MatchMaker indexed by this string.)
The MatchMaker is then asked to "matchBind(args, specimen)" where the args
are extracted from embedded $-expressions, as with the earlier
substitute(args). 'args' is here an empty list since there were no
embedded $-expressions, and the specimen is here "abcde" by the logic of "?".
If the matchBind() fails, it returns null, which will not match any list
pattern, even the empty list pattern, resulting in the overall match
failing. If the matchBind() succeeds, it returns a list with elements
extracted from the specimen by @-holes, and put into the list at positions
corresponding to the @-numbers. The MatchMaker resulting from the "simple"
quasi-parser just does the simplest useful anti-greedy (ascetic?) string
match and extract without backtrack consistent with this framework. Except
for holes, all characters just represent themselves and must match exactly.
As with quasi-expressions, none of this is specific to text. A MatchMaker
can accept whatever kind of object it wishes as specimen and extract
whatever it wishes as the bindings. As you might guess, the e quasi-parser
can also be used to match parse trees and extract subtrees. A portion of
deriv.e:
#
# What's the derivative of expr wrt var? Ie, what's "d(expr)/d(var)"?
#
define deriv(expr, var) {
switch (expr) {
match e`$var ** @exp` ? (isConst(exp)) {
e`$exp * $var ** ($exp -1)`
}
match e`@a + @b` {
define da := deriv(a, var)
define db := deriv(b, var)
e`$da + $db`
The last 3 lines could also have been written
e`${deriv(a, var)} + $(deriv(b, var)}`
depending on your taste. These two examples show the similarity between the
match-bind-substitute styles of programming as practiced by "scripting"
languages on text, and that practiced by Prolog on trees of symbols.
Regular Expressions as Quasi-Patterns
The third quasi-parser currently available in E is named "rx", standing for
"Regular eXression". "rx" takes a full Perl-5 regular expression engine
obtained from http://www.oroinc.com (now http://www.savarese.org ) and wraps
it to turn it into a quasi-parser. (This quasi-parser is written in E and
is found in the E distribution as "PerlMatchMakerMaker.emaker".) As
expected, most of the work is in the treatment of holes. The equivalent of
the above string match written using "rx" rather than "simple" would be:
? "abcde" =~ rx`a(@{x}.*?)d(@y.*)`
The expansion is the same of course, except that the "rx" quasi parser is
looked up instead of "simple", and the transformed string fed to
matchMaker() is
"a(@{0}.*?)d(@{1}.*)"
Parentheses play a role in Perl regular expressions related to E's @-holes:
they extract the part of the specimen matched against the sub-pattern they
contain. To generate a proper Perl regular expression, the "rx" wrapper
removes the @-holes leaving
"a(.*?)d(.*)"
but verifies that each @-hole was at the beginning of such a parenthesized
group, and remembers which group it is. The (cached) MatchMaker consists of
a compiled Perl Pattern object plus the correspondence between
paren-group-number and @-number. On a successful match it returns the
bindings in a list ordered according to the original @-numbers.
Unfortunately, to do this, "rx" must correctly distinguish between an "("
that starts a Perl-saved-group and one that doesn't. We currently impose
some restrictions (reject some valid Perl patterns) to make this recognition
easier, but we still don't always get this right. Fortunately, the known
bugs here haven't yet bitten me on any real examples, but this needs to be
fixed. Ideally, the OroMatcher code could be enhanced so "rx" could ask
Pattern where it thinks the open parens of save-groups are. I'll be asking
oroinc/savarese.org if they'd be interested it adding this query.
Long Term Future Directions: Hypercard
Our long term future takes us back to the conversation with Scott Kim
http://www.scottkim.com/ that pointed us in this direction long ago.
Hypercard had just come out and was obviously powerful in a way most of
had not seen before. Scott had a unique perspective on this power:
Most programming languages have as literal data types things like numbers,
characters, and strings. Lisp, in addition, has lists, but all these only
express literals using text strings. Why? Because the program itself is a
text string. A weird pun is going on: when you put the insertion point of
your text editor between the quotes of a literal string you're effectively
no longer in a program editor, but in a nested text editor. The text you
are editing is -- almost -- simply the text that that literal evaluates to.
Whereas normally programming is indirect, here we're operating by direct
manipulation -- creating the actual output directly.
Hypercard is normally conceived of as primarily a visual application thing
with an embedded silly programming language (Hypertalk). Think instead of a
whole Hypercard stack as a mostly literal program. In addition to
numbers and strings, the literals you can directly edit include bitmaps,
forms, buttons, and menus. You can literally assemble most things, but
where you need dynamic behavior, there's a hole in the literal filled in by
a Hypertalk script. The program editor for this script is experienced as
being nested inside the direct manipulation user interface editor.
It's one of those 15 minute conversations that change your life.
Though there have been many many visual interface builders since Hypercard,
I'm not aware of any that have pursued this analogy between partially
literal user-interface construction and Lisp quasi-quote. How should E
eventually pursue this?
Well, we could get fancy and say that source code should stop being text,
and should instead be a graph of objects. Unfortunately, though this
approach has paid off in the past, it is becoming increasingly less possible,
as the world becomes increasingly more wedded to text. (Overall net
progress can obscure many cases of horrifying regress, and the computer
world is rife with them.) Fortunately, there's a low road:
The Macintosh Resource Fork and ResEdit was a crude but effective form for
direct manipulation editing of an extensible set of non-textual literal data
types. They fell short of what we seek in only two regards: 1) The literals
weren't "quasi", ie, they were fully literal, rather than having holes to be
filled in by more code. 2) They weren't meaningfully embedded within the
program text -- not visually, and not scoping-wise. As pure literals, the
lack of scope embedding doesn't matter. But once you fix #1, you must make
sure the embedded expressions are in their lexical context.
Elmer currently uses the Java Swing plain text editor, a sibling of the Java
Swing compound document editor (of which the wysiwyg html editor is a
subclass). Java also has a resource mechanism (which I'd guess was inspired
by the Macintosh's), and Java can serialize/unserialize object graphs into
such resources.
A future E compound document editor could generate and recognize, for
example, the expression
reseditor`<resource name> foo=${expr0} bar=${expr1}`
unserialize the resource with this name into (hopefully) a ValueMaker that's
also an editable Model, and embed at this place in the program editor a
nested editor for this type of Model. Of course, to preserve the
quasi-literal illusion, the editor for the Model should seem like the editor
for the data type the ValueMaker makes (in response to substitute()), but it
would in turn contain embedded program editors at the positions in the Model
corresponding to the names "foo" and "bar". Perhaps the literal plane
should be shown above the program plane using shadows. The embedded program
editors would be shown, in wells using reverse shadows, to be on the same
plane (and so the same scope) as the surrounding program. On "Save", the
editor would serialize the ValueMaker back into resource land.
Similarly, of course, for serialized MatchMakers.
Besides user interface construction, what would you use this for? How about
generating graphics? Right now, you can either write computer graphics code,
which is very indirect but very general; or use a drawing program, which
works great as long as what you're drawing fits its limitations. What if
you're not even trying to produce a single drawing, but rather to produce a
program that does data dependent drawing? Currently you'd be stuck with a
mostly indirect way of expressing yourself. It would seem the ability
to intimately mix elements expressed at different levels of literalness
could let us use the best of both on the same task.
Or (borrowing slightly from the Garden system) how about embedding a
graphical state diagram or state chart editor. The transition arcs could
have holes containing actions expressed in code that's in the scope of the
lexical context containing the embedded state diagram.
Acknowledgement
The quasi-literals of E owe a tremendous debt to Alonzo Church, Lisp
quasi-quote, Scott Kim, Dean Tribble, and Joule. Thanks!!
E is an Open Everything project, including Open Source. Feel free to forward.
Cheers,
--MarkM