Quasi-Literal Expressions and Patterns in E

Mark S. Miller markm@caplet.com
Sat, 17 Apr 1999 05:32:29 -0700


At 08:43 AM 4/16/99 , mzukowski@bco.com wrote:
>I'm not familiar enough with E or perl's regex syntax ...

I browsed around starting from www.antlr.org, and I see that you're a 
language type, so here's some background.  I'll then answer your message in 
a later email.  And thanks for stimulating me to write this -- it's now 
the first draft of the upcoming chapter on Quasi-Literals.


The E language *per se* has no regular expression syntax.  It has a framework 
for plugging in parsers.  A parser that plugs into this framework is called 
a quasi-parser, and an expression or pattern making use of a quasi-parser is 
called, respectively, a quasi-literal expression or a quasi-literal pattern. 
(Or just quasi-expression and quasi-pattern.)  This makes it easy to embed 
code written in specialized sublanguages within an E program.


			Why Quasi-Literals?


Consider the ancient C printf and scanf statements:

	printf("hello %s world\n", worldState);

as contrasted with the modern

	out.print("hello " + worldState + " world\n");

They both say approximately the same thing, but the notation of the first is 
optimized to make visible the nature of the resulting value:  It'll be like 
the format string but with the %-holes filled in.  The second form is instead 
optimized to make visible the nature of the computation that will produce 
this value.  Unfortunately for my explanation, it doesn't seem to make much 
difference on this case.

Though less familiar, the point is clearer with Lisp's quasi-quote, since it 
expands to a normal expression:

	`(hello ,worldState world)

which means "A list whose first element is the literal symbol 'hello, whose 
second element is the value of the variable worldState, and whose third 
element is the literal symbol 'world."  Like "%", "," marks a hole to be 
filled in at runtime.  This expression macro-expands to

	(cons 'hello (cons worldState '(world)))

which describes the process for making such a list.  In the first form, the 
nature of the list being produced is clear but the means of its production 
-- two cons operations -- is obscure.  In the second form the reverse is 
true.  

I believe this observation about the respective strengths and weaknesses of 
the two kinds of notations goes back to Alonzo Church, the inventor of 
Lambda.  For most purposes, more literal forms of expression should be 
preferred when possible, for the same reasons that user interfaces should 
work by "direct manipulation" where possible.  Much of the popularity of 
modern "scripting" languages -- Perl, Python, Tcl, etc -- is due to their 
support for quasi-literal expression of text by "interpolation" 
($-substitution) and for quasi-literal text pattern matching by regular 
expressions.

As the Lisp language shows, there's no reason a language can't enable 
literals (or quasi-literals) to express data types other than scalars and 
strings, but few languages do.  Lisp provides for the [quasi-]literal 
expression of lists.  But what about arbitrary user-defined objects?  By 
providing a pluggable quasi-parser framework, each quasi-parser designer can 
define their own syntax for [quasi-]literally expressing whatever data 
types they choose.  In closing below, we propose a radical future/past 
extension of this direction.


			Quasi-Literal Expressions


A quasi-literal expression consists of an optional quasi-parser name 
(defaulting to "simple" if left out), a quasi-quote (also known as 
backquote, "`"), a string in the grammar accepted by that quasi-parser, and 
a closing quasi-quote.  This string may also have several embedded  
"$" <identifier>  or  "$" "{" <E Expression> "}"  indicating values to be 
substituted in.  The quasi-parser sees the string without the embedded 
expression source.  For example:

	foo`some text${bar() + 3}more text$baz`

expands to

	quasiParsers["foo"] valueMaker("some text${0}more text${1}") \
		substitute([bar() + 3, baz])

The quasi-parser named "foo" is looked up, and is asked to created a 
ValueMaker from the transformed string.  The transformation extracts each 
expression and leaves behind a "$" "{" <digit>+ "}", ie, a "substitution 
hole".  Since this applies to all quasi-expressions, all quasi-grammars must 
recognize these substitution holes.  The resulting ValueMaker is then 
invoked with a list of the values of the extracted expressions, which the 
ValueMaker should "plug into" the corresponding substitution holes, 
in order to make whatever kind of value the quasi-parser took to be 
described by the quasi-string.  We currently have two quasi-parsers that 
support quasi-expressions: "simple" and "e".

The "simple" quasi-parser is indeed the simplest useful example consistent 
with these rules.  Except for the embedded holes, it takes the input text to 
just describe itself.  Substitution consists of stringification of the value, 
and string substitution.  It corresponds to the "interpolation" familiar 
from at least Perl, Tcl, and various shells:

	? `abc${2+3}def`
	# value: abc5def

The expression expands to

	quasiParsers["simple"] valueMaker("abc${0}def") substitute([2 + 3])


The "e" quasi-parser is simply E's own parser used as a quasi-parser.  It 
accepts E source-code as its quasi-grammar, and takes it to describe ASTs 
(which we call parse trees) of the post-expansion Kernel E language:

	? def sum := e`x + y`
	# value: e`x add(y)`

	? def prod := e`a * $sum * b`
	# value: e`a multiply(x add(y)) multiply(b)`

In E, "+" and "*" are just syntactic sugar for the "add" and "multiply" 
messages.  (Kernel E parse trees print as e`...` quasi-expressions as 
well.)  Since the substitution is structural, not textual, the fact that "*" 
binds tighter in E than "+" does not cause confusion.  "deriv.e" in the 
distribution is a symbolic differentiation package written in E using the e 
quasi-parser so that algebraic expressions can be represented as trees, but 
expressed in familiar infix-operator notation.

Notice that each time a quasi-expression is evaluated, the valueMaker() 
argument is the same, though the substitute() argument is usually different. 
Modulo some security issues beyond the scope of this email, quasi-parsers 
should therefore cache ValueMakers by transformed string.  It can then spend 
a lot of work "compiling" a ValueMaker so that substitute() can go faster.


			Quasi-Literal Patterns


A quasi-literal pattern (a quasi-pattern) is also written as an optional 
quasi-parser name, a quasi-quote, a quasi-string with embedded stuff, and a 
closing quasi quote.  This time the embedded stuff can consist of several 

		"$"<identifier>
	|	"$" "{" <E expression> "}"
	|	"@"<identifier>
	|	"@" "{" <E pattern> "}"

You can think of E's grammar as having two parsing contexts: expression and 
pattern.  A quasi-literal is parsed and transformed as a quasi-expression 
or a quasi-pattern based on its parsing context.  The quasi-pattern 
transformation is more complex:

	? "abcde" =~ `a@{x}d@y`
	# value: true

	? x
	# value: bc

	? y
	# value: e

As in Perl, "=~" is the pattern match operator.  It takes an expression on 
the left and a pattern on the right.  That first line expands to:

	("abcde" =~ _4q ? (quasiParsers["simple"] matchMaker("a@{0}d@{1}") \
				matchBind([], _4q) =~ [x, y]))

"_4q" is an E-expansion-generated temporary variable name.  The pattern 
constructs needed to understand the above pattern are:

	<identifier>			Always matches, and binds identifier 
					to the specimen.  This is the only form
					of variable definition in E.

	<pattern> "?" <expression>	"?" is pronounced "such that".  First 
					matches <pattern> to the specimen, if
					it succeeds, evaluate <expression> in 
					the resulting scope.  If it's value is 
					true, then the such-that pattern succeeds.

	[<pattern>,...]			Matches a list of the same length by 
					matching each element against the 
					corresponding pattern.

So, reading from the inside out, the quasi-parser named "simple" is asked to 
produce a MatchMaker from the string "a@{0}d@{1}", which is presumed to be 
in the quasi-grammar understood by this quasi-parser.  (Once again, it pays 
for the quasi-parser to cache a compiled MatchMaker indexed by this string.) 
The MatchMaker is then asked to "matchBind(args, specimen)" where the args 
are extracted from embedded $-expressions, as with the earlier
substitute(args).  'args' is here an empty list since there were no 
embedded $-expressions, and the specimen is here "abcde" by the logic of "?".

If the matchBind() fails, it returns null, which will not match any list 
pattern, even the empty list pattern, resulting in the overall match 
failing.  If the matchBind() succeeds, it returns a list with elements 
extracted from the specimen by @-holes, and put into the list at positions 
corresponding to the @-numbers.  The MatchMaker resulting from the "simple" 
quasi-parser just does the simplest useful anti-greedy (ascetic?) string 
match and extract without backtrack consistent with this framework.  Except 
for holes, all characters just represent themselves and must match exactly.

As with quasi-expressions, none of this is specific to text.  A MatchMaker 
can accept whatever kind of object it wishes as specimen and extract 
whatever it wishes as the bindings.  As you might guess, the e quasi-parser 
can also be used to match parse trees and extract subtrees.  A portion of 
deriv.e:


	#
	# What's the derivative of expr wrt var?  Ie, what's "d(expr)/d(var)"?
	#
	define deriv(expr, var) {
	
	    switch (expr) {
	
	        match e`$var ** @exp` ? (isConst(exp))  {
	            e`$exp * $var ** ($exp -1)`
	        }
	        match e`@a + @b` {
	            define da := deriv(a, var)
	            define db := deriv(b, var)
	            e`$da + $db`


The last 3 lines could also have been written

	            e`${deriv(a, var)} + $(deriv(b, var)}`

depending on your taste.  These two examples show the similarity between the 
match-bind-substitute styles of programming as practiced by "scripting" 
languages on text, and that practiced by Prolog on trees of symbols.


		Regular Expressions as Quasi-Patterns


The third quasi-parser currently available in E is named "rx", standing for 
"Regular eXression".  "rx" takes a full Perl-5 regular expression engine 
obtained from http://www.oroinc.com (now http://www.savarese.org ) and wraps 
it to turn it into a quasi-parser.  (This quasi-parser is written in E and 
is found in the E distribution as "PerlMatchMakerMaker.emaker".)  As 
expected, most of the work is in the treatment of holes.  The equivalent of 
the above string match written using "rx" rather than "simple" would be:

	? "abcde" =~ rx`a(@{x}.*?)d(@y.*)`

The expansion is the same of course, except that the "rx" quasi parser is 
looked up instead of "simple", and the transformed string fed to 
matchMaker() is

	"a(@{0}.*?)d(@{1}.*)"

Parentheses play a role in Perl regular expressions related to E's @-holes: 
they extract the part of the specimen matched against the sub-pattern they 
contain.  To generate a proper Perl regular expression, the "rx" wrapper 
removes the @-holes leaving

	"a(.*?)d(.*)"

but verifies that each @-hole was at the beginning of such a parenthesized 
group, and remembers which group it is.  The (cached) MatchMaker consists of 
a compiled Perl Pattern object plus the correspondence between 
paren-group-number and @-number.  On a successful match it returns the 
bindings in a list ordered according to the original @-numbers.

Unfortunately, to do this, "rx" must correctly distinguish between an "(" 
that starts a Perl-saved-group and one that doesn't.  We currently impose 
some restrictions (reject some valid Perl patterns) to make this recognition 
easier, but we still don't always get this right.  Fortunately, the known 
bugs here haven't yet bitten me on any real examples, but this needs to be 
fixed.  Ideally, the OroMatcher code could be enhanced so "rx" could ask 
Pattern where it thinks the open parens of save-groups are.  I'll be asking 
oroinc/savarese.org if they'd be interested it adding this query.


		Long Term Future Directions: Hypercard


Our long term future takes us back to the conversation with Scott Kim 
http://www.scottkim.com/ that pointed us in this direction long ago.  
Hypercard had just come out and was obviously powerful in a way most of 
had not seen before.  Scott had a unique perspective on this power: 

Most programming languages have as literal data types things like numbers, 
characters, and strings.  Lisp, in addition, has lists, but all these only 
express literals using text strings.  Why?  Because the program itself is a 
text string.  A weird pun is going on: when you put the insertion point of 
your text editor between the quotes of a literal string you're effectively 
no longer in a program editor, but in a nested text editor.  The text you 
are editing is -- almost -- simply the text that that literal evaluates to.  
Whereas normally programming is indirect, here we're operating by direct 
manipulation -- creating the actual output directly.

Hypercard is normally conceived of as primarily a visual application thing 
with an embedded silly programming language (Hypertalk).  Think instead of a 
whole Hypercard stack as a mostly literal program.  In addition to 
numbers and strings, the literals you can directly edit include bitmaps, 
forms, buttons, and menus.  You can literally assemble most things, but 
where you need dynamic behavior, there's a hole in the literal filled in by 
a Hypertalk script.  The program editor for this script is experienced as 
being nested inside the direct manipulation user interface editor.

It's one of those 15 minute conversations that change your life.

Though there have been many many visual interface builders since Hypercard, 
I'm not aware of any that have pursued this analogy between partially 
literal user-interface construction and Lisp quasi-quote.  How should E 
eventually pursue this?

Well, we could get fancy and say that source code should stop being text, 
and should instead be a graph of objects.  Unfortunately, though this 
approach has paid off in the past, it is becoming increasingly less possible, 
as the world becomes increasingly more wedded to text.  (Overall net 
progress can obscure many cases of horrifying regress, and the computer 
world is rife with them.)  Fortunately, there's a low road:

The Macintosh Resource Fork and ResEdit was a crude but effective form for 
direct manipulation editing of an extensible set of non-textual literal data 
types.  They fell short of what we seek in only two regards: 1) The literals 
weren't "quasi", ie, they were fully literal, rather than having holes to be 
filled in by more code.  2) They weren't meaningfully embedded within the 
program text -- not visually, and not scoping-wise.  As pure literals, the 
lack of scope embedding doesn't matter.  But once you fix #1, you must make 
sure the embedded expressions are in their lexical context.

Elmer currently uses the Java Swing plain text editor, a sibling of the Java 
Swing compound document editor (of which the wysiwyg html editor is a 
subclass).  Java also has a resource mechanism (which I'd guess was inspired 
by the Macintosh's), and Java can serialize/unserialize object graphs into 
such resources.

A future E compound document editor could generate and recognize, for 
example, the expression

	reseditor`<resource name> foo=${expr0} bar=${expr1}`

unserialize the resource with this name into (hopefully) a ValueMaker that's 
also an editable Model, and embed at this place in the program editor a 
nested editor for this type of Model.  Of course, to preserve the 
quasi-literal illusion, the editor for the Model should seem like the editor 
for the data type the ValueMaker makes (in response to substitute()), but it 
would in turn contain embedded program editors at the positions in the Model 
corresponding to the names "foo" and "bar".  Perhaps the literal plane 
should be shown above the program plane using shadows.  The embedded program 
editors would be shown, in wells using reverse shadows, to be on the same 
plane (and so the same scope) as the surrounding program.  On "Save", the 
editor would serialize the ValueMaker back into resource land.

Similarly, of course, for serialized MatchMakers.

Besides user interface construction, what would you use this for?  How about 
generating graphics?  Right now, you can either write computer graphics code, 
which is very indirect but very general; or use a drawing program, which 
works great as long as what you're drawing fits its limitations.  What if 
you're not even trying to produce a single drawing, but rather to produce a 
program that does data dependent drawing?  Currently you'd be stuck with a 
mostly indirect way of expressing yourself.  It would seem the ability 
to intimately mix elements expressed at different levels of literalness 
could let us use the best of both on the same task.

Or (borrowing slightly from the Garden system) how about embedding a 
graphical state diagram or state chart editor.  The transition arcs could
have holes containing actions expressed in code that's in the scope of the 
lexical context containing the embedded state diagram.


				Acknowledgement


The quasi-literals of E owe a tremendous debt to Alonzo Church, Lisp 
quasi-quote, Scott Kim, Dean Tribble, and Joule.  Thanks!!


E is an Open Everything project, including Open Source.  Feel free to forward.

	Cheers,
	--MarkM