URI Expressions (was: list slicing and splicing)

Mark S. Miller markm@erights.org
Tue, 03 Nov 1998 19:12:55 -0800


At 12:37 AM 10/15/98 , Ka-Ping Yee wrote:
>> 	uriGetters["x"]["y"]
>
>When i first saw uriGetter[] i read uriGeller[] and did
>a double-take. :)

[?] Is that an emoticon at the end of your sentence or a bentspoonicon?


>*sigh*  I have been slightly uncomfortable about the idea
>of allowing bare URLs in the language for a little while.  I do
>think it is extremely cool to have some support for accessing URLs...
>but have you considered what happens when the URL is, say,
>
>    http://site.com/go.cgi?words=one%2ftwo+three&area=all#anchor
>         ^                ^         ^     ^       ^      ^
>      comment?        suchthat?    operators?   slot? comment?
>
>Note that RFC1738 also permits ; : @ $ ! * ' ( ) ,
>
>Luckily { } \ are outlawed.
>
>The parentheses may confuse editors; the apostrophe may confuse
>font-lock; the comma will require that you put a space after
>a URL if you want to use it as an argument, e.g.:

[?] RFC1738, hmmm?  How about RFC2396?
I've been using http://www.ics.uci.edu/pub/ietf/uri/rfc2396.txt which
includes the above characters and several more, but (importantly for E)
excludes the comma.  Which is the right RFC?  Assuming that the comma is
properly excluded, my simple solution to the ambiguity issue is to require
the first non-uri character to be a comma or whitespace, and to disallow
unbalanced parens.  Here's some text from the E book draft that'll go up on
the erights web site as soon as Jonathan & I get some cvs weirdness
straightened out.  "***" indicates notes to myself that may not be
appropriate for the book.


---------------------------------------------------------------------

The URLs familiar from web browsing -- such as http://www.erights.org
-- are actually a special case of a syntax known as a URI. (Not that
it matters, but URL stands for Uniform Resource Locator, while URI
stands for Uniform Resource Identifier.) E's grammar recognizes legal
URIs (and therefore legal URLs) as URI literal expressions. A URI
literal expression starts with a protocol-identifier, such as "http"
or "file", followed immediately by a colon, followed immediately by
the uri-body characters, such as "//www.erights.org" or
"/jabberwocky.txt":

       protocol-identifier:uri-body

The protocol-identifier is a simple identifier that names a
URI-protocol handler. It is up to the named handler to interpret the
uri-body and return the value of the URI expression.  The value of a
URI expression will usually depend on which protocol handler is named.
The URI body must begin with one of these characters:

       a-z, A-Z, 0-9, any of ;/?@&+$-_.!~*'().%#\|

After the first character, the URI body consists of characters from
this set, plus "=" and ":". The URI literal expression is terminated
by a comma or a whitespace character, including a newline (so your
readers don't need to remember which characters are not in the above
set in order to know when the URI literal expression is over).
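
As a rough illustration of the lexical rule just described, here is a
minimal Python sketch. It is not E's actual lexer. The names
URI_LITERAL and scan_uri_literal are made up for this example, the
shape of the protocol-identifier (a letter or underscore followed by
letters, digits, or underscores) is an assumption, and treating end of
input as a terminator is also an assumption. The character sets are
copied from the text above.

```python
import re

# First-character set copied from the draft: a-z, A-Z, 0-9, ;/?@&+$-_.!~*'().%#\|
FIRST = r"A-Za-z0-9;/?@&+$\-_.!~*'().%#\\|"
# Later characters may also be "=" and ":".
REST = FIRST + "=:"

URI_LITERAL = re.compile(
    # Assumed identifier shape; the draft only says "simple identifier".
    r"(?P<protocol>[A-Za-z_][A-Za-z0-9_]*):"
    r"(?P<body>[" + FIRST + r"][" + REST + r"]*)"
    # The literal must be followed by a comma or whitespace
    # (end of input is accepted here as an assumption).
    r"(?=[,\s]|$)"
)


def scan_uri_literal(text, pos=0):
    """Return (protocol, body, end_index) if a URI literal starts at pos, else None."""
    m = URI_LITERAL.match(text, pos)
    if m is None:
        return None
    return m.group("protocol"), m.group("body"), m.end()


print(scan_uri_literal("http://www.erights.org, x"))
print(scan_uri_literal("x:=3"))
```

With this sketch, "x:=3" and "a::b" are not scanned as URI literals,
because "=" and ":" are not allowed as the first uri-body character.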

*** details: E doesn't quite recognize all legal URIs. There are two
exceptions. First, "=" and ":" are excluded as the first uri-body
character, since they would cause confusion with ":=" (assignment) and
"::" (reserved for future use). Second, unbalanced parens are
disallowed, since it's a common error to leave out the required space
between the URI and an E paren. Better to catch the syntax error where
the actual mistake happened. ***
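
The paren rule in the note above could be checked along these lines.
This is only a sketch in Python, and the function name
has_balanced_parens is invented for the example; how E itself performs
the check is not stated here.

```python
def has_balanced_parens(uri_body):
    """Return True if every "(" in uri_body has a matching later ")"."""
    depth = 0
    for ch in uri_body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ")" with no "(" before it, e.g. a URI written
                # directly against an enclosing call's close paren.
                return False
    return depth == 0


print(has_balanced_parens("//site.com/a(b)"))
print(has_balanced_parens("//site.com/a)"))
```

The second call is the case the note is worried about: a URI with no
space between it and a closing paren that belongs to the surrounding
expression.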

*** details: \ and | are not actually legal URI characters, but
appear in Netscape and/or Internet Explorer (clarify this) to mean /
and : respectively. Accordingly, E normalizes \ to / and | to : before
passing these to the protocol handler. ***
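
The normalization in the note above amounts to a two-character
translation. A minimal Python sketch, with normalize_uri_body as an
invented name and the sample path as an invented input:

```python
# Map "\" to "/" and "|" to ":", as described in the note above.
_NORMALIZE = str.maketrans({"\\": "/", "|": ":"})


def normalize_uri_body(uri_body):
    """Return uri_body with "\\" replaced by "/" and "|" replaced by ":"."""
    return uri_body.translate(_NORMALIZE)


# The Python literal below is the text //c|\docs\a.txt
print(normalize_uri_body("//c|\\docs\\a.txt"))
```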