[e-lang] Draft: Handling Symbolic Data

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Sun Apr 18 22:31:31 EDT 2004


Mark S. Miller wrote:
> Runtime story:
> 
> A char may hold any Unicode code point, i.e., 
> any integer in the range 0..0x10FFFF (like an unsigned 21 bit integer). 
> However, a char is still distinct from an int in just the way it is now. A 
> String may be any sequence of any chars.

The problem with this is that it effectively requires a UTF-32 representation
if full Unicode is to be supported -- because indexing by code point is
impractically inefficient in variable-width encodings such as UTF-8 or UTF-16.
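
To make the cost concrete, here is a sketch of code-point indexing over
UTF-8 bytes (in Python, purely for illustration). The linear scan over
lead bytes is unavoidable, so random access degrades to O(n):

    def codepoint_at(data, n):
        # Return the n-th code point of the UTF-8 bytes `data`.
        # Note the linear scan: with a variable-width encoding there
        # is no constant-time map from code-point index to byte offset.
        seen = 0
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:              # lead byte: new code point
                if seen == n:
                    j = i + 1
                    while j < len(data) and data[j] & 0xC0 == 0x80:
                        j += 1                # skip continuation bytes
                    return ord(data[i:j].decode("utf-8"))
                seen += 1
        raise IndexError(n)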

A more pragmatic approach would be to define a string as being a sequence of
code points, tagged with a charset. To iterate over the code points of a
string independent of its charset, we could provide a virtual mapping from
indices to code points, used something like this:

   for index => cp in str.codepointIterator() {
       # the first code unit of the character cp is at str[index]
   }

(It is possible to lazily convert from almost all charsets to Unicode
with only constant lookahead.)
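
Such an iterator could be built on an incremental decoder; here is a rough
sketch in Python (the names are illustrative, not a proposed API):

    import codecs

    def codepoint_iterator(data, charset="utf-8"):
        # Lazily yield (index, codepoint) pairs, where index is the
        # offset of the first code unit of the character in `data`.
        # Assumes at most one code point completes per input byte,
        # which holds for UTF-8 and most stateless charsets.
        decoder = codecs.getincrementaldecoder(charset)()
        start = 0
        for i in range(len(data)):
            chars = decoder.decode(data[i:i+1])
            if chars:
                for ch in chars:
                    yield start, ord(ch)
                start = i + 1
        decoder.decode(b"", final=True)  # raises on a truncated final sequence

    for index, cp in codepoint_iterator("héllo".encode("utf-8")):
        print(index, hex(cp))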

Code that uses regexps to manipulate strings, or that only appends strings
represented in the same charset, would be independent of which charset that
is (much as in Perl, and I think also Python and Ruby).
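
Assuming the tagged representation is a (charset, code-units) pair,
same-charset append is then plain concatenation for stateless charsets,
with no transcoding needed (a rough sketch, again in Python):

    def append(a, b):
        # a string is modelled as a (charset, code_units) pair; for a
        # stateless charset, same-charset append is plain concatenation
        (cs_a, units_a), (cs_b, units_b) = a, b
        if cs_a != cs_b:
            raise ValueError("cross-charset append would need transcoding")
        return (cs_a, units_a + units_b)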

> E 0.9 implementation limit: A representable char may only be in the range 
> 0..0xFFFF (like an unsigned 16 bit int). An attempt to create a char larger 
> than is representable within an E platform's implementation limits MUST 
> signal an exception.
> 
> Source story:
> 
> We modify the explanation at http://www.erights.org/data/common-syntax.html 
> as follows: Where it specifies that a source text should undergo a UTF-8 
> decode, we instead specify that a source text should first undergo a UTF-8 
> decode, then (logically afterwards), any valid UTF-16 surrogate pairs should 
> be combined. This allows both ways of encoding a large (0x10000..0x10FFFF) 
> character to be equivalent: as a direct UTF-8 encoding of the large 
> character, or as a UTF-16 encoded pair of surrogates, each in turn encoded 
> in UTF-8.

The Unicode standard was changed to explicitly forbid doing this, for
security reasons: it is dangerous for some programs to recognise such
sequences while others do not. UTF-16-encoded-as-UTF-8 (essentially CESU-8)
is ill-formed, and MUST be rejected.
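
A conforming decoder makes the two forms visibly non-equivalent. For
example, in Python 3 (shown only to illustrate the rule):

    # U+10000 encoded directly in UTF-8 decodes fine:
    assert b"\xf0\x90\x80\x80".decode("utf-8") == "\U00010000"

    # The same character as a UTF-8-encoded surrogate pair
    # (D800 DC00, i.e. CESU-8 style) is ill-formed UTF-8:
    b"\xed\xa0\x80\xed\xb0\x80".decode("utf-8")  # raises UnicodeDecodeError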

I don't see any reason not to just require standard UTF-8 for source code.
String literals should be converted from UTF-8 to whatever encoding the E
implementation prefers for ASTs.

> Should any surrogate code points remain after these steps, the 
> source text MUST be statically rejected.
> 
> E 0.9 implementation limit: If the above decoding would result in a 
> character larger than an E platform's implementation limits, that source 
> text MUST be statically rejected. Given that E 0.9's limits are 0..0xFFFF, 
> it suffices to do a UTF-8 decode and reject the source text if this produces 
> either any large characters or any surrogate code points.

Agreed for the time being.
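
Concretely, the whole E 0.9 check then collapses to a strict UTF-8 decode
plus one range test -- sketched here in Python; a strict decoder already
rejects surrogate code points, so no separate surrogate check is needed:

    def check_e09_source(data):
        # raises UnicodeDecodeError on ill-formed UTF-8,
        # including any encoded surrogates
        text = data.decode("utf-8")
        for ch in text:
            if ord(ch) > 0xFFFF:   # beyond E 0.9's char limit
                raise ValueError("code point U+%06X exceeds 0xFFFF" % ord(ch))
        return text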

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




