[e-lang] Source code character sets and Unicode

Mark S. Miller markm at cs.jhu.edu
Mon May 28 23:48:17 EDT 2007


Ka-Ping Yee wrote:
> On Mon, 28 May 2007, Mark S. Miller wrote:
>> I have been thinking of requiring something like a
>>
>>      pragma.charset("unicode")
>>
>> before allowing non-ascii characters. This discussion is probably
>> an opportune time to decide on this matter.
> 
> I'm thinking of something like:
> 
> 1.  Choose a "basic character set" whose characters are all
>     clearly distinguishable in (almost all) commonly available
>     fonts on every supported platform, and whose characters are all
>     supported by almost all editors (e.g. ASCII or Wysiwyg-ASCII).
> 
> 2.  By default, forbid non-basic characters in source files.
> 
> 3.  Choose a well-defined, bounded region of the source file,
>     preferably guaranteed to be visible in common contexts (e.g.
>     the first 80 characters or first two lines, whichever is shorter).
> 
> 4.  Always allow only the basic character set within that region,
>     regardless of any configurable settings or declarations.
> 
> 5.  Require a declaration to appear within that region in order to
>     allow non-basic characters in the rest of the source file.

E tries for a certain uniformity between source-file rules, interactive 
command-line rules, and updoc rules. In one sense, updoc scripts are a kind of 
source file. But updoc scripts illustrate command-line interactions, and so 
must reflect their rules accurately.

So to answer your question for updoc scripts, we must also answer it for the 
command line. That's why a rule like "must appear in the first two lines" 
would be awkward for us. We haven't adopted such a rule anywhere else. Even

     pragma.syntax("0.9")

can currently appear farther down the page. However, we probably will adopt a 
rule that a pragma must appear at top level, i.e., not nested within a 
top-level expression.

With this adjustment, I believe our proposals are in agreement so far. Our 
"basic character set" is ASCII, not Wysiwyg-ASCII. The region would run from 
the top of the file to the first pragma that changes the character set.
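
As a rough illustration of that region rule, here is a minimal sketch in 
Python (not E, and not anything the E implementation actually does). It 
assumes the pragma is spelled exactly pragma.charset("unicode") and that "top 
level" can be approximated by "starts in column one"; the names 
CHARSET_PRAGMA and first_violation are made up for this example.

```python
import re

# Assumption: the pragma is written exactly like this, unindented.
# "Unindented" is only a crude stand-in for "at top level".
CHARSET_PRAGMA = re.compile(r'^pragma\.charset\("unicode"\)')


def first_violation(source: str):
    """Return the 1-based number of the first line that contains a
    non-ASCII character before any charset pragma has been seen,
    or None if there is no such line."""
    unicode_allowed = False
    for lineno, line in enumerate(source.split("\n"), start=1):
        # Inside the basic region, every line must be pure ASCII.
        if not unicode_allowed and not line.isascii():
            return lineno
        # The pragma line itself is still checked as ASCII above;
        # only the lines after it are released from the restriction.
        if CHARSET_PRAGMA.match(line):
            unicode_allowed = True
    return None
```

Under these assumptions, a file that declares the pragma before its first 
non-ASCII character passes, while one that uses such a character earlier (or 
only has the pragma nested inside an indented block) is reported by line 
number.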


> Side question: why does Wysiwyg-ASCII allow '\r'? 

Because on Windows it is too hard to prevent other tools from writing \r\n for 
a newline.


> Why not simply
> define Wysiwyg-ASCII as the regular expression [\n -~]*?

This regular expression does not prohibit trailing whitespace. Since E has 
multi-line string literals, such trailing whitespace is semantically 
significant, but is invisible on the page. By ensuring that a file is in 
Wysiwyg-ASCII before printing, you can avoid this hazard.
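
To make the point concrete, here is a small Python sketch showing that the 
proposed regular expression accepts text with trailing spaces. The 
has_trailing_space helper is only a hypothetical stand-in for the extra check; 
it is not the actual Wysiwyg-ASCII definition, which is not spelled out in 
this message.

```python
import re

# The regular expression from the question: newline plus printable ASCII.
PROPOSED = re.compile(r'[\n -~]*')

# Hypothetical extra check: one or more spaces right before a newline
# or at the very end of the text.
TRAILING_SPACE = re.compile(r' +(?=\n|\Z)')


def accepted_by_proposed(text: str) -> bool:
    """True if the whole text matches the proposed regular expression."""
    return PROPOSED.fullmatch(text) is not None


def has_trailing_space(text: str) -> bool:
    """True if any line of the text ends in one or more spaces."""
    return TRAILING_SPACE.search(text) is not None


# A multi-line string literal whose first line ends in invisible spaces.
sample = 'def s := "line one   \nline two"\n'
print(accepted_by_proposed(sample), has_trailing_space(sample))
```

The sample is accepted by the proposed expression even though its first line 
ends in spaces that would not show up on a printed page.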

-- 
Text by me above is hereby placed in the public domain

     Cheers,
     --MarkM

