[e-lang] Source code character sets and Unicode
Mark S. Miller
markm at cs.jhu.edu
Mon May 28 23:48:17 EDT 2007
Ka-Ping Yee wrote:
> On Mon, 28 May 2007, Mark S. Miller wrote:
>> I have been thinking of requiring something like a
>>
>> pragma.charset("unicode")
>>
>> before allowing non-ascii characters. This discussion is probably
>> an opportune time to decide on this matter.
>
> I'm thinking of something like:
>
> 1. Choose a "basic character set" whose characters are all
> clearly distinguishable in (almost all) commonly available
> fonts on every supported platform, and whose characters are all
> supported by almost all editors (e.g. ASCII or Wysiwyg-ASCII).
>
> 2. By default, forbid non-basic characters in source files.
>
> 3. Choose a well-defined, bounded region of the source file,
> preferably guaranteed to be visible in common contexts (e.g.
> the first 80 characters or first two lines, whichever is shorter).
>
> 4. Always allow only the basic character set within that region,
> regardless of any configurable settings or declarations.
>
> 5. Require a declaration to appear within that region in order to
> allow non-basic characters in the rest of the source file.
E tries for a certain uniformity between source-file rules, interactive
command-line rules, and updoc rules. In one sense, updoc scripts are a kind of
source file. But updoc scripts illustrate command-line interactions, and so
must reflect their rules accurately.
So in order to answer your question as applied to updoc scripts, we must also
answer for the command line. That's why a rule like "must appear in the first
two lines" would be awkward for us. We haven't adopted such a rule anywhere
else. Even

    pragma.syntax("0.9")

can currently appear farther down the page. However, we probably will adopt a
rule that a pragma must appear at top level, i.e., not nested within a top
level expression.
With this adjustment, I believe our proposals are in agreement so far. Our
"basic character set" is ASCII, not Wysiwyg-ASCII. The region would be from
the top of the file to the first pragma that changed the character set.
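To make the region rule concrete, here is a rough sketch in Python of how a checker might enforce it. It is only an illustration under several assumptions: the pragma spelling pragma.charset("unicode") is the proposal from earlier in this thread and not an implemented feature, the function name first_charset_violation is made up, and "top level" is approximated as "starts in column 1 on a line by itself", which a real parser would decide properly.

```python
import re

# Hypothetical pragma spelling, taken from the proposal earlier in the thread;
# it is not an implemented E feature.
CHARSET_PRAGMA = re.compile(r'^pragma\.charset\("unicode"\)\s*$')


def first_charset_violation(source):
    """Return (line, column) of the first non-ASCII character found in the
    region from the top of the text through the charset pragma line, or
    None if that region is pure ASCII."""
    for line_number, line in enumerate(source.split("\n"), start=1):
        # Every line in the region, including the pragma line, must be ASCII.
        for column, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                return (line_number, column)
        # Crude stand-in for "at top level": the pragma must start in
        # column 1. Once it is seen, the restricted region ends.
        if CHARSET_PRAGMA.match(line):
            return None
    return None
```

Because the check is driven by the position of the pragma and not by a fixed line count, the same function could be applied to an updoc script or to the text of a command-line session.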
> Side question: why does Wysiwyg-ASCII allow '\r'?
Because on Windows it is too hard to prevent other tools from writing \r\n for
a newline.
> Why not simply
> define Wysiwyg-ASCII as the regular expression [\n -~]*?
This regular expression does not prohibit trailing whitespace. Since E has
multi-line string literals, such trailing whitespace is semantically
significant, but is invisible on the page. By ensuring that a file is in
Wysiwyg-ASCII before printing, you can avoid this hazard.
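A small Python sketch of the hazard, for illustration only. It checks the expression from the question and, separately, looks for spaces at the end of a line. The helper names are made up, and this is not the full Wysiwyg-ASCII definition, which has other rules (for example the '\r' allowance discussed above).

```python
import re

# The expression from the question: newline plus printable ASCII.
SIMPLE = re.compile(r'[\n -~]*')
# One or more spaces immediately before a newline or the end of the text.
TRAILING_SPACE = re.compile(r' +(?=\n|\Z)')


def matches_simple_regex(text):
    """True if the whole text consists of newlines and printable ASCII."""
    return SIMPLE.fullmatch(text) is not None


def has_trailing_space(text):
    """True if any line ends in one or more spaces."""
    return TRAILING_SPACE.search(text) is not None


# A two-line string literal whose first line ends in an invisible space.
sample = 'def s := "first \nsecond"\n'
print(matches_simple_regex(sample))  # the simple expression accepts it
print(has_trailing_space(sample))    # but the trailing space is there
```

The sample passes the simple expression even though the space after "first" is part of the string's value and would not be visible on a printout.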
--
Text by me above is hereby placed in the public domain
Cheers,
--MarkM