[E-Lang] Operators #2: Comparison

Mark S. Miller markm@caplet.com
Sat, 07 Apr 2001 13:31:54 -0700


--=====================_172616577==_.ALT
Content-Type: text/plain; charset="us-ascii"

At 12:04 PM Saturday 4/7/01, Dean Tribble wrote:
>>    x op__cmp(y)
>>
>>should return either an integer or a float64, which therefore must either be
>>negative, zero, positive, or NaN, indicating that x < y, x <=> y, x > y, or
>>x is incomparable to y, respectively, in this natural order.  (Note that NaN
>>responds to all five zero-comparison queries with false.)
>
>This abstraction keeps making me itch.  One of the red flags is the phrase, 
>"either an integer or a number".

Actually, the phrase is "an integer or a float64".  I would be happy to say 
that the result must be a float64, but this is expensive for fully ordered 
types.  A looser requirement might say that it returns something that acts 
either like either negative, zero, positive or NaN in response to the five 
zero comparison messages, and to the message "isNaN()".  But I think I'd 
rather be more specific -- I see nothing to be gained by the looser spec.


>Another is the pervasive influence of NaN 
>when I've never actually encountered one in many years of programming.  The 
>third is the use of float64 at all, when lots of graphics, etc, profitably 
>use float32.  We'll take up the float question separately.

Just briefly.  We can't not support float64s.  float32s just don't have 
enough precision to get by, and float128s are usually overkill and not 
widely supported.  I also really don't want to have more than one floating 
point type.

High speed floating point code should be written in some lower level 
language like C or Java and linked in to E.  The key is that an E list of 
float64s be an unboxed "array of double", so this low level code can read 
and produce these lists without needless overhead.  If the low level code 
wants to operate on arrays of float32s instead, we could accommodate this 
without introducing a new type, but I'm inclined to make the low level code 
convert at the interface until we find this to be a problem.

See http://www.erights.org/enative/fatpointers.html#boxingData

>This is E.  You already have a concept more general than NaN (albeit 
>inspired by NaN) that seems like it could be used in its place: broken 
>promises.  Use it!  :-)

As I've already discussed with Dean verbally, we've been over this issue 
before on e-lang, and I'm not inclined to revisit it.  It's a difficult 
issue, and what we have works.  Does anyone have thread roots of that 
previous discussion handy?


>Special functions can exist to extract any relevant 
>NaN magic IEEE details for that handful of programmers that will *ever* 
>care, but for the rest of us, NaN is just an exception.  This combination 
>seems like it would simplify the semantics of E that programmers need to pay 
>attention to, and still be better than just about any other programming 
>language out there with regard to NaN and IEEE compliance.

Huh?  It would be worse than Java.  In Java scalar floating point 
arithmetic, NaNs are incomparable with NaNs, and 0.0 is equivalent to -0.0.


>Advantages:
>- comparable things are *reflexively* partially ordered.

Partial order of any variety still requires "x op__cmp(y)" to return 
something that's not negative, zero, or positive.


>>Subset-ness
>>
>>* On Java classes (and by extension perhaps, on declared E types, should we
>>ever declare such things, as we probably will): x <= y means that x is a
>>declared subtype of y.  If y is a class, then x is y or a class that extends
>>a class that's <= y.  If y is an interface, then x is y or x is a class or
>>interface that extends or implements a class that's <= y.  x < y means that
>>x is a declared strict subtype of y, that x < y && !(x <=> y).
>
>Given that E's semantics are based on method names, this is a bad idea.  I 
>could perfectly well have two classes that are subtypes from E's 
>perspective, but unrelated in Java's subset graph.  There are theories of 
>subtyping which could make sense of E's semantics, and they are *not* 
>consistent with the above subtyping rules.  Unless there is a great need for 
>this, don't do it.  (Though feel free to make an unrelated set of methods 
>for asking the above question of Java classes.)

Agreed.  See recent response to Zooko.


>>* On regions: x <= y means that the set of members of region x is a subset
>>of the set of members of region y.
>
>If the ordering is different between two regions in the same coordinate 
>space, are they different?  If not, it might make sense to consider that 
>'<=' for aggregates take enumeration order into account.  That would enable 
>more semantic similarity with Lists, if it could be done.

Regions, not being enumerable, don't need a parameterizable order, so I 
didn't define them to have one.  I think this may be different than Xanadu 
regions, but I like it better.  A coordinate space may very well still have 
orderings associated with it, but these need not be a property of a region.

In any case, if it were, then for two regions r1 and r2 that contained the 
same set of positions but with different orders:

    r1 == r2   ->   false
    r1 <=> r2  ->   true

ie, they're not the same, but they are equivalent (in their natural order, 
which is subset).


>>* On sets: x <= y means x is a subset of y.
>
>>Lists
>>
>>I think this issue breaks my attempt to unify lists and multisets.  Lists
>
>Does it make it more than just theoretically bad?  Specifically, if lists 
>are "multi-sets except for comparison" do things actually break?  will 
>people actually be confused?  The answer may very well be yes, but it's 
>worth asking :-)
>
>I any case, it's worth asking for other aggregates whether they should also 
>take order into account.  The unordered version of "<=" on an aggregate 
>might be the variation you need to get through a comparison operator.

All good and urgent questions.


>One other thing for String comparison:  a default ordering for Strings is 
>only useful for sort-based indexing in collections.  Because of locale 
>issues, the details of any default ordering are never the right thing 
>(except for the programmers locale) for human presentation/searching, 
>especially if you take case-sensitivity issues into account.  Therefore, 
>most string tools parameterized with locale.

I disagree.  People use the default order all the time because it's often 
good enough, and they are understandably too lazy to learn about locales and 
localization.  For example, I think "ls" just uses the default order, and 
people don't scream too much.


        Cheers,
        --MarkM

--=====================_172616577==_.ALT
Content-Type: text/html; charset="us-ascii"

At 12:04 PM Saturday 4/7/01, Dean Tribble wrote:
    x op__cmp(y)

should return either an integer or a float64, which therefore must either be
negative, zero, positive, or NaN, indicating that x < y, x <=> y, x > y, or
x is incomparable to y, respectively, in this natural order.  (Note that NaN
responds to all five zero-comparison queries with false.)

This abstraction keeps making me itch.  One of the red flags is the phrase,
"either an integer or a number".

Actually, the phrase is "an integer or a float64".  I would be happy to say
that the result must be a float64, but this is expensive for fully ordered
types.  A looser requirement might say that it returns something that acts
either like either negative, zero, positive or NaN in response to the five
zero comparison messages, and to the message "isNaN()".  But I think I'd
rather be more specific -- I see nothing to be gained by the looser spec.


Another is the pervasive influence of NaN
when I've never actually encountered one in many years of programming.  The
third is the use of float64 at all, when lots of graphics, etc, profitably
use float32.  We'll take up the float question separately.

Just briefly.  We can't not support float64s.  float32s just don't have
enough precision to get by, and float128s are usually overkill and not
widely supported.  I also really don't want to have more than one floating
point type.

High speed floating point code should be written in some lower level
language like C or Java and linked in to E.  The key is that an E list of
float64s be an unboxed "array of double", so this low level code can read
and produce these lists without needless overhead.  If the low level code
wants to operate on arrays of float32s instead, we could accommodate this
without introducing a new type, but I'm inclined to make the low level code
convert at the interface until we find this to be a problem.

See http://www.erights.org/enative/fatpointers.html#boxingData

This is E.  You already have a concept more general than NaN (albeit
inspired by NaN) that seems like it could be used in its place: broken
promises.  Use it!  :-)

As I've already discussed with Dean verbally, we've been over this issue
before on e-lang, and I'm not inclined to revisit it.  It's a difficult
issue, and what we have works.  Does anyone have thread roots of that
previous discussion handy?


Special functions can exist to extract any relevant
NaN magic IEEE details for that handful of programmers that will *ever*
care, but for the rest of us, NaN is just an exception.  This combination
seems like it would simplify the semantics of E that programmers need to pay
attention to, and still be better than just about any other programming
language out there with regard to NaN and IEEE compliance.

Huh?  It would be worse than Java.  In Java scalar floating point
arithmetic, NaNs are incomparable with NaNs, and 0.0 is equivalent to -0.0.


Advantages:
- comparable things are *reflexively* partially ordered.

Partial order of any variety still requires "x op__cmp(y)" to return
something that's not negative, zero, or positive.


Subset-ness

* On Java classes (and by extension perhaps, on declared E types, should we
ever declare such things, as we probably will): x <= y means that x is a
declared subtype of y.  If y is a class, then x is y or a class that extends
a class that's <= y.  If y is an interface, then x is y or x is a class or
interface that extends or implements a class that's <= y.  x < y means that
x is a declared strict subtype of y, that x < y && !(x <=> y).

Given that E's semantics are based on method names, this is a bad idea.  I
could perfectly well have two classes that are subtypes from E's
perspective, but unrelated in Java's subset graph.  There are theories of
subtyping which could make sense of E's semantics, and they are *not*
consistent with the above subtyping rules.  Unless there is a great need for
this, don't do it.  (Though feel free to make an unrelated set of methods
for asking the above question of Java classes.)

Agreed.  See recent response to Zooko.


* On regions: x <= y means that the set of members of region x is a subset
of the set of members of region y.

If the ordering is different between two regions in the same coordinate
space, are they different?  If not, it might make sense to consider that
'<=' for aggregates take enumeration order into account.  That would enable
more semantic similarity with Lists, if it could be done.

Regions, not being enumerable, don't need a parameterizable order, so I
didn't define them to have one.  I think this may be different than Xanadu
regions, but I like it better.  A coordinate space may very well still have
orderings associated with it, but these need not be a property of a region.

In any case, if it were, then for two regions r1 and r2 that contained the
same set of positions but with different orders:

    r1 == r2   ->   false
    r1 <=> r2  ->   true

ie, they're not the same, but they are equivalent (in their natural order,
which is subset).


* On sets: x <= y means x is a subset of y.

Lists

I think this issue breaks my attempt to unify lists and multisets.  Lists

Does it make it more than just theoretically bad?  Specifically, if lists
are "multi-sets except for comparison" do things actually break?  will
people actually be confused?  The answer may very well be yes, but it's
worth asking :-)

I any case, it's worth asking for other aggregates whether they should also
take order into account.  The unordered version of "<=" on an aggregate
might be the variation you need to get through a comparison operator.

All good and urgent questions.


One other thing for String comparison:  a default ordering for Strings is
only useful for sort-based indexing in collections.  Because of locale
issues, the details of any default ordering are never the right thing
(except for the programmers locale) for human presentation/searching,
especially if you take case-sensitivity issues into account.  Therefore,
most string tools parameterized with locale.

I disagree.  People use the default order all the time because it's often
good enough, and they are understandably too lazy to learn about locales and
localization.  For example, I think "ls" just uses the default order, and
people don't scream too much.


        Cheers,
        --MarkM
--=====================_172616577==_.ALT--