[e-lang] Joe-E ByteArray stream interface and character encoding/decoding

David Hopwood david.hopwood at industrial-designers.co.uk
Thu Oct 18 21:27:23 EDT 2007


Adrian Mettler wrote:
> Here are the proposed implementations for stream adapter logic in
> ByteArray and character encoding/decoding routines to and from ASCII and
> UTF-8 in Joe-E.  Tyler suggested that I let you take a look at them and
> see if there are any unintended bugs or sources of nondeterminism.

    /**
     * Decodes a UTF-8 string. Each byte not corresponding to a UTF-8
     * character decodes to the Unicode replacement character U+FFFD.
     * Note that an initial byte-order mark is not stripped.
     * @param buffer    the UTF-8-encoded string to decode
     * @param off       where to start decoding
     * @param len       how many bytes to decode
     * @return The corresponding string
     * @throws java.lang.IndexOutOfBoundsException
     */
    static public String decode(byte[] buffer, int off, int len) {
        return charset.decode(ByteBuffer.wrap(buffer, off, len)).toString();
    }

This implementation doesn't quite meet the spec in the comment, since Java's
UTF-8 decoder (at least in Sun's implementation) doesn't produce one U+FFFD
for every byte that is not part of a correct encoding. Instead, it seems to
scan the array and produce a single U+FFFD each time it reaches a byte that
makes the current sequence invalid, discarding any preceding bytes that
would *otherwise* have been the start of a valid encoding.

For example,

  Charset.forName("UTF-8").decode(ByteBuffer.wrap(
    new byte[] {(byte)0xEF, (byte)0xFF}, 0, 2)).toString()

returns "\uFFFD", not "\uFFFD\uFFFD" (the 0xEF would be valid as the first
byte of an encoded character, so it is discarded).
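
If it helps when checking this, here is a minimal sketch (not a proposed
replacement for the Joe-E code) that prints how many replacement characters
the lenient decoder actually emits, next to a strict decode that rejects
malformed input instead of substituting for it. The class name
Utf8DecodeCheck and the method names decodeReplacing and decodeStrict are
made up for illustration; the only library calls are Charset.decode and
CharsetDecoder with CodingErrorAction.REPORT. I would expect the number of
U+FFFD characters to depend on the decoder implementation, so the sketch
prints the count rather than assuming it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
class Utf8DecodeCheck {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Lenient decode, as in the quoted code: Charset.decode replaces
    // malformed input with U+FFFD. How many replacement characters come
    // out for a given bad sequence is up to the decoder implementation.
    static String decodeReplacing(byte[] buffer, int off, int len) {
        return UTF8.decode(ByteBuffer.wrap(buffer, off, len)).toString();
    }

    // Strict decode: report malformed input as an exception, so the
    // result does not depend on any replacement policy.
    static String decodeStrict(byte[] buffer, int off, int len)
            throws CharacterCodingException {
        CharsetDecoder decoder = UTF8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(buffer, off, len)).toString();
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0xEF, (byte) 0xFF};

        String lenient = decodeReplacing(bad, 0, bad.length);
        System.out.println("replacement characters emitted: "
            + lenient.length());

        try {
            decodeStrict(bad, 0, bad.length);
            System.out.println("strict decode unexpectedly succeeded");
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected the input: " + e);
        }
    }
}
```

If deterministic output matters more than tolerating bad input, the strict
variant avoids the question of how many U+FFFD characters appear, at the
cost of throwing on any malformed sequence.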

-- 
David Hopwood <david.hopwood at industrial-designers.co.uk>
