[e-lang] Joe-E ByteArray stream interface and character encoding/decoding
David Hopwood
david.hopwood at industrial-designers.co.uk
Thu Oct 18 21:27:23 EDT 2007
Adrian Mettler wrote:
> Here are the proposed implementations for stream adapter logic in
> ByteArray and character encoding/decoding routines to and from ASCII and
> UTF-8 in Joe-E. Tyler suggested that I let you take a look at them and
> see if there are any unintended bugs or sources of nondeterminism.
/**
 * Decodes a UTF-8 string. Each byte not corresponding to a UTF-8
 * character decodes to the Unicode replacement character U+FFFD.
 * Note that an initial byte-order mark is not stripped.
 * @param buffer the UTF-8-encoded string to decode
 * @param off where to start decoding
 * @param len how many bytes to decode
 * @return The corresponding string
 * @throws java.lang.IndexOutOfBoundsException
 */
static public String decode(byte[] buffer, int off, int len) {
    return charset.decode(ByteBuffer.wrap(buffer, off, len)).toString();
}
This implementation doesn't quite meet the spec in the comment, since Java's
UTF-8 decoder (at least in Sun's implementation) doesn't produce one U+FFFD
for every byte that is not part of a correct encoding. It seems to scan the
array and produce a single U+FFFD whenever it sees a byte that makes the
encoding invalid, discarding any preceding bytes that would *otherwise* have
been a valid start of an encoded character.
For example,
    Charset.forName("UTF-8").decode(ByteBuffer.wrap(
        new byte[] {(byte)0xEF, (byte)0xFF}, 0, 2)).toString()
returns "\uFFFD", not "\uFFFD\uFFFD" (the 0xEF would be valid as the first
byte of an encoded character, so it is discarded).
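
One way to get the behaviour the comment describes would be to ask the
decoder to report malformed input instead of replacing it, and then emit
one U+FFFD per reported byte by hand. The following is only a sketch under
some assumptions: the class and method names (PerByteUtf8Decode,
decodePerByte) are made up for illustration and are not part of Joe-E, it
uses java.nio.charset.StandardCharsets (Java 7 or later), and how many bytes
the decoder groups into a single malformed sequence is still up to the JDK
implementation.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Joe-E.
class PerByteUtf8Decode {

    /**
     * Decodes UTF-8, emitting one U+FFFD for every byte that the
     * decoder reports as part of a malformed sequence.
     */
    static String decodePerByte(byte[] buffer, int off, int len) {
        ByteBuffer in = ByteBuffer.wrap(buffer, off, len);
        // UTF-8 never yields more chars than input bytes, and we emit
        // exactly one char per malformed byte, so len chars suffice.
        CharBuffer out = CharBuffer.allocate(len);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (!result.isError()) {
                break; // underflow: all input has been consumed
            }
            // The input position is at the start of the bad sequence;
            // replace each of its bytes and skip past them.
            for (int i = 0; i < result.length(); i++) {
                out.put('\uFFFD');
            }
            in.position(in.position() + result.length());
        }
        decoder.flush(out);
        out.flip();
        return out.toString();
    }
}
```

With this sketch, decodePerByte(new byte[] {(byte)0xEF, (byte)0xFF}, 0, 2)
gives "\uFFFD\uFFFD" whether the decoder reports the two bytes as one
malformed sequence of length 2 or as two sequences of length 1.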
--
David Hopwood <david.hopwood at industrial-designers.co.uk>