2014-03-22

Strings, characters, bytes and character sets: clearing up the confusion

Introduction

On every forum dedicated to Java development, many of the problems reported by inexperienced Java developers, and in some cases by seasoned ones too, relate to text handling: text being corrupted in files, strings not matching the expected content, displays unable to render some text or rendering it incorrectly, and so on. You name it.

All of these problems stem from a misunderstanding of the fundamental concepts behind what "text" actually is: how you see it on your display, how Java represents it, and how you store it in files/send it over the network (or read it from files/receive it over the network).

This page will try to sum up what you need to know in order to avoid making errors, or at least narrow the problem space.

Some definitions

Unicode and your display

Unicode is quite a complex piece of machinery, but ultimately, you can view it as a dictionary of graphemes covering nearly all human written languages. Each grapheme is a combination of one or more code points (a code point is written like this: U+0013). And not all graphemes are letters, spaces or punctuation marks: Unicode also defines code points for smileys, for instance.
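
To make the "one or more code points" part concrete, here is a minimal sketch; the class and method names are made up for illustration. It writes the same grapheme, é, once as a single code point and once as a letter followed by a combining accent:

```java
final class GraphemeDemo
{
    // "é" as a single, precomposed code point: U+00E9
    static final String PRECOMPOSED = "\u00e9";
    // "é" as two code points: U+0065 (e) followed by U+0301 (combining acute accent)
    static final String DECOMPOSED = "e\u0301";

    // Number of Unicode code points in the string
    static int codePoints(final String s)
    {
        return s.codePointCount(0, s.length());
    }

    public static void main(final String... args)
    {
        System.out.println(codePoints(PRECOMPOSED)); // 1
        System.out.println(codePoints(DECOMPOSED)); // 2
        // Both should render as the same grapheme, yet the strings differ
        System.out.println(PRECOMPOSED.equals(DECOMPOSED)); // false
    }
}
```

This is one reason why two strings which look identical on your display may not be equal.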

Unicode is not set in stone. Different revisions are regularly published, which add new graphemes and therefore new code points. Now, how does this relate to your display? Well, your display may, or may not, have the ability to display this or that grapheme.

Unicode also defines character encodings. A character encoding is a way for computing devices to translate a Unicode code point into a sequence of one or more bytes. UTF-8 is such a character encoding (UTF, in UTF-8, means Unicode Transformation Format).
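
As a rough illustration of "one or more bytes", the sketch below (class and method names are hypothetical) counts how many bytes UTF-8 produces for a few code points:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths
{
    // Number of bytes UTF-8 needs to encode the given code point
    static int utf8Length(final int codePoint)
    {
        final String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(final String... args)
    {
        System.out.println(utf8Length(0x41));    // U+0041 (A): 1
        System.out.println(utf8Length(0xE9));    // U+00E9 (é): 2
        System.out.println(utf8Length(0x20AC));  // U+20AC (euro sign): 3
        System.out.println(utf8Length(0x1F600)); // U+1F600 (a smiley): 4
    }
}
```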

byte and char

A byte is the basic computer storage unit: 8 bits. The fact that Java's byte type is signed is irrelevant to this discussion. A byte is 8 bits and that is all there is to it. You don't need to know more ;)

On to char. And first things first: a char is NOT two bytes. It is only two bytes "storage wise". A char is an individual code unit in the UTF-16 character encoding.
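
A small sketch of what "code unit" means in practice (the class name is made up for illustration): a single code point outside the Basic Multilingual Plane needs two chars, so the number of chars in a String is not the number of code points.

```java
final class CodeUnits
{
    // One single code point, U+1F600, written as its two UTF-16 code units
    static final String SMILEY = "\uD83D\uDE00";

    public static void main(final String... args)
    {
        // length() counts chars, that is, UTF-16 code units
        System.out.println(SMILEY.length()); // 2
        // codePointCount() counts Unicode code points
        System.out.println(SMILEY.codePointCount(0, SMILEY.length())); // 1
        System.out.printf("U+%X%n", SMILEY.codePointAt(0)); // U+1F600
    }
}
```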

At run time, all Strings in Java are sequences of chars. This includes the string literals you have in your source files, whatever the character encoding of your source files. Which nicely leads to...

Charset, encoding and decoding

When you want to write text, whether that be to a file or a socket, what you initially have is some char sequence. But what you will effectively write are bytes. Similarly, when you read text, you don't read chars, you read bytes.

Which means you need two processes:

  • turning a sequence of chars into a sequence of bytes: this process is known as encoding;
  • turning a sequence of bytes into a sequence of chars: this process is known as decoding.

These two processes depend on what charset you use; a charset is essentially a byte<->char mapping, but not a one-to-one mapping: one char can give multiple bytes, and some byte sequences can be decoded into multiple chars.
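
To see how much the result depends on the charset, here is a minimal sketch (the class name and the hex() helper are hypothetical) which encodes the same one-char string with three charsets, then decodes a four-byte sequence into two chars:

```java
import java.nio.charset.StandardCharsets;

final class CharsetMapping
{
    // Helper for display only: renders bytes as space-separated hex pairs
    static String hex(final byte[] bytes)
    {
        final StringBuilder sb = new StringBuilder();
        for (final byte b : bytes) {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(final String... args)
    {
        final String text = "\u00e9"; // é: one char

        // Same char, three charsets, three different byte sequences
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));      // C3 A9
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE)));   // 00 E9

        // Four bytes which, in UTF-8, decode into two chars (one code point)
        final byte[] bytes = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length()); // 2
    }
}
```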

Fundamental Java classes

In Java, you have one class for each of a charset, an encoder and a decoder: Charset, CharsetEncoder and CharsetDecoder, all in the java.nio.charset package.
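
As a rough sketch of how Charset, CharsetEncoder and CharsetDecoder (all from java.nio.charset) fit together, the hypothetical roundTrip() method below encodes a string and decodes it back. It relies on the fact that a freshly created encoder reports errors instead of silently replacing characters:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CoderClasses
{
    // Encode the input with the given charset, then decode the bytes back
    static String roundTrip(final String input, final Charset charset)
        throws CharacterCodingException
    {
        final CharsetEncoder encoder = charset.newEncoder();
        final CharsetDecoder decoder = charset.newDecoder();
        final ByteBuffer bytes = encoder.encode(CharBuffer.wrap(input));
        return decoder.decode(bytes).toString();
    }

    public static void main(final String... args)
        throws CharacterCodingException
    {
        final String text = "M\u00e9m\u00e9"; // Mémé

        // UTF-8 can encode any code point: the round trip succeeds
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true

        // US-ASCII has no mapping for é: the encoder throws
        try {
            roundTrip(text, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("US-ASCII cannot encode this text: " + e);
        }
    }
}
```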

You also have a static method for encoding a single code point into its equivalent UTF-16 char sequence: Character.toChars().
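
A short sketch of Character.toChars() at work; the class name and the utf16Units() helper are made up for illustration:

```java
final class ToCharsDemo
{
    // Renders the UTF-16 code units of a code point as hex, e.g. "D83D DE00"
    static String utf16Units(final int codePoint)
    {
        final StringBuilder sb = new StringBuilder();
        for (final char c : Character.toChars(codePoint)) {
            if (sb.length() > 0)
                sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(final String... args)
    {
        // A code point in the Basic Multilingual Plane gives one char
        System.out.println(utf16Units(0xE9)); // 00E9
        // A code point beyond it gives two chars (a surrogate pair)
        System.out.println(utf16Units(0x1F600)); // D83D DE00
    }
}
```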

InputStream/OutputStream vs Reader/Writer

Again, the byte vs char distinction applies here. If you want to read bytes, use an InputStream; if you want to read chars, use a Reader. When not reading from a char source, a Reader will try its best to decode the input bytes into chars and what you will get is the result of the Reader's efforts.

In the same vein, for writing bytes, use an OutputStream, and use a Writer if you want to write chars. If your output is a byte sink, the Writer will encode the chars you submit into bytes and write those bytes.
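
Here is a minimal sketch of a Writer sitting on top of a byte sink and a Reader sitting on top of a byte source, using in-memory streams so that it runs anywhere. The class and method names are hypothetical; the charset is passed explicitly in both directions:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderWriterDemo
{
    // The Writer encodes the chars and hands the bytes to the OutputStream
    static byte[] write(final String text, final Charset charset)
        throws IOException
    {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    // The Reader pulls bytes from the InputStream and decodes them
    static String read(final byte[] bytes, final Charset charset)
        throws IOException
    {
        final StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
            new ByteArrayInputStream(bytes), charset)) {
            final char[] buf = new char[1024];
            int nrChars;
            while ((nrChars = reader.read(buf)) != -1)
                sb.append(buf, 0, nrChars);
        }
        return sb.toString();
    }

    public static void main(final String... args)
        throws IOException
    {
        final String text = "M\u00e9m\u00e9"; // Mémé: 4 chars
        final byte[] bytes = write(text, StandardCharsets.UTF_8);
        System.out.println(bytes.length); // 6
        System.out.println(read(bytes, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```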

We can make a crude drawing of what happens when you send text data over the network to a peer, for instance:

        SENDER            |          RECEIVER
        encodes           |           decodes
char[] --------> byte[] ----> byte[] --------> char[]

Now, in this crude drawing, imagine that the sender encodes using one charset but the receiver decodes using another...
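
You do not even need a network to see the result. In the sketch below (class and method names are made up for illustration), the "sender" encodes in UTF-8 and the "receiver" decodes in ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

final class CharsetMismatch
{
    // Sender encodes with UTF-8, receiver wrongly decodes with ISO-8859-1
    static String misdecode(final String text)
    {
        final byte[] sent = text.getBytes(StandardCharsets.UTF_8);
        return new String(sent, StandardCharsets.ISO_8859_1);
    }

    public static void main(final String... args)
    {
        final String original = "M\u00e9m\u00e9"; // Mémé
        final String received = misdecode(original);

        // Each é was sent as two bytes (C3 A9); the receiver turns each
        // byte into one char, so 4 chars become 6
        System.out.println(original.length()); // 4
        System.out.println(received.length()); // 6
        System.out.println(received.equals("M\u00c3\u00a9m\u00c3\u00a9")); // true
    }
}
```

No exception is thrown anywhere: the receiver simply ends up with the wrong text.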

Avoiding such a case is simple enough: always specify the charset when using a Reader or a Writer. Otherwise, you will have problems. Eventually. You may not have had problems until now, but you will. This is a guarantee.

Illustration: what not to do

Relying on the default charset

By far the most common thing that can go wrong is code like this:

    // DON'T DO THAT -- no charset is specified
    new FileReader("somefile.txt"); // or new FileWriter()
    // DON'T DO THAT -- no charset is specified
    someString.getBytes();

The charset to use is not specified here, which means the default charset is used; and this default charset depends on your JRE/OS environment.
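
If you are curious about what your own environment uses, a minimal sketch such as this one will tell you (the class and method names are hypothetical). Its output is deliberately not shown since it varies from one machine to another; note that JDK 18 and later default to UTF-8 unless configured otherwise:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultCharsetDemo
{
    // True if getBytes() gives the same bytes as encoding with the default charset
    static boolean usesDefaultCharset(final String s)
    {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(final String... args)
    {
        // Depends on the JRE/OS combination running this program
        System.out.println("Default charset: " + Charset.defaultCharset());
        // The no-argument getBytes() silently relies on that default
        System.out.println(usesDefaultCharset("some text"));
    }
}
```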

Say you write a file on a JRE/OS combination whose default charset is ISO-8859-15, then send this file to a peer whose JRE/OS combination uses UTF-8. Your peer won't be able to read the file correctly...

Avoiding that is simple. Apply the rule above. And if you use Java 7 or greater, use Files instead. As in:

    // This method requires that you specify the charset...
    Files.newBufferedReader(Paths.get("somefile.txt"), 
        StandardCharsets.UTF_8);
    someString.getBytes(StandardCharsets.UTF_8);

Using Strings for binary data

The mapping of chars to bytes means that only certain sequences of bytes will be generated by an encoder; and similarly, only these byte sequences will be readable as chars.

The following program shows that. It creates a byte array which cannot be fully decoded in UTF-8; it also demonstrates the default behaviour of String when it meets a malformed byte sequence:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Main
{
    /*
     * This lone byte is not a valid UTF-8 sequence: a UTF-8 decoder
     * sees it as malformed input
     */
    private static final byte POISON = (byte) 0xfc;

    public static void main(final String... args)
    {
        final Charset charset = StandardCharsets.UTF_8;

        /*
         * We create three decoders. Their behaviour only differs
         * by the way they treat malformed byte sequences: the
         * first one will ignore errors, the second one will
         * replace the malformed bytes with a default value, and
         * the third one will throw an exception.
         */
        final CharsetDecoder lossy = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.IGNORE);

        final CharsetDecoder lenient = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE);

        final CharsetDecoder assTight = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT);

        /*
         * The string we are testing against
         */
        final String original = "Mémé dans les orties";

        /*
         * Encode the test string into a byte array; then allocate a
         * buffer whose length is that of the encoded array plus 1. 
         * At the end of this buffer, add our poison.
         */
        final byte[] encoded = original.getBytes(charset);
        final ByteBuffer buf
            = ByteBuffer.allocate(encoded.length + 1);
        buf.put(encoded).put(POISON);

        /*
         * For reference, let us print the length of our poisoned 
         * array.
         */
        System.out.printf("Original byte array has length %d\n",
            buf.array().length);

        /*
         * Now, attempt to build a string again from our poisoned 
         * byte input. First by invoking the appropriate String 
         * constructor (note that we specify the charset), then 
         * by trying each of the three decoders we have
         * initialized above.
         */

        System.out.println("--- DECODING TESTS ---");
        
        final String decoded = new String(buf.array(), charset);
        System.out.printf("String constructor: %s\n", decoded);
        
        tryDecoder(lossy, "lossy", buf);
        tryDecoder(lenient, "lenient", buf);
        tryDecoder(assTight, "assTight", buf);
        
        System.out.println("--- END DECODING TESTS ---");

        /*
         * Now try and regenerate our original byte array. 
         * And weep.
         */
        System.out.printf("Reencoded byte array length: %d\n",
            decoded.getBytes(charset).length);

    }

    private static void tryDecoder(final CharsetDecoder decoder,
        final String name, final ByteBuffer buf)
    {
        buf.rewind();
        try {
            System.out.printf("%s decoder: %s\n", name, 
                decoder.decode(buf));
        } catch (CharacterCodingException e) {
            System.out.printf("%s FAILED! Exception follows...\n",
                name);
            e.printStackTrace(System.out);
        }
    }
}

And the output is...

Original byte array has length 23
--- DECODING TESTS ---
String constructor: Mémé dans les orties�
lossy decoder: Mémé dans les orties
lenient decoder: Mémé dans les orties�
assTight FAILED! Exception follows...
java.nio.charset.MalformedInputException: Input length = 1
 at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
 at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:816)
 at com.github.fge.Main.tryDecoder(Main.java:87)
 at com.github.fge.Main.main(Main.java:70)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
--- END DECODING TESTS ---
Reencoded byte array length: 25

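As you can see, the default behaviour of the String constructor is to replace malformed byte sequences with a default replacement character (U+FFFD, usually rendered as a big, fat question mark). This is also why the reencoded byte array has length 25 instead of 23: the single poison byte comes back as the three bytes which encode U+FFFD in UTF-8. You cannot change that default behaviour, but what you can do, as this program also illustrates, is attempt the decoding operation yourself with a CharsetDecoder.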
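
If you want a failure rather than a silent replacement, one possible approach is a small helper like the one sketched below. The class and method names are hypothetical; only the CharsetDecoder calls come from the JDK:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode
{
    // Decodes the bytes, or throws if they are not valid in this charset
    static String decodeStrictly(final byte[] bytes, final Charset charset)
        throws CharacterCodingException
    {
        return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    public static void main(final String... args)
    {
        final byte[] valid = "M\u00e9m\u00e9".getBytes(StandardCharsets.UTF_8);
        final byte[] poisoned = { 'a', 'b', (byte) 0xfc };

        try {
            System.out.println(decodeStrictly(valid, StandardCharsets.UTF_8));
            // The next call throws: 0xfc is malformed input in UTF-8
            System.out.println(decodeStrictly(poisoned, StandardCharsets.UTF_8));
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```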

That's all folks...
