Introduction
On every forum dedicated to Java development, many of the problems reported by inexperienced Java developers, and sometimes by seasoned ones too, relate to text handling: text being corrupted in files, strings not matching the expected content, displays unable to render some text or rendering it incorrectly, and so on. You name it.
All of these problems stem from a misunderstanding of the fundamental concepts behind what "text" actually is: how you see it on your display, how Java represents it, and how you store it in files/send it over the network (or read it from files/receive it over the network).
This page will try to sum up what you need to know in order to avoid making errors, or at least narrow the problem space.
Some definitions
Unicode and your display
Unicode is quite a complex piece of machinery, but ultimately, you can view it as a dictionary of graphemes covering nearly all human written language. Each grapheme is a combination of one or more code points (such as U+0013). And not all graphemes are letters, spaces or punctuation marks: Unicode also defines code points for smileys, for instance.
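As a small illustration of a grapheme being made of one or more code points, here is a minimal sketch. The class name GraphemeDemo and its helper are made up for this example; the two strings both display as "é", one written as a single code point and the other as a base letter followed by a combining accent.

```java
final class GraphemeDemo {
    // "é" as a single precomposed code point, U+00E9
    static final String PRECOMPOSED = "\u00E9";
    // "é" as "e" (U+0065) followed by a combining acute accent (U+0301)
    static final String COMBINED = "e\u0301";

    // Number of Unicode code points in the string (hypothetical helper)
    static int codePoints(final String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(final String... args) {
        System.out.println(codePoints(PRECOMPOSED)); // 1
        System.out.println(codePoints(COMBINED));    // 2
        // Same grapheme on screen, yet the two strings are not equal
        System.out.println(PRECOMPOSED.equals(COMBINED)); // false
    }
}
```

This is one way to end up with "strings not matching the expected content" even though they look identical on your display.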
Unicode is not set in stone. Different revisions are regularly published, which add new graphemes and therefore new code points. Now, how does this relate to your display? Well, your display may, or may not, have the ability to display this or that grapheme.
What Unicode also defines is character encodings. A character encoding is a way for computing devices to translate a Unicode code point into a sequence of one or more bytes. UTF-8 is such a character encoding (UTF, in UTF-8, means Unicode Transformation Format).
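To make "one or more bytes" concrete, the sketch below asks UTF-8 how many bytes it needs for a few code points. The class and method names are invented for this example; the code points chosen are U+0041 (A), U+00E9 (é), U+20AC (the euro sign) and U+1F600 (a smiley).

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
    // Number of bytes UTF-8 uses to encode a single code point
    // (hypothetical helper, not part of the JDK)
    static int utf8Length(final int codePoint) {
        final String s = new StringBuilder().appendCodePoint(codePoint).toString();
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(final String... args) {
        System.out.println(utf8Length(0x41));    // 1
        System.out.println(utf8Length(0xE9));    // 2
        System.out.println(utf8Length(0x20AC));  // 3
        System.out.println(utf8Length(0x1F600)); // 4
    }
}
```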
byte and char
A byte is the basic computer storage unit: 8 bits. The fact that byte is signed is irrelevant to this discussion. A byte is 8 bits and that is all there is to it. You don't need to know more ;)
On to char. And first things first: a char is NOT two bytes. It is only two bytes "storage wise". A char is an individual code unit in the UTF-16 character encoding.
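A short sketch of what "code unit" means in practice, assuming a code point outside the Basic Multilingual Plane (U+1F600 here): UTF-16 needs two code units for it, so the string holds two chars for one code point. The class name is made up for the example.

```java
final class CodeUnitDemo {
    // U+1F600, written as the two UTF-16 code units (a surrogate pair)
    // that represent it
    static final String SMILEY = "\uD83D\uDE00";

    public static void main(final String... args) {
        // Two chars...
        System.out.println(SMILEY.length()); // 2
        // ...but only one code point
        System.out.println(SMILEY.codePointCount(0, SMILEY.length())); // 1
        System.out.println(Integer.toHexString(SMILEY.codePointAt(0))); // 1f600
        // Neither char means anything on its own
        System.out.println(Character.isHighSurrogate(SMILEY.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(SMILEY.charAt(1)));  // true
    }
}
```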
At run time, all Strings in Java are sequences of chars. This includes the string literals you have in your source files, whatever the character encoding of your source files. Which nicely leads to...
Charset, encoding and decoding
When you want to write text, whether that be to a file or a socket, what you initially have is some char sequence. But what you will effectively write are bytes. Similarly, when you read text, you don't read chars, you read bytes.
Which means you need two processes:
- turning a sequence of chars into a sequence of bytes: this process is known as encoding;
- turning a sequence of bytes into a sequence of chars: this process is known as decoding.
These two processes depend on what charset you use; a charset is essentially a byte<->char mapping, but not a one-to-one mapping: one char can give multiple bytes, and some byte sequences can be decoded into multiple chars.
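The sketch below illustrates both directions using String's own methods and two charsets that every JRE provides (UTF-8 and ISO-8859-1). The class and its encode/decode helpers are hypothetical names for this example only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeDecodeDemo {
    // Encoding: chars -> bytes
    static byte[] encode(final String text, final Charset charset) {
        return text.getBytes(charset);
    }

    // Decoding: bytes -> chars
    static String decode(final byte[] bytes, final Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(final String... args) {
        final String eAcute = "\u00E9"; // one char
        // The same char gives a different number of bytes per charset
        System.out.println(encode(eAcute, StandardCharsets.UTF_8).length);      // 2
        System.out.println(encode(eAcute, StandardCharsets.ISO_8859_1).length); // 1

        // One 4-byte UTF-8 sequence decodes into two chars
        final byte[] smiley = encode("\uD83D\uDE00", StandardCharsets.UTF_8);
        System.out.println(smiley.length);                                    // 4
        System.out.println(decode(smiley, StandardCharsets.UTF_8).length()); // 2
    }
}
```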
Fundamental Java classes
In Java, there is one class for each of these three concepts:
- a charset is represented by the Charset class;
- an encoder is represented by a CharsetEncoder;
- a decoder is represented by a CharsetDecoder.
You also have a static method for encoding one single code point into its equivalent UTF-16 char sequence: Character.toChars().
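Here is a minimal sketch of how these classes fit together, assuming you obtain the encoder and decoder from the Charset itself. The class CoderDemo and its two helpers are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CoderDemo {
    // Encode, then decode, using explicit encoder and decoder objects
    static String roundTrip(final String text, final Charset charset) {
        try {
            final ByteBuffer bytes
                = charset.newEncoder().encode(CharBuffer.wrap(text));
            return charset.newDecoder().decode(bytes).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("cannot round-trip with "
                + charset, e);
        }
    }

    // True if the charset's encoder can encode every char of the text
    static boolean canEncode(final String text, final Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(final String... args) {
        final String text = "M\u00E9m\u00E9";
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(canEncode(text, StandardCharsets.US_ASCII));           // false
        // One code point, two chars
        System.out.println(Character.toChars(0x1F600).length); // 2
    }
}
```

Note that, unlike String.getBytes(), a freshly created encoder or decoder reports errors by throwing an exception rather than silently substituting something.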
InputStream/OutputStream vs Reader/Writer
The same byte vs char distinction applies here. If you want to read bytes, use an InputStream; if you want to read chars, use a Reader. When not reading from a char source, a Reader will try its best to decode the input bytes into chars and what you will get is the result of the Reader's efforts.
In the same vein, for writing bytes, use an OutputStream, and use a Writer if you want to write chars. If your output is a byte sink, the Writer will encode the chars you submit into bytes and write those bytes.
We can make a crude drawing of what happens when you send text data over the network to a peer, for instance:
            SENDER            |           RECEIVER
            encodes           |           decodes
    char[] --------> byte[] ----> byte[] --------> char[]
Now, in this crude drawing, imagine that the sender encodes using one charset but the receiver decodes using another...
Avoiding such a case is simple enough: always specify the charset when using a Reader or a Writer. Otherwise, you will have problems. Eventually. You may not have had problems until now, but you will. This is a guarantee.
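The scenario in the drawing can be simulated in memory. In this sketch the "network" is just a byte array, and the send/receive helper names are made up for the example; the Writer and Reader are always given an explicit charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderWriterDemo {
    // Sender side: the Writer encodes chars into bytes
    static byte[] send(final String text, final Charset charset) {
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Receiver side: the Reader decodes bytes into chars
    static String receive(final byte[] bytes, final Charset charset) {
        final StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
            new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1)
                sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(final String... args) {
        final String text = "M\u00E9m\u00E9";
        final byte[] wire = send(text, StandardCharsets.UTF_8);
        // Same charset on both sides: the text survives
        System.out.println(receive(wire, StandardCharsets.UTF_8).equals(text));      // true
        // Different charset on the receiving side: it does not
        System.out.println(receive(wire, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```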
Illustration: what not to do
Relying on the default charset
By far the most common thing that can go wrong is code like this:
    // DON'T DO THAT -- no charset is specified
    new FileReader("somefile.txt"); // or new FileWriter()
    // DON'T DO THAT -- no charset is specified
    someString.getBytes();
The charset to use is not specified here, which means the default charset is used; and this default charset depends on your JRE/OS environment.
Say you write a file using an implementation using ISO-8859-15 as the default charset; you send this file to a peer whose JRE/OS combination uses UTF-8. Your peer won't be able to read the file correctly...
Avoiding that is simple. Apply the rule above. And if you use Java 7 or greater, use Files instead. As in:
    // This method requires that you specify the charset...
    Files.newBufferedReader(Paths.get("somefile.txt"), StandardCharsets.UTF_8);
    someString.getBytes(StandardCharsets.UTF_8);
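For completeness, here is a possible round trip through a real file using Files for both writing and reading, with the charset spelled out on each side. This is only a sketch: the class name, the helper and the temporary file prefix are invented for the example, and it assumes the text fits on a single line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FilesDemo {
    // Write one line of text to a temporary file, then read it back,
    // specifying UTF-8 explicitly for both operations
    static String writeThenRead(final String line) {
        try {
            final Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                try (BufferedWriter writer
                    = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(line);
                }
                try (BufferedReader reader
                    = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String... args) {
        final String text = "M\u00E9m\u00E9 dans les orties";
        System.out.println(writeThenRead(text).equals(text)); // true
    }
}
```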
Using Strings for binary data
The mapping of chars to bytes means that only certain sequences of bytes will be generated by an encoder; and similarly, only these byte sequences will be readable as chars.
The following program shows that. It creates a byte array which cannot be fully decoded in UTF-8; it also demonstrates the default behaviour of String in the event of an unmappable sequence:
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public final class Main
    {
        /*
         * This byte is unmappable by the UTF-8 encoding
         */
        private static final byte POISON = (byte) 0xfc;

        public static void main(final String... args)
        {
            final Charset charset = StandardCharsets.UTF_8;

            /*
             * We create three decoders. Their behaviour only differs
             * by the way they treat unmappable byte sequences: the
             * first one will ignore errors, the second one will
             * replace the unmappable bytes with a default value, and
             * the third one will throw an exception.
             */
            final CharsetDecoder lossy = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE);
            final CharsetDecoder lenient = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE);
            final CharsetDecoder assTight = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);

            /*
             * The string we are testing against
             */
            final String original = "Mémé dans les orties";

            /*
             * Encode the test string into a byte array; then allocate a
             * buffer whose length is that of the encoded array plus 1.
             * At the end of this buffer, add our poison.
             */
            final byte[] encoded = original.getBytes(charset);
            final ByteBuffer buf = ByteBuffer.allocate(encoded.length + 1);
            buf.put(encoded).put(POISON);

            /*
             * For reference, let us print the length of our poisoned
             * array.
             */
            System.out.printf("Original byte array has length %d\n",
                buf.array().length);

            /*
             * Now, attempt to build a string again from our poisoned
             * byte input. First by invoking the appropriate String
             * constructor (note that we specify the charset), then
             * by trying each of the three decoders we have
             * initialized above.
             */
            System.out.println("--- DECODING TESTS ---");
            final String decoded = new String(buf.array(), charset);
            System.out.printf("String constructor: %s\n", decoded);
            tryDecoder(lossy, "lossy", buf);
            tryDecoder(lenient, "lenient", buf);
            tryDecoder(assTight, "assTight", buf);
            System.out.println("--- END DECODING TESTS ---");

            /*
             * Now try and regenerate our original byte array.
             * And weep.
             */
            System.out.printf("Reencoded byte array length: %d\n",
                decoded.getBytes(charset).length);
        }

        private static void tryDecoder(final CharsetDecoder decoder,
            final String name, final ByteBuffer buf)
        {
            buf.rewind();
            try {
                System.out.printf("%s decoder: %s\n", name,
                    decoder.decode(buf));
            } catch (CharacterCodingException e) {
                System.out.printf("%s FAILED! Exception follows...\n", name);
                e.printStackTrace(System.out);
            }
        }
    }
And the output is...
    Original byte array has length 23
    --- DECODING TESTS ---
    String constructor: Mémé dans les orties�
    lossy decoder: Mémé dans les orties
    lenient decoder: Mémé dans les orties�
    assTight FAILED! Exception follows...
    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
        at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:816)
        at com.github.fge.Main.tryDecoder(Main.java:87)
        at com.github.fge.Main.main(Main.java:70)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
    --- END DECODING TESTS ---
    Reencoded byte array length: 25
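As you can see, the default behaviour of the String constructor is to replace byte sequences it cannot decode with a replacement character (U+FFFD, usually displayed as a big, fat question mark). That single character encodes to three bytes in UTF-8, which is why the reencoded array is 25 bytes long instead of 23: the original bytes are lost. You cannot change that default behaviour of String, but as this program also illustrates, you can perform the decoding operation yourself with a CharsetDecoder and choose how errors are handled.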
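If you would rather find out early that some bytes are not valid text, a helper along the following lines may be useful. It is a sketch only: the class and method names are hypothetical, and it simply wraps the REPORT behaviour shown in the program above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StrictDecodeDemo {
    // Decode, throwing instead of silently replacing bad input
    static String decodeStrict(final byte[] bytes, final Charset charset)
        throws CharacterCodingException {
        return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    // True if the bytes are a valid encoding of some text in this charset
    static boolean isValid(final byte[] bytes, final Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if bytes -> String -> bytes gives back the same bytes
    static boolean survivesStringRoundTrip(final byte[] bytes,
        final Charset charset) {
        return Arrays.equals(bytes,
            new String(bytes, charset).getBytes(charset));
    }

    public static void main(final String... args) {
        final byte[] poison = { (byte) 0xfc };
        System.out.println(isValid(poison, StandardCharsets.UTF_8));                 // false
        System.out.println(survivesStringRoundTrip(poison, StandardCharsets.UTF_8)); // false
    }
}
```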