2014-03-22

Strings, characters, bytes and character sets: clearing up the confusion

Introduction

On all forums dedicated to Java development, many of the problems reported by novice Java developers, and sometimes by seasoned ones too, relate to text handling: text being corrupted in files, strings not matching the expected content, displays unable to render some text or rendering it incorrectly, etc. You name it.

All of these problems stem from a misunderstanding of the fundamental concepts behind what "text" actually is: how you see it on your display, how Java represents it, and how you store it in files/send it over the network (or read it from files/receive it over the network).

This page will try to sum up what you need to know in order to avoid making errors, or at least narrow the problem space.

Some definitions

Unicode and your display

Unicode is quite a complex piece of machinery, but ultimately, you can view it as a dictionary of graphemes covering nearly all human written language. Each grapheme is a combination of one, or more, code points (such as U+0013). And not all graphemes are letters, spaces or punctuation marks: Unicode also defines code points for smileys, for instance.

Unicode is not set in stone. Different revisions are regularly published, which add new graphemes and therefore new code points. Now, how does this relate to your display? Well, your display may, or may not, have the ability to display this or that grapheme.

What Unicode also defines is character encodings. A character encoding is a way for computing devices to translate a Unicode code point into a sequence of one or more bytes. UTF-8 is such a character encoding (UTF, in UTF-8, means Unicode Transformation Format).

byte and char

A byte is the basic computer storage unit: 8 bits. The fact that byte is signed is irrelevant to this discussion. A byte is 8 bits and that is all there is to it. You don't need to know more ;)

On to char. And first things first: a char is NOT two bytes. It is only two bytes "storage wise". A char is an individual code unit in the UTF-16 character encoding.

At run time, all Strings in Java are sequences of chars. This includes the string literals in your source files, whatever the character encoding of those files. Which nicely leads to...

Charset, encoding and decoding

When you want to write text, whether that be to a file or a socket, what you initially have is some char sequence. But what you will effectively write are bytes. Similarly, when you read text, you don't read chars, you read bytes.

Which means you need two processes:

  • turning a sequence of chars into a sequence of bytes: this process is known as encoding;
  • turning a sequence of bytes into a sequence of chars: this process is known as decoding.

These two processes depend on what charset you use; a charset is essentially a byte<->char mapping, but not a one-to-one mapping: one char can give multiple bytes, and some byte sequences can be decoded into multiple chars.
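To make this mapping concrete, here is a minimal sketch (the class and method names are made up for illustration). It encodes the same one-char string with three charsets, then decodes a four-byte UTF-8 sequence:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeDecodeDemo
{
    // Encoding: chars -> bytes; returns how many bytes were produced
    static int encodedLength(final String s, final Charset charset)
    {
        return s.getBytes(charset).length;
    }

    // Decoding: bytes -> chars; returns how many chars were produced
    static int decodedLength(final byte[] bytes, final Charset charset)
    {
        return new String(bytes, charset).length();
    }

    public static void main(final String... args)
    {
        final String eAcute = "\u00e9"; // é: one single char

        System.out.println(encodedLength(eAcute, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(eAcute, StandardCharsets.UTF_8));
        System.out.println(encodedLength(eAcute, StandardCharsets.UTF_16BE));

        // The UTF-8 encoding of code point U+1F600 (a smiley)
        final byte[] smiley
            = { (byte) 0xf0, (byte) 0x9f, (byte) 0x98, (byte) 0x80 };
        System.out.println(decodedLength(smiley, StandardCharsets.UTF_8));
    }
}
```

With these inputs, the single char gives 1, 2 and 2 bytes respectively, and the four bytes decode into 2 chars.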

Fundamental Java classes

In Java, you have three classes, one for each of a charset, an encoder and a decoder:

  • Charset;
  • CharsetEncoder;
  • CharsetDecoder.

You also have a static method for encoding one single code point into its equivalent UTF-16 char sequence: Character.toChars().
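As a quick sketch of Character.toChars() (the class and helper names below are only for illustration), a code point in the Basic Multilingual Plane needs one char, while a code point outside of it needs two:

```java
final class CodePointDemo
{
    // Number of UTF-16 code units (chars) needed for one code point
    static int charsNeeded(final int codePoint)
    {
        return Character.toChars(codePoint).length;
    }

    public static void main(final String... args)
    {
        // U+00E9 (é) is in the Basic Multilingual Plane
        System.out.println(charsNeeded(0x00e9));
        // U+1F600 (a smiley) is not: it needs a surrogate pair
        System.out.println(charsNeeded(0x1f600));

        final String smiley = new String(Character.toChars(0x1f600));
        // length() counts chars; codePointCount() counts code points
        System.out.println(smiley.length());
        System.out.println(smiley.codePointCount(0, smiley.length()));
    }
}
```

This is why String's length() is not the number of code points, let alone the number of graphemes.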

InputStream/OutputStream vs Reader/Writer

The same byte vs char distinction applies here. If you want to read bytes, use an InputStream; if you want to read chars, use a Reader. When not reading from a char source, a Reader will try its best to decode the input bytes into chars and what you will get is the result of the Reader's efforts.

In the same vein, for writing bytes, use an OutputStream, and use a Writer if you want to write chars. If your output is a byte sink, the Writer will encode the chars you submit into bytes and write those bytes.

We can make a crude drawing of what happens when you send text data over the network to a peer, for instance:

        SENDER            |          RECEIVER
        encodes           |           decodes
char[] --------> byte[] ----> byte[] --------> char[]

Now, in this crude drawing, imagine that the sender encodes using one charset but the receiver decodes using another...

Avoiding such a case is simple enough: always specify the charset when using a Reader or a Writer. Otherwise, you will have problems. Eventually. You may not have had problems until now, but you will. This is a guarantee.
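Here is a minimal sketch of that rule, using in-memory byte streams so that it is self-contained; the class and method names are hypothetical. The Writer and the Reader are both given the charset explicitly:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo
{
    // chars -> bytes: the Writer encodes using the given charset
    static byte[] write(final String text, final Charset charset)
        throws IOException
    {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (final Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    // bytes -> chars: the Reader decodes using the given charset
    static String read(final byte[] bytes, final Charset charset)
        throws IOException
    {
        final StringBuilder sb = new StringBuilder();
        try (final Reader reader = new InputStreamReader(
            new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1)
                sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(final String... args)
        throws IOException
    {
        final String text = "M\u00e9m\u00e9"; // Mémé
        final byte[] bytes = write(text, StandardCharsets.UTF_8);
        System.out.println(bytes.length);
        System.out.println(text.equals(read(bytes, StandardCharsets.UTF_8)));
    }
}
```

The same constructors work on top of a FileInputStream, a FileOutputStream or a socket's streams.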

Illustration: what not to do

Relying on the default charset

By far the most common thing that can go wrong is code like this:

    // DON'T DO THAT -- no charset is specified
    new FileReader("somefile.txt"); // or new FileWriter()
    // DON'T DO THAT -- no charset is specified
    someString.getBytes();

The charset to use is not specified here, which means the default charset is used; and this default charset depends on your JRE/OS environment.

Say you write a file using an implementation using ISO-8859-15 as the default charset; you send this file to a peer whose JRE/OS combination uses UTF-8. Your peer won't be able to read the file correctly...
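You can simulate this scenario without two machines. The sketch below (hypothetical class and method names) encodes with one charset and decodes with another; it assumes that your JRE supports ISO-8859-15, which is very common but is not one of the charsets every JRE is required to provide:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo
{
    // Encodes with the sender's charset, decodes with the receiver's
    static String transmit(final String text, final Charset senderCharset,
        final Charset receiverCharset)
    {
        final byte[] onTheWire = text.getBytes(senderCharset);
        return new String(onTheWire, receiverCharset);
    }

    public static void main(final String... args)
    {
        // Assumption: ISO-8859-15 is available on this JRE
        final Charset latin9 = Charset.forName("ISO-8859-15");
        final Charset utf8 = StandardCharsets.UTF_8;
        final String original = "M\u00e9m\u00e9"; // Mémé

        System.out.println(transmit(original, latin9, utf8));
        System.out.println(transmit(original, utf8, latin9));
        System.out.println(transmit(original, utf8, utf8));
    }
}
```

In the first case, each é byte is malformed UTF-8 and is replaced with U+FFFD; in the second, each é becomes the two chars Ã©; only the third call gives back the original text.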

Avoiding that is simple. Apply the rule above. And if you use Java 7 or greater, use Files instead. As in:

    // This method requires that you specify the charset...
    Files.newBufferedReader(Paths.get("somefile.txt"), 
        StandardCharsets.UTF_8);
    someString.getBytes(StandardCharsets.UTF_8);

Using Strings for binary data

The mapping of chars to bytes means that only certain sequences of bytes will be generated by an encoder; and similarly, only these byte sequences will be readable as chars.

The following program shows that. It creates a byte array which cannot be fully decoded in UTF-8; it also demonstrates the default behaviour of String when it meets a malformed byte sequence:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Main
{
    /*
     * This byte can never appear in well-formed UTF-8
     */
    private static final byte POISON = (byte) 0xfc;

    public static void main(final String... args)
    {
        final Charset charset = StandardCharsets.UTF_8;

        /*
         * We create three decoders. Their behaviour only differs 
         * by the way they treat malformed byte sequences: the
         * first one will ignore errors, the second one will
         * replace the malformed bytes with a default value, and
         * the third one will throw an exception.
         */
        final CharsetDecoder lossy = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.IGNORE);

        final CharsetDecoder lenient = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE);

        final CharsetDecoder assTight = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT);

        /*
         * The string we are testing against
         */
        final String original = "Mémé dans les orties";

        /*
         * Encode the test string into a byte array; then allocate a
         * buffer whose length is that of the encoded array plus 1. 
         * At the end of this buffer, add our poison.
         */
        final byte[] encoded = original.getBytes(charset);
        final ByteBuffer buf
            = ByteBuffer.allocate(encoded.length + 1);
        buf.put(encoded).put(POISON);

        /*
         * For reference, let us print the length of our poisoned 
         * array.
         */
        System.out.printf("Original byte array has length %d\n",
            buf.array().length);

        /*
         * Now, attempt to build a string again from our poisoned 
         * byte input. First by invoking the appropriate String 
         * constructor (note that we specify the charset), then 
         * by trying each of the three decoders we have
         * initialized above.
         */

        System.out.println("--- DECODING TESTS ---");
        
        final String decoded = new String(buf.array(), charset);
        System.out.printf("String constructor: %s\n", decoded);
        
        tryDecoder(lossy, "lossy", buf);
        tryDecoder(lenient, "lenient", buf);
        tryDecoder(assTight, "assTight", buf);
        
        System.out.println("--- END DECODING TESTS ---");

        /*
         * Now try and regenerate our original byte array. 
         * And weep.
         */
        System.out.printf("Reencoded byte array length: %d\n",
            decoded.getBytes(charset).length);

    }

    private static void tryDecoder(final CharsetDecoder decoder,
        final String name, final ByteBuffer buf)
    {
        buf.rewind();
        try {
            System.out.printf("%s decoder: %s\n", name, 
                decoder.decode(buf));
        } catch (CharacterCodingException e) {
            System.out.printf("%s FAILED! Exception follows...\n",
                name);
            e.printStackTrace(System.out);
        }
    }
}

And the output is...

Original byte array has length 23
--- DECODING TESTS ---
String constructor: Mémé dans les orties�
lossy decoder: Mémé dans les orties
lenient decoder: Mémé dans les orties�
assTight FAILED! Exception follows...
java.nio.charset.MalformedInputException: Input length = 1
 at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
 at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:816)
 at com.github.fge.Main.tryDecoder(Main.java:87)
 at com.github.fge.Main.main(Main.java:70)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
--- END DECODING TESTS ---
Reencoded byte array length: 25

As you can see, the default behaviour of String when decoding is to replace malformed byte sequences with a default character sequence (U+FFFD, displayed as a big, fat question mark); you cannot change that default behaviour, but what you can do, as this program also illustrates, is attempt the decoding operation yourself with a CharsetDecoder.

That's all folks...

2014-03-17

Working with the Java 7 file API: recursive copy and deletion

Introduction

In one of my previous posts, I compared the file API in Java 6 and the new file API in Java 7.

You will have noticed, if you have been curious enough to read the Javadoc and try it, that there are still two convenience methods missing from the new API (and they were not in the old API either): recursive copy and recursive deletion of a directory.

In this post, I will show an implementation of both. The basis for both of these is to use the Files.walkFileTree() method. Copying and deleting is therefore "only" a matter of implementing the FileVisitor interface.

There are limitations to both; refer to the end of this post for more details.

Recursive copy

Here is the code of a FileVisitor for recursive copying:

public final class CopyFileVisitor
    implements FileVisitor<Path>
{
    private final Path srcdir;
    private final Path dstdir;

    public CopyFileVisitor(final Path srcdir, final Path dstdir)
    {
        this.srcdir = srcdir.toAbsolutePath();
        this.dstdir = dstdir.toAbsolutePath();
    }

    @Override
    public FileVisitResult preVisitDirectory(final Path dir,
        final BasicFileAttributes attrs)
        throws IOException
    {
        Files.createDirectories(toDestination(dir));
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(final Path file,
        final BasicFileAttributes attrs)
        throws IOException
    {
        Files.copy(file, toDestination(file));
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(final Path file,
        final IOException exc)
        throws IOException
    {
        throw exc;
    }

    @Override
    public FileVisitResult postVisitDirectory(final Path dir,
        final IOException exc)
        throws IOException
    {
        if (exc != null)
            throw exc;
        return FileVisitResult.CONTINUE;
    }

    private Path toDestination(final Path victim)
    {
        final Path tmp = victim.toAbsolutePath();
        final Path rel = srcdir.relativize(tmp);
        return dstdir.resolve(rel.toString());
    }
}

In order to use it, you would then do:

final Path srcdir = Paths.get("/the/source/dir");
final Path dstdir = Paths.get("/the/destination/dir");
Files.walkFileTree(srcdir, new CopyFileVisitor(srcdir, dstdir));

Recursive deletion

Here is the code of a FileVisitor for recursive deletion:

public final class DeletionFileVisitor
    implements FileVisitor<Path>
{
    @Override
    public FileVisitResult preVisitDirectory(final Path dir,
        final BasicFileAttributes attrs)
        throws IOException
    {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(final Path file,
        final BasicFileAttributes attrs)
        throws IOException
    {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(final Path file,
        final IOException exc)
        throws IOException
    {
        throw exc;
    }

    @Override
    public FileVisitResult postVisitDirectory(final Path dir,
        final IOException exc)
        throws IOException
    {
        if (exc != null)
            throw exc;
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
    }
}

To use it:

final Path victim = Paths.get("/directory/to/delete");

Files.walkFileTree(victim, new DeletionFileVisitor());

Limitations

The implementation of recursive copy is limited to paths on the same filesystem. Indeed, you cannot .resolve() a path issued from another filesystem... More on that in another post.

The recursive deletion will stop at the first element (file or directory) which fails to be deleted.

That's all folks...

2014-03-14

Working with files: Java 6 versus Java 7

Introduction

When searching for examples to manipulate files on the net, most examples found use either of:

  • Java 6's old file API,
  • a utility library (Apache commons mostly).

But Java 8 is out now, and it seems people still haven't gotten to grips with Java 7's new file API... This post aims to do two things:

  • describe some of the advantages of the new API;
  • give examples of Java 6 code and the equivalent (better!) Java 7 code.

Hopefully, after reading this, you will ditch the old API! Which you should, really.

Part 1: advantages of the new API

Meaningful exceptions

Oh boy is that missing from the old API.

Basically, any filesystem-level error (permission denied, file does not exist etc) with the old API would throw FileNotFoundException. Not informative at all. Making sense out of this exception basically requires that you dig into the error message.

With the new API, that changes: you have FileSystemException. And its subclasses have equally meaningful names: NoSuchFileException, AccessDeniedException, NotDirectoryException etc.
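As a rough sketch of what this buys you (the class name, method name and sample path are made up for illustration), you can catch the subclasses you care about and fall back to the more general ones:

```java
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

final class MeaningfulExceptionsDemo
{
    // Tries to open a file for reading and reports what went wrong
    static String describe(final Path path)
    {
        try {
            Files.newInputStream(path).close();
            return "readable";
        } catch (NoSuchFileException e) {
            return "no such file";
        } catch (AccessDeniedException e) {
            return "access denied";
        } catch (FileSystemException e) {
            // Any other filesystem-level error
            return "filesystem error: " + e.getClass().getSimpleName();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(final String... args)
    {
        // Hypothetical path, assumed not to exist
        System.out.println(describe(Paths.get("/no/such/file.txt")));
    }
}
```

The most specific exceptions must be caught first, since they all extend FileSystemException, which itself extends IOException.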

Common filesystem operations now throw exceptions

A very common bug with Java 6 code (and which external libraries have struggled to work around) is failing to check the return values of many methods on File objects. Examples of such methods are file.delete(), file.createNewFile(), file.renameTo(), file.mkdirs() etc.

Not anymore. For instance, Files.delete() throws an exception on failure; and the exception thrown will be meaningful! You will know whether, for instance, you attempted to delete a non empty directory.
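The difference can be sketched as follows, on a non empty temporary directory; the class and method names are hypothetical, and the sketch assumes the default filesystem, where deleting a non empty directory is reported with DirectoryNotEmptyException:

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

final class DeleteDemo
{
    // Java 6 style: a failure is a bare "false", with no reason given
    static boolean oldDelete(final Path path)
    {
        return path.toFile().delete();
    }

    // Java 7 style: a failure is an exception which says what happened
    static String newDelete(final Path path)
    {
        try {
            Files.delete(path);
            return "deleted";
        } catch (DirectoryNotEmptyException e) {
            return "directory not empty";
        } catch (IOException e) {
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path dir = Files.createTempDirectory("delete-demo");
        final Path child = Files.createFile(dir.resolve("child.txt"));

        System.out.println(oldDelete(dir));
        System.out.println(newDelete(dir));

        // Clean up: the child first, then the directory
        System.out.println(newDelete(child));
        System.out.println(newDelete(dir));
    }
}
```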

Useful shortcut methods

Just an example: Files.copy()! Several of these shortcut methods will be shown in the examples below.

Less room for errors when dealing with text data

Have you ever been hit by code using a Reader or a Writer without specifying the encoding to use when reading/writing to files?

Well, the good news is that Files methods opening readers or writers (resp. Files.newBufferedReader() and Files.newBufferedWriter()) require you to specify the encoding!

This is also the case of the Files.readAllLines() method.
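For instance, here is a small sketch (hypothetical class and method names) which writes a few lines to a file and reads them back, specifying the charset both times:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class ReadAllLinesDemo
{
    // Writes the lines as UTF-8, then reads them back as UTF-8
    static List<String> writeThenRead(final Path file,
        final List<String> lines)
        throws IOException
    {
        Files.write(file, lines, StandardCharsets.UTF_8);
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path file = Files.createTempFile("lines", ".txt");
        try {
            final List<String> lines
                = Arrays.asList("M\u00e9m\u00e9", "dans les orties");
            System.out.println(lines.equals(writeThenRead(file, lines)));
        } finally {
            Files.delete(file);
        }
    }
}
```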

Advanced usage: filesystem implementations

Here, a filesystem does not mean only an on-disk format of storing files; provided you want to implement it, or find an existing implementation, it can be anything you like: an FTP server, a CIFS filesystem, etc.

Or even a ZIP file (therefore, jars, wars, ears etc as well); in fact, Oracle provides a filesystem implementation for these.
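Here is a minimal sketch of the ZIP case, assuming the JDK's bundled ZIP filesystem provider is available; the class, method and entry names are made up for illustration. Once the filesystem is open, the usual Files methods work on paths inside the ZIP:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;

final class ZipFileSystemDemo
{
    static void writeEntry(final Path zip, final String entry,
        final String content)
        throws IOException
    {
        final URI uri = URI.create("jar:" + zip.toUri());
        // "create" asks the provider to create the ZIP if it is missing
        final Map<String, String> env
            = Collections.singletonMap("create", "true");

        try (final FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {
            Files.write(zipfs.getPath(entry),
                content.getBytes(StandardCharsets.UTF_8));
        }
    }

    static String readEntry(final Path zip, final String entry)
        throws IOException
    {
        final URI uri = URI.create("jar:" + zip.toUri());
        final Map<String, String> env = Collections.emptyMap();

        try (final FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {
            return new String(Files.readAllBytes(zipfs.getPath(entry)),
                StandardCharsets.UTF_8);
        }
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path dir = Files.createTempDirectory("zipfs-demo");
        final Path zip = dir.resolve("demo.zip"); // does not exist yet

        writeEntry(zip, "/hello.txt", "hello");
        System.out.println(readEntry(zip, "/hello.txt"));

        Files.delete(zip);
        Files.delete(dir);
    }
}
```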

Part 2: sample usages

Abstract path names

In Java 6, it was File. In Java 7, you use Path.

For backwards compatibility reasons, you can convert one to the other and vice versa: file.toPath(), path.toFile().

To create a path in Java 7, use Paths.get().
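A short sketch of these conversions (the class name and the path components are arbitrary examples):

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathConversionDemo
{
    // Path -> File -> Path, as you would do around a legacy API
    static Path roundTrip(final Path path)
    {
        final File legacy = path.toFile();
        return legacy.toPath();
    }

    public static void main(final String... args)
    {
        // Paths.get() joins the components with the right separator
        final Path path = Paths.get("some", "dir", "file.txt");

        System.out.println(path.getFileName());
        System.out.println(roundTrip(path).equals(path));
    }
}
```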

Operations on abstract paths

The table below lists operations on File objects, and their equivalent using Java 7's Files class. There are fundamental differences between Java 6 and Java 7 here:

  • these operations in Java 7 require that you create the Path object;
  • as mentioned above, Java 7 operations will throw a (meaningful!) exception on failure; Java 6 operations return a boolean which should be checked for, but which most people forget to check for...
  • For all creation operations using Java 7, you can specify file attributes; those are filesystem dependent, and specifying an attribute which is not supported by the filesystem implementation will throw an unchecked exception.

Java 6 | Java 7 | Differences
file.createNewFile() | Files.createFile() | See above.
file.mkdir() | Files.createDirectory() | See above.
file.mkdirs() | Files.createDirectories() | See above.
file.exists() | Files.exists() | Java 7 supports symbolic links; you can therefore check whether the link itself exists by adding the LinkOption.NOFOLLOW_LINKS option, regardless of whether the link target exists. On filesystems without symlink support, this option has no effect.
file.delete() | Files.delete() | Java 7 also has Files.deleteIfExists().
file.isFile(), file.isDirectory() | Files.isRegularFile(), Files.isDirectory() | Here also, symlink support makes a difference. Symlinks are followed by default; if you do not want to follow them, specify the LinkOption.NOFOLLOW_LINKS option. Java 7 also has Files.isSymbolicLink().
file.renameTo() | Files.move() | Like Java 6, this method will fail to move a non empty directory if the target path is not on the same filesystem (the same FileStore).
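To illustrate the symlink handling, the sketch below (hypothetical names) creates a link whose target does not exist and probes it in three ways. It assumes a filesystem, and a user, allowed to create symbolic links:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.Arrays;

final class SymlinkDemo
{
    // Creates a dangling link in dir, probes it, then removes it
    static boolean[] probeDanglingLink(final Path dir)
        throws IOException
    {
        final Path link = dir.resolve("link");
        Files.createSymbolicLink(link, dir.resolve("missing-target"));
        try {
            return new boolean[] {
                // Follows the link: is there something at the target?
                Files.exists(link),
                // Does not follow: is there a link at all?
                Files.exists(link, LinkOption.NOFOLLOW_LINKS),
                Files.isSymbolicLink(link)
            };
        } finally {
            Files.delete(link); // deletes the link itself
        }
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path dir = Files.createTempDirectory("symlink-demo");
        try {
            System.out.println(Arrays.toString(probeDanglingLink(dir)));
        } finally {
            Files.delete(dir);
        }
    }
}
```

Where symbolic links are supported, the three answers are false, true and true.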

Copying a file

In plain Java 6, you would have to do something like this:

public static void copyFile(final String from, final String to)
    throws IOException
{
    final byte[] buf = new byte[32768];
    final InputStream in = new FileInputStream(from);
    final OutputStream out = new FileOutputStream(to);
    int read;
    try {
        while ((read = in.read(buf)) != -1)
            out.write(buf, 0, read);
        out.flush();
    } finally {
        in.close();
        out.close();
    }        
}

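But even this code is flawed (if opening the output stream fails, in is never closed; if in.close() throws, out is never closed), so if you are stuck with Java 6, do yourself a favour and use Guava, which has (since version 14.0) Closer: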

public static void copyFile(final String from, final String to)
    throws IOException
{
    final Closer closer = Closer.create();
    final RandomAccessFile src, dst;
    final FileChannel in, out;

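    try {
        src = closer.register(new RandomAccessFile(from, "r"));
        // "rw", since RandomAccessFile has no write-only mode
        dst = closer.register(new RandomAccessFile(to, "rw"));
        in = closer.register(src.getChannel());
        out = closer.register(dst.getChannel());
        in.transferTo(0L, in.size(), out);
    } catch (Throwable e) {
        // Let the Closer record the primary exception before closing
        throw closer.rethrow(e);
    } finally {
        closer.close();
    }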
}

With Java 7, this becomes very, very simple:

public static void copyFile(final String from, final String to)
    throws IOException
{
    Files.copy(Paths.get(from), Paths.get(to));
}

Opening a BufferedWriter/OutputStream to a file

In this case, the code is not much shorter for Java 7; but you do get the benefit of better exceptions.

With Java 6:

// BufferedWriter
new BufferedWriter(new OutputStreamWriter(new FileOutputStream(myFile),
    Charset.forName("UTF-8")));
// OutputStream
new FileOutputStream(myFile);

With Java 7:

// BufferedWriter
Files.newBufferedWriter(myFile, StandardCharsets.UTF_8);
// OutputStream
Files.newOutputStream(myFile);

Note that only the simplest form of these methods is presented here. By adding options, you can specify whether you want to fail if the file does not exist, create it only if it does not exist, append to it, and more.
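As a sketch of two such options (the class and method names are made up for illustration), the first method below appends to a file, creating it if needed, and the second one only succeeds if it creates the file itself:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class OpenOptionsDemo
{
    // Appends one line; creates the file first if it does not exist
    static void appendLine(final Path file, final String line)
        throws IOException
    {
        try (final BufferedWriter writer = Files.newBufferedWriter(file,
            StandardCharsets.UTF_8, StandardOpenOption.CREATE,
            StandardOpenOption.APPEND)) {
            writer.write(line);
            writer.newLine();
        }
    }

    // Returns true if the file was created, false if it already existed
    static boolean createExclusively(final Path file)
        throws IOException
    {
        try (final OutputStream out = Files.newOutputStream(file,
            StandardOpenOption.CREATE_NEW)) {
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path dir = Files.createTempDirectory("options-demo");
        final Path log = dir.resolve("log.txt");

        appendLine(log, "first");
        appendLine(log, "second");
        System.out.println(Files.readAllLines(log, StandardCharsets.UTF_8));
        System.out.println(createExclusively(log));

        Files.delete(log);
        Files.delete(dir);
    }
}
```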

Listing all files in a directory

Java 6:

final File rootDir = new File(...);
for (final File file: rootDir.listFiles())
    // do something

Java 7:

final Path rootDir = Paths.get(...);
for (final Path file: Files.newDirectoryStream(rootDir))
    // do something

OK, this is not really shorter. However, the difference in behaviour alone speaks for itself:

  • if the path is not a directory, .listFiles() will happily return null; with Java 7, you get a meaningful exception (including the NotDirectoryException mentioned above);
  • as Java 7's method name says, it will be a stream; .listFiles() swallowed the whole list of files, making it impossible to use if you have a very large number of files.
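One thing to keep in mind is that a DirectoryStream holds an open resource and should be closed once you are done with it. Here is a sketch using try-with-resources, with a glob filter thrown in; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ListDirectoryDemo
{
    // Returns the sorted names of the entries of dir matching the glob
    static List<String> list(final Path dir, final String glob)
        throws IOException
    {
        final List<String> names = new ArrayList<String>();

        // The stream is closed when leaving the try block
        try (final DirectoryStream<Path> stream
            = Files.newDirectoryStream(dir, glob)) {
            for (final Path entry: stream)
                names.add(entry.getFileName().toString());
        }

        // Iteration order is not specified, so sort for stable output
        Collections.sort(names);
        return names;
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path dir = Files.createTempDirectory("list-demo");
        final Path txt = Files.createFile(dir.resolve("a.txt"));
        final Path log = Files.createFile(dir.resolve("b.log"));

        System.out.println(list(dir, "*"));
        System.out.println(list(dir, "*.txt"));

        Files.delete(txt);
        Files.delete(log);
        Files.delete(dir);
    }
}
```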

To be continued...