Monday 15 October 2007

Which character set?

There are many reasons to choose a particular character set. Some believe that choosing widestring everywhere solves all problems. It doesn't. Widestring is slow because it needs to be Microsoft BSTR compatible, and it's also complicated. A widestring can hold either UCS-2 or UTF-16 data, and with UTF-16, a single character can take 4 bytes.
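
To make the 4-byte case concrete, here is a minimal sketch in Python (used only for illustration, it is not Delphi code). It counts the 16-bit units that UTF-16 needs for one character inside and one character outside the Basic Multilingual Plane. The helper name utf16_code_units is made up for this example.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units the text needs in UTF-16."""
    # Encode without a byte order mark, then count 2-byte units.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e6"         # ae ligature, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"  # musical G clef, outside the Basic Multilingual Plane

print(utf16_code_units(bmp_char))     # one code unit, 2 bytes
print(utf16_code_units(astral_char))  # a surrogate pair, 4 bytes
```

Code that assumes "one 16-bit unit is one character" is really assuming UCS-2, and it breaks on the second kind of character.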

One of the problems is the Unicode standard. It allows a character to be built in more than one way. There is no one-to-one match between the binary representation and the look of a character. Therefore, if you want to do Unicode, text handling becomes complicated, no matter what you do.
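
A small illustration of this, sketched in Python with its standard unicodedata module: the letter é can be stored as one code point or as a plain e followed by a combining accent. Both look the same on screen, but they only compare as equal after normalization. The variable names are chosen for this example only.

```python
import unicodedata

precomposed = "\u00e9"  # e with acute accent as a single code point
decomposed = "e\u0301"  # plain e followed by COMBINING ACUTE ACCENT

# The raw strings differ, although both render as the same character.
same_raw = precomposed == decomposed

# After normalizing both to the same form (NFC here), they compare as equal.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(same_raw, same_nfc)
```

This is independent of the encoding: UTF-8, UTF-16 and UTF-32 all have the same problem.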

A good way to choose a character set is by performance and compatibility: how much space does it use, and is it compatible with other software systems? UTF-8 uses very little space if most characters are ASCII characters. If you add some ISO-8859-1 characters, as in many Western European languages, it's still very compact. It only becomes less efficient than other character sets if you want to encode Chinese, Japanese and other languages that rarely use ASCII characters.
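
The space argument is easy to check for yourself. The following is a rough sketch in Python, with sample strings picked for this example, that counts the bytes each encoding needs for an English, a Danish and a Japanese text. The function name encoded_sizes is hypothetical.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text needs in a few Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


samples = {
    "english": "Which character set?",
    "danish": "Rødgrød med fløde",
    "japanese": "文字コード",
}

sizes = {language: encoded_sizes(text) for language, text in samples.items()}

for language, result in sizes.items():
    print(language, result)
```

For the English and Danish samples UTF-8 is the smallest; for the Japanese sample UTF-16 is smaller, because each of those characters takes 3 bytes in UTF-8.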

UTF-8 is also very compatible with other systems. It's a de facto standard for XML files, and it only exists in one version (unlike UTF-16, which exists as UTF-16BE and UTF-16LE). As originally designed, it can encode 31-bit values, much more than the roughly 20 bits that UTF-16 can reach, although RFC 3629 now restricts UTF-8 to the same Unicode range. UTF-8 is also compatible with zip filename encoding (unlike UTF-16 and UCS-2, which are not), and UTF-8 texts can be handled by many applications that were not originally designed to do so.
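
The "one version" point can be shown with a few lines of Python, given here only as a sketch. The same letter produces two different byte sequences in UTF-16, depending on byte order, and reading it with the wrong order silently gives a different character. UTF-8 has only one byte sequence for it.

```python
text = "A"

big_endian = text.encode("utf-16-be")     # high byte first
little_endian = text.encode("utf-16-le")  # low byte first
utf8 = text.encode("utf-8")               # only one possible form

# Decoding with the wrong byte order does not fail, it just yields another character.
misread = big_endian.decode("utf-16-le")

print(big_endian, little_endian, utf8)
print(misread == text)
```

This is why UTF-16 files usually need a byte order mark or an out-of-band agreement about byte order.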

Linux already installs with UTF-8 as default, for most distributions and locales. This makes it possible to zip files in Moscow, send them to Copenhagen, unzip them, and all filenames are preserved. This doesn't work on Windows.

Delphi, being a Windows tool, uses Windows 8-bit and 16-bit character sets by default, in ansistrings and widestrings. There's also a utf8string, but it's actually the same as an ansistring. You can convert from widestring to UTF-8 and back using utf8encode() and utf8decode().

If you store and transmit Unicode information using UTF-8, most of you will experience a reduction in space usage and a reduction in transmission time.

One of the very nice features of UTF-8 is that it can be autodetected. UTF-8 does not allow all possible byte sequences, and the sequences it does use are usually extremely unlikely to occur in text stored in other 8-bit encodings. For 99% of all applications, it is safe to apply autodetection to UTF-8.
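
A minimal sketch of such autodetection, written in Python for illustration: try to decode the bytes as UTF-8, and only if that fails, fall back to a legacy 8-bit encoding. The function name decode_autodetect and the choice of ISO-8859-1 as the fallback are assumptions made for this example. A real application would pick the fallback that matches its own legacy data.

```python
def decode_autodetect(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback encoding."""
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy 8-bit encoding instead.
        return data.decode(fallback, errors="replace")


word = "Rødgrød"
from_utf8 = decode_autodetect(word.encode("utf-8"))
from_latin1 = decode_autodetect(word.encode("iso-8859-1"))

print(from_utf8 == word, from_latin1 == word)
```

Pure ASCII data is valid in both encodings and decodes to the same text either way, so it needs no special handling.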
