Monday 5 May 2008

Goodbye UTF-16

Google's data seem to indicate that UTF-8 is about to take over the WWW. I guess most Linux distributions have already made the switch to UTF-8, making US-ASCII, ISO-8859-1, UCS-2, UTF-16 and others seem like technology of the past. I wonder who is still using these? I do. I'm a Windows programmer. :-(


Xepol said...

What seems clear from the chart is that people who previously posted ASCII pages have just marked them as UTF-8, since for ASCII (characters 0-127 only), the two are interchangeable.

Does it mean UTF-8 has suddenly seen a huge surge in support? I doubt it. More likely, it represents a surge in the number of people putting utf-8 in their HTML headers for English ASCII pages instead.
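
The interchangeability claim for characters 0-127 can be checked with a short Python sketch. The sample strings are made up for illustration; only the standard "ascii" and "utf-8" codecs are assumed.

```python
# Sketch: for code points 0-127, ASCII and UTF-8 produce identical bytes.
text = "Hello, world!"  # hypothetical sample, pure ASCII

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes  # byte-for-byte comparison

# Outside that range the two part ways: UTF-8 uses multi-byte sequences,
# while the ASCII codec cannot represent the character at all.
accented = "café"  # hypothetical sample with one non-ASCII character
utf8_accented = accented.encode("utf-8")
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

So relabeling a pure-ASCII page as UTF-8 changes nothing on the wire, which is why the relabeling would be cheap to do.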

Lars D said...

I know lots of website providers that have gone utf-8 in order to service customers from all kinds of places. Blogger, for instance, uses utf-8 by default, and even local website providers like use utf-8.

Even Firebird does not support any Unicode encoding other than UTF-8 - UTF-16 is simply not a choice.

And I know of no website that uses UTF-16, the encoding used by Win32, .NET and Java.

Patrick said...

Erhm? The GoogleBlog didn't really mention the amount of UTF-16 usage in their measurements, so you can't determine if its use is declining!

Also, classifying UTF-16 as a technology of the past is a bit strange, given that it's just another encoding of the same Unicode code points that UTF-8 represents. (UCS-2 and lots of ANSI code pages are going downhill, I give you that.)
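
A minimal sketch of that point in Python, assuming a made-up sample string: the same sequence of code points goes in, and either encoding gives it back unchanged.

```python
# Sketch: UTF-8 and UTF-16 are two byte encodings of the same code points.
text = "Grüße, 世界"  # hypothetical sample string

code_points = [ord(ch) for ch in text]   # the abstract Unicode code points

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")   # byte order chosen explicitly

# Both byte sequences decode to the identical string.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16-le")

# Converting between the two is lossless in either direction.
transcoded = utf16_bytes.decode("utf-16-le").encode("utf-8")
```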

And being a Windows programmer implies that all the APIs that you call are internally converted from ANSI to UTF-16 (wherever that's applicable, of course). Maybe your toolkit only supports ANSI code pages, but that won't change the fact that Windows is inherently a Unicode-enabled OS.

So, "Goodbye UTF-16" is a bit harsh...

Anonymous said...

You have missed the point. This is another case of inappropriate use of statistics. The original author made no such statements, on the contrary…
The graph does not show that UTF-16 is going away. It only proves that more and more people are following the standards.
The only people who would be affected by, or use, UTF-16 are the ones in colors other than red and blue, and the graph shows that this group stays at almost the same level.
It also shows that more and more business and personal pages are being globalized by switching to English and/or by adding alternative pages in English.
To validate the site, one will most likely declare UTF-8 for English-based pages.
People who use commercial software like Dreamweaver or any web-based generator are also almost automatically forced to use a “common” code page.

Lars D said...

@Patrick: I consider UTF-16 really annoying. It exists as UTF-16LE and UTF-16BE, which means that you need to convert when exporting/importing data, and it's a variable-length character encoding, using 2 or 4 bytes per Unicode code point.

UTF-8 is also a variable-length character encoding, but at least it's universal and only exists in one version. Also, it's ASCII-compatible.
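
To make the comparison concrete, here is a small Python sketch. The sample characters and the helper name encoded_sizes are made up for illustration; the codec names are the standard Python ones.

```python
# Sketch: compare UTF-16 (both byte orders) with UTF-8 for a few characters.
samples = ["A", "é", "€", "😀"]  # hypothetical sample characters

def encoded_sizes(ch):
    """Return the byte counts for one character as (utf16, utf8)."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))

for ch in samples:
    le = ch.encode("utf-16-le")  # little-endian byte order
    be = ch.encode("utf-16-be")  # big-endian byte order
    u8 = ch.encode("utf-8")      # no byte-order variants to choose from
    print(f"U+{ord(ch):04X}  LE={le.hex()}  BE={be.hex()}  UTF-8={u8.hex()}")

# Reading UTF-16 bytes with the wrong byte order raises no error here;
# it silently produces a different character.
wrong = "A".encode("utf-16-le").decode("utf-16-be")
```

The last line is the import/export problem in miniature: "A" written as UTF-16LE and read as UTF-16BE comes back as U+4100.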

I can give you one example of where Windows fails to do things well: since zip files do not support UTF-16 in filenames, Windows cannot save filenames with non-ANSI characters correctly to zip files - unlike Linux.

Also, there are lots of non-Unicode apps out there on Windows, and if you want to ensure compatibility with those, don't feed them filenames or data that require Unicode capabilities.

Besides that, we still regularly encounter DOS character sets when integrating with other Win32 apps.

I cannot think of any Linux app that would reject Unicode data. Even Kylix 1.0 worked with UTF-8, even though it wasn't designed to do so.