Saturday 23 January 2010

ANSI in Delphi is not about the character set ANSI

One of the most frequent misunderstandings I have seen about Delphi's Unicode migration is that many people take the Ansistring of Delphi 2007 and older, and all the Ansi functions in the APIs, to be about the ANSI character set.

If you live in the USA or any other country that uses Windows-1252 (aka ANSI) as the default character set, it all fits together: ansistring contains strings using the ANSI character set. However, in the rest of the world, things are much more complicated. The default 8-bit character set in Windows is not Windows-1252 in countries like Greece, Hungary, Russia, Japan and China. These countries use letters that are encoded as values >=128, or sometimes as multiple bytes. This means that:

* Document filenames inside ZIP-files probably use characters that are not shown correctly if the zip file is opened on a U.S. computer

* Uppercase() and similar string operations do not work on normal ansistring texts.

* Simple Windows text files are not compatible with PCs from countries that use other character sets.

* Ansi* functions exist, but don't use the ANSI character set
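The byte-level effect behind these points can be sketched outside Delphi. The snippet below is only an illustration in Python, using its built-in codecs for the Windows code pages (cp1252, cp1253, cp1251, cp932); the variable names are made up for the example. It shows one and the same byte turning into different letters depending on the code page, and a Japanese text needing two bytes per character.

```python
# One byte, three Windows code pages, three different letters.
raw = bytes([0xE6])

as_1252 = raw.decode("cp1252")  # Western European: U+00E6, Latin small letter ae
as_1253 = raw.decode("cp1253")  # Greek:            U+03B6, Greek small letter zeta
as_1251 = raw.decode("cp1251")  # Cyrillic:         U+0436, Cyrillic small letter zhe

for name, text in (("cp1252", as_1252), ("cp1253", as_1253), ("cp1251", as_1251)):
    print(name, text, hex(ord(text)))

# In the Japanese code page a single character can take more than one byte.
japanese = "\u65e5\u672c"            # the two kanji of "Nihon" (Japan)
encoded = japanese.encode("cp932")
print(len(japanese), "characters ->", len(encoded), "bytes")
```

A text file or a ZIP filename carries only the bytes, not the code page, which is why it looks different on a PC with another default character set.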

For Delphi 1-2007 developers, it has always been important to use Uppercase(ansistring), Lowercase(ansistring) etc. for machine-readable text (identifiers etc.), and AnsiUppercase(ansistring), AnsiLowercase(ansistring) etc. for all human text (text from TEdit etc.) in order to have an app that localizes well. AnsiUppercase will use the current local character set for its conversion, whether or not that is Windows-1252, so that AnsiUppercase('æ') becomes 'Æ' etc. Basically, all the functions that are prefixed with "Ansi" are the locale-sensitive versions, whereas the functions without the Ansi prefix are useful for machine-readable stuff, where the result needs to be 100% deterministic and locale-independent.
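The difference between the two families of functions can be imitated in a few lines. This is a rough Python analogy, not Delphi code: ascii_upper and codepage_upper are hypothetical helpers written for this sketch, the first standing in for the locale-independent Uppercase(), the second for an uppercase conversion that depends on which code page the bytes are in. Python's own upper() follows Unicode rules rather than the Windows locale, so the analogy is only approximate.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only the letters a-z and leave every other character alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


def codepage_upper(raw: bytes, codepage: str) -> bytes:
    """Uppercase 8-bit text by interpreting the bytes in the given code page."""
    return raw.decode(codepage).upper().encode(codepage)


word = "caf\u00e9"            # "cafe" with an acute accent on the e
print(ascii_upper(word))      # the accented letter stays lowercase
print(word.upper())           # the accented letter is uppercased too

# The same input byte gives a different result byte in each code page.
print(codepage_upper(b"\xff", "cp1252").hex())  # y with diaeresis
print(codepage_upper(b"\xff", "cp1251").hex())  # Cyrillic ya
```

The first function always gives the same answer on every machine, which is what identifiers need. The second gives an answer that is only right if the code page matches the text, which is what human-readable text needs.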

This also means that every string variable in a good app would contain either locale-independent strings or locale-dependent strings, but not both. It was important to make this distinction in order to know whether to use Uppercase() or AnsiUppercase() on the variable.

With Delphi 2009, Unicode is often mentioned as a localization thing, so many people struggle to get this right. However, it's still the same problem: If your app is only meant to work in the USA, you can disregard all the localization stuff, and it's VERY easy. If your app was already well internationalized, the conversion to Unicode is also rather easy. It only gets really complicated if your app was not internationalized, and now you want it to be. But that's not about Unicode strings - it's about internationalization.

4 comments:

Chris said...

There is no 'misunderstanding' insofar as people use the term 'ANSI codepage' to cover 8 bit codepages as a group, since that is standard usage in a Windows context (what do you think the 'A' suffix stands for?). Moreover, even in your preferred restricted sense, it is not really correct to refer to Windows Latin-1 as 'the ANSI character set', since it differs from the finished ANSI standard whose draft it was based upon. (Not MS's fault - they wanted to ship.)

Also, you seem to be conflating Latin-1 with ASCII, e.g. when implying English does not need characters with ordinal values greater than 128. This is not true however. Consequently, using AnsiUpperCase rather than just UpperCase was always necessary *even for English* human-readable text - try calling UpperCase('café') to see what I mean.

Unknown said...

Note that Latin-1 is not the same as Windows-1252 (which is the codepage most often used when people refer to Latin-1). Latin-9 comes closer, but is still not the same. See http://en.wikipedia.org/wiki/Latin_1, and http://en.wikipedia.org/wiki/Latin_9 http://en.wikipedia.org/wiki/Windows-1252

–jeroen

OckZoNoz said...

Also note that since d2009, AnsiUpperCase works on UnicodeStrings, not AnsiStrings. This is really confusing, but was done to ease the migration of older source code.
http://www.bobswart.nl/weblog/Blog.aspx?RootId=5:3029

Lars D said...

@Chris: You are basically repeating my point. However, I'm not mixing Latin-1/ISO-8859-1 with ASCII, I actually don't even mention these two encodings.

@mignoz: Using AnsiUppercase with Unicode makes perfect sense for those that already internationalized their apps in Delphi 2007 and previous versions. All human readable texts needed Ansi* functions before to work correctly, and that did not change with UnicodeString. I cannot think of an example where a text would previously be stored in an ansistring and use AnsiUppercase(), and would still be stored in an ansistring in Delphi 2009 even though AnsiUppercase() still applies.