Wednesday 26 March 2008

Don't use TStringList for machine-readable text

TStringList is one of the most used classes in Delphi. It is very convenient for storing strings for the user (TMemo.Lines), storing parameters for components (TIBDatabase.Params), objects and many other items.

However, there are several problems with it. It's slow when sorting data (it uses Win32 API for comparing strings), but the dangerous part is that sorting and indexing is localized. This means that this code fails on my computer, but works on an American PC:

sl:=TStringList.Create;
sl.Sorted:=True;
sl.Add ('AA');
sl.Add ('AB');
Assert (sl.Strings[0]='AA');

The reason is simple. This is the Danish alphabet:

ABCDEFGHIKLMNOPQRSTUVXYZÆØÅ

By tradition, the last letter Å can also be written AA, and you can see how these two ways of spelling are mixed well on the homepage of the city of Århus. The correct sorter in Danish language is therefore:

Aachen
Aalto
Berlin
Copenhagen
Dresden
Essen
Frederikshavn
Aabenraa
Aalborg
Aarhus

In the first two words, AA means A and then A. In the last three words, AA means Å, which is the last letter in the alphabet. However, Windows doesn't know when AA means Å and when it means A A, so it always assumes that AA means Å, and always puts AA last.

Let's assume that you want to use a TStringList to save some kinds of codes in a specific order, like ATC codes. The first codes are:

A01AA01 Sodium fluoride
A01AA02 Sodium monofluorophosphate
A01AA03 Olaflur
A01AA04 Stannous fluoride
A01AA30 Combinations
A01AA51 Sodium fluoride, combinations
A01AB02 Hydrogen peroxide
A01AB03 Chlorhexidine
A01AB04 Amphotericin B
A01AB05 Polynoxylin

This is the Danish TStringList (and Windows) sort order:

A01AB04 Amphotericin B
A01AC02 Dexamethasone
A01AA30 Combinations
A02AB03 Aluminium phosphate
A02BA05 Niperotidine
A02AA05 Magnesium silicate

If you want to avoid that, then don't use TStringList.

19 comments:

unused said...

Very interesting tip. I would not have expected TStringList to sort AA after AB depending on the localization. Although it would appear there is an easy workaround of using the CustomSort method and performing a non-localized compare. Either that, or just don't sort.

Unknown said...

That is all correct, yes. However there are other reasons to maintain a sorted TStringList. If the string list is sorted, then the Find() method will perform a binary search for a string which is orders of magnitude faster than a strict linear search. In that case it matters no *how* the list is sorted but rather that it *is* sorted. Also when searching the list, as long as the same comparison routing is used as when the list was sorted, then you will find the string.

So how would a computer know that some sequences of 'AA' are just that, 'A' followed by 'A' and others are the same as the 'Å' character? Would the sort algorithm have to keep a dictionary of words? Are there certain letter patterns that that indicate this?

Allen.

Anonymous said...

Are you using Windows Vista?

We Germans have the same problem. Our applications started to fail when they were run under Vista. They are perfectly working under Windows 9x/ME, NT4, W2k, XP.

Lars D said...

It is the same behavior on all Windows versions.

Anonymous said...

Hi, I entered a very closely related bug into QC a few years ago, which is closed and marked as designed:

http://qc.codegear.com/wc/qcmain.aspx?d=7676

Anonymous said...

Um, OK, so TStringList suffers this "problem" (some might see it reflecting locale as a desirable behaviour).

But you don't explain how you address it.

In your example list of names (?) you appear to be providing a set of strings which contain different usages of the same character sequence, i.e.:

'A' 'a' - meaning 'Aa'
and 'A' 'a' - meaning that other thing (I.m sure it has a name). :)


How is *any* algorithm supposed to discriminate between these two usages?

How do you discriminate in order to avoid Windows sort behaviour?

The only way I can see is if you have some dictionary attached to the algorithm that identifies the usages of such ambiguous sequences, in which case performance concerns w.r.t TStringList would surely appear trivial compared with the overhead of using such a dictionary.


Surely the only way around this is to not use ambiguous character sequences and to stick to Unicode.

At which point, again, issues with TStringList are surely moot until such time that it (or something else) supports Unicode.



If reading between the lines, I think the (unexpected) sensitivity to Windows locale is the issue, rather than the incorrect interpretation of potentially ambiguous character sequences.

i.e. you want 'AA' to sort "ahead of" 'AB', irrespective of locale.


If so, then then there is no reason to criticise TStringList so harshly and certainly no reason to stop using it.

At least, not completely.


TStringList only uses Windows' sort ordering because the Sort method employs a string comparison function that uses Windows' sort ordering.

You could do a locale independent (i.e. ASCII) sort quite easily by calling CustomSort() with your own ASCII comparison function:

StringList.CustomSort(ASCIICompare);

Using whatever ultra-efficient ASCII string comparison mechanism you prefer in that ASCIICompare() function.

And if you need this frequently enough, subclass TStringList and replace the Sort method so you don't need to keep cumbersomely calling CustomSort().


Or better yet, subclass TStringList and make the Sort method behaviour subject to some property:

StringList.SortMethod := smASCII;
StringList.Sort;

StringList.SortMethod := smANSI;
StringList.Sort;

etc.

Anonymous said...

I think the most common use of a sorted TStringList is to put some strings in it, often with a TObject attached, and then later retrieve individual strings and TObjects using IndexOf. For the vast majority of this kind of use, the sort order used is irrelevant. What is much more relevant, of course, is that IndexOf can find the string if it is in list. Today this can not be guaranteed because TStringlist allways uses CompareString with LOCALE_USER_DEFAULT.

TStringList should get a new proprty that can make CompareString use LOCALE_INVARIANT when needed. It should default to using LOCALE_USER_DEFAULT, even if it's really wrong for most purposes, so as not to break existing code.

Anonymous said...

Per haps a solution would be nicwe .. here you go ...

type
TCompareStrings = function(const S1, S2: string): Integer of object;

TSortKind = (skLocale, skAlphaNumeric);
TStringList = class(Classes.TStringList)
private
FLocale: Integer;
FSortKind: TSortKind;
FOnCompareStrings: TCompareStrings;
procedure SetLocale(const Value: Integer);
procedure SetSortKind(const Value: TSortKind);
procedure SetOnCompareStrings(const Value: TCompareStrings);
protected
function CompareStrings(const S1, S2: string): Integer; override;
function CompareStringsLocale(const S1, S2: string): Integer;
function CompareStringsAlphaNumeric(const S1, S2: string): Integer;
public
constructor Create;
property Locale: Integer read FLocale write SetLocale;
property SortKind: TSortKind read FSortKind write SetSortKind;
property OnCompareStrings: TCompareStrings read FOnCompareStrings;
end;


{ TStringList }

function TStringList.CompareStrings(const S1, S2: string): Integer;
begin
Result := FOnCompareStrings(S1, S2);
end;

function TStringList.CompareStringsAlphaNumeric(const S1, S2: string): Integer;
begin
Result := CompareStr(S1, S2);
end;

function TStringList.CompareStringsLocale(const S1, S2: string): Integer;
begin
Result := CompareString(FLocale, 0, Pointer(S1), Length(S1), Pointer(S2), Length(S2)) - 2;
end;

constructor TStringList.Create;
begin
inherited;
Locale := LOCALE_USER_DEFAULT;
SortKind := skAlphaNumeric;
end;

procedure TStringList.SetLocale(const Value: Integer);
begin
FLocale := Value;
end;

procedure TStringList.SetOnCompareStrings(const Value: TCompareStrings);
begin
FOnCompareStrings := Value;
end;

procedure TStringList.SetSortKind(const Value: TSortKind);
begin
FSortKind := Value;

if FSortKind = skLocale then
FOnCompareStrings := CompareStringsLocale
else
FOnCompareStrings := CompareStringsAlphaNumeric;

end;


and here is an example of use ...

procedure TForm38.FormCreate(Sender: TObject);
var
StringList: TStringList;
begin
StringList := TStringList.Create;
StringList.SortKind := skAlphaNumeric;
StringList.Sorted := True;
StringList.Add('A01AA01 Sodium fluoride');
StringList.Add('A01AA02 Sodium monofluorophosphate');
StringList.Add('A01AA03 Olaflur');
StringList.Add('A01AA04 Stannous fluoride');
StringList.Add('A01AA30 Combinations');
StringList.Add('A01AA51 Sodium fluoride, combinations');
StringList.Add('A01AB02 Hydrogen peroxide');
StringList.Add('A01AB03 Chlorhexidine');
StringList.Add('A01AB04 Amphotericin B');
StringList.Add('A01AB05 Polynoxylin');
Memo1.Lines.Assign(StringList);
end;

Lars D said...

In our source code we created a TFastAnsiStringList. It is characterized by:

1) Guarantee of always sorting the binary way.

2) IndexOf uses binary lookup when the list is sorted. This may also happen if .sorted=false, if the items were added to the list in a sorted order.

3) It is guaranteed to be ansistring when Tiburon introduces string=UnicodeString, which also means that it doesn't derive from TStrings, which will go UnicodeString

It has introduced significant speed increases in some parts of our app, and it's a safe choice for non-human text.

Anonymous said...

"2) IndexOf uses binary lookup when the list is sorted. This may also happen if .sorted=false, if the items were added to the list in a sorted order."


Interesting. I think I have an idea how you achieved this when sorted is FALSE (thinks: cost of testing whether something has been added - or even inserted - "in order" is likely to be cheaper than testing that ALL items are in order. A LOT cheaper).

I wonder then though if having "Sorted" as a write-able property is needed at all (perhaps it isn't in your case, perhaps what I'm about to suggest is already in place in your implementation...).

i.e. Sorted would be an indicator only. An empty list would be "Sorted = TRUE" and that would only flip to FALSE if an item was Added (or Inserted) out of sequence.

Of course, once it has flipped FALSE it could be reset by calling "Sort".

Now that that thought has occured to me I really like it, rather than having two potentially contradictory interfaces:

sl.Sorted := FALSE;
:
sl.Sort;

// sl.Sorted still says "FALSE"

To complete things you would probably want an "AddSorted()" method for times when you want to build a sorted list from out-of-order inputs (although ime this is actually rarer than we might expect and the performance gains not usually worth worrying about - vs adding unsorted then sorting the entire list in one final coup de grace)

:)



Your final point about being protected from the Unicodification of TStrings/TStringList is a curious one, for me.

I'm getting a lot of flack for thinking that the unilateral and universal nature of the Unicode change in Tiburon (as currently described by CodeGear) is going to catch a lot of people out and potentially do harm to Delphi.

A lot of people seem to think it's going to be a huge non-event and nothing to be getting worried about it. So I would be interested to know why you feel that this "protection" against Unicode TStrings is important.

Lars D said...

The binary lookup even when sorted=false is simple: When you use .Add or .AddObject, our compare the added string with the last string in the list. If the list was previously in order, and the new string is >= the previously largest string, then it is still in order. Very simple.

I assume that the costs for doing that is very low, because the string that was last added is most likely already in the CPU cache, and if you add a large number of strings, the last string is most likely in the cache, too. The comparing of these two strings will therefore most likely be extremely fast.

In the case where you are importing data from somewhere, it means that indexof will perform normally if the data is unsorted, but if the data is pre-sorted, which has a non-zero chance, indexof performance increases dramatically.

If the data, you're importing, is not sorted, the auto-sort detection system very quickly finds out and then stops comparing the added string to the last string.

However, setting .sorted:=True still makes sense, because it enforces correct placement of all new strings. So it is different from the auto-sort detection system.

Lars D said...

I think the Unicode transition will be relatively smooth, and I surely look forward to it. However, sometimes you want your strings to:

1) Occupy as little memory as possible, and ansistrings will use 50% of the space of UTF-16 encoded strings. Also, memory=speed, so using less memory means faster execution.

2) Be pansichar() compatible for some APIs

I expect our TFastAnsiStringlist to be 100% compatible with the Unicode improvements in Delphi, especialy because we only use it to store machine readable data, usually ascii.

Anonymous said...

The following fails using Delphi 6 on XP:

var list: tstringlist;
begin
list := tstringlist.create;
list.sorted := true;
list.add('59A');
list.add('5-9A');
list.add('59-A');
list.add('-59-A');
if list.indexof('59-A') = -1 then showmessage('bug');
list.free;
end;

Anonymous said...

You would think that adding the same four values to two different sorted lists would end up generating equal list, but it doesn't (if the values are added in different orders):

var list1,list2: tstringlist; i: integer;
begin
list1 := tstringlist.create; list1.sorted := true;
list2 := tstringlist.create; list2.sorted := true;
list1.Add('59A'); list1.Add('-59-A'); list1.Add('5-9A'); list1.Add('59-A');
list2.Add('59A'); list2.Add('5-9A'); list2.Add('59-A'); list2.Add('-59-A');
for i := 0 to list1.count-1 do
if list1[i] <> list2[i] then showmessage('unequal');
list1.free;
list2.free;
end;

Anonymous said...

A default tstringlist with .Sort=true uses AnsiCompareText()
which uses CompareString(). As called, CompareString() thinks

59A < -59-A < 5-9A

Where does CompareString() insert 59-A in that ordered list?
Unfortunately, CompareString() thinks

59A < 59-A < -59-A

but also thinks

59-A > 5-9A !!!

In other words,

59A < 59-A < -59-A < 5-9A < 59-A

appears consistent when examining adjacent elements with CompareString(), yet 59-A is in two different list positions. This means you can insert the same values into two different sorted tstringlists but end up with unequal lists if the insertions weren't made in the same order, even if you re-sort! (Because comparisons will be subject not just to the values being compared, but also where they currently happen to be in the list.)

For example, insert 59A, 5-9A, 59-A, -59-A into sorted list 1.
The list becomes 59A, -59-A, 5-9A, 59-A.
.Find() or .IndexOf() will fail to locate 59-A.

Now insert the same values into sorted list 2, but insert in the order
59A, -59-A, 5-9A, 59-A.
The list becomes 59A, 59-A, -59-A, 5-9A.
.Find() and .IndexOf() can locate 59-A.

Try .Sort on either list, and the items won't change their order. So you end up with two "sorted" lists containing the same values,
yet in different order.

Lars D said...

CompareString() uses the Windows API. I am therefore very interested to know: Which Windows version are you using? (2000, XP, Vista, Win7), and whether it is 32-bit or 64-bit.

It would also be interesting to know what Delphi version you used to test it...

Lars D said...

I can reproduce the bug on Windows XPSP3 with Delphi 2009, but I can also reproduce the bug with Windows Explorer like this:

Create 4 files like this:

14-06-2009 09:00 54 -59-A
14-06-2009 08:58 0 5-9A
14-06-2009 09:01 4 59-A
14-06-2009 09:02 168 59A

Then, show the folder using Windows explorer in Detailed view, and do this:

* Click the column Size, to sort by size
* Click the column Name, to sort by name.
* You can now see that the order is 5-9A, 59A, 59-A, -59-A
* Click the change date column, to sort by change date
* Click the column Name, to sort by name
* The order is now 59A, 59-A, -59-A, 5-9A

It is sad to see that such a fundamental API call does not work correctly, for an OS version that is still sold in large numbers in the shops.

Lars D said...

I just note that the implications of this bug is much bigger than just layout of output: The .Find method asserts that a binary lookup can be made - but you cannot rely on that if the comparison does not work.

Lars D said...

I just made a new blog post about this issue.