Sunday 14 June 2009

Do not set TStringList.sorted:=True with default comparison

Now, this is a nasty bug, especially for Delphi 2009, reported by an anonymous user in my other blog post here:

The Windows API CompareStr() in Windows XP SP3, using Danish locale (and probably most others), thinks that 59A < 59-A < -59-A < 5-9A < 59-A.

TStringList.Find uses binary lookups on a sorted list of strings, which means that for 1024 items in a TStringList, it does not need to make more than 10 string comparisons in order to find the index of a specific string. This is fast, but it requires the list to be sorted in a deterministic way, and it requires CompareStr() to be able to tell, what direction to go to find the string. On a Danish Windows XP SP3, this code will trigger error #1 and error #3:

var
list1: TStringList;
idx1:Integer;
begin
list1 := TStringList.create;
try
list1.Sorted := True;
list1.Add('59A');
list1.Add('-59-A');
list1.Add('5-9A');
list1.Add('59-A');
if list1.IndexOf('5-9A')=-1 then
ShowMessage ('Error #1: IndexOf does not work.');
if list1.Find('5-9A',idx1) then begin
if list1[idx1]<>'5-9A' then
ShowMessage ('Error #2: Find failed and found the wrong string.');
end else
ShowMessage ('Error #3: Find failed because it did not find the string which is present.');
finally
FreeAndNil (list1);
end;
end;


Error #1 is triggered, because .IndexOf uses .Find on sorted lists in Delphi 2009. CodeGear cannot fix this problem easily, because it's actually a bug in Windows. This bug is worse for Delphi 2009, because it optimizes the .IndexOf function by using .Find, for sorted strings.

You can make Windows Explorer demonstrate the same problem. Create 4 files like this:

Date             Size  Name  
14-06-2009 09:00 54 -59-A
14-06-2009 08:58 0 5-9A
14-06-2009 09:01 4 59-A
14-06-2009 09:02 168 59A


Then, show the folder using Windows explorer in Detailed view, and do this:

* Click the column Size, to sort by size
* Click the column Name, to sort by name.
* You can now see that the order is 5-9A, 59A, 59-A, -59-A
* Click the change date column, to sort by change date
* Click the column Name, to sort by name
* The order is now 59A, 59-A, -59-A, 5-9A

I didn't try this on 32-bit Vista or 64-Vista, yet.

14 comments:

Dag Hovden said...

Same behaviour under Vista-32/SP1. A DOS windows, using 'dir /on' gives this

59A.txt
-59-A.txt
5-9A.txt
59-A.txt

whereas the 'sort' command gives this:

-59-A.txt
5-9A.txt
59-A.txt
59A.txt

which is correct. So it seems that 'sort' does its own sorting and 'dir' is using Windows.

alex said...

I thought each locale defines own rules about this. For example 'a' > 'A' normally but in some locales 'a' < 'A'.

It is also not clear to me why is TStringList not working. Since if the same CompareStr is used for sorting and for lookup, the rules are the same and thus the search should also work.

Weird ..

Lars D said...

Binary lookup requires that there is a predefined order, so that:

if A < B < C, we can assume, that A < C

But the bug in Windows means that CompareStr does not comply with this requirement. Therefore, the default string comparison algorithm in TStringList, which uses the Windows API, is not suitable for .IndexOf() or .Find() on sorted lists.

Andreas Hausladen said...

Adding the SORT_STRINGSORT flag to the CompareString() call works for this case but might break other cases.

alex said...

if CompareStr is used for both sorting and IndexOf and it is stable (aka produces the same ordering). It should not matter to the binary search.

I've looked into TStringList and those use the same function for comparison. So the only possible explanations is that CompareStr in these cases is not stable which may mean a system bug indeed.

Anonymous said...

The discovery of the undesirable tstringlist sorting/searching behavior was made by programmers at Semaphore Corporation (semaphorecorp.com) as a result of working with US Postal Service data files. While building a tstringlist of over 160,000 maximum and minimum apartment numbers in the country, it was found that apartment 59-A appeared to be a special case: it failed to be located by .IndexOf(). Further investigation revealed the strange 59A < 59-A < -59-A < 5-9A < 59-A relation. (The USPS data file uses a leading "-" to indicate a leading zero, so -59-A stands for apartment 059-A.) It turned out 59-A wasn't really special, it just happened that the combination and order of numbers that were input made 59-A the entry that happened to reveal the weakness of CompareString(). If the apartment numbers had been loaded in a different order, a different apartment might have been unlocatable, or they all might even have ended up findable.

The Windows documentation warns that CompareString()'s default behavior is a "word sort" where hyphens and apostrophes are treated special, and that non-identical strings can be "equal" for collation. Unfortunately the documentation doesn't reveal you can end up with circular relations like 59A < 59-A < -59-A < 5-9A < 59-A.

Overall, it would probably have been better for tstringlist's implementation to use the SORT_STRINGSORT option when it (and other Ansi... routines) eventually call CompareString().

Anonymous said...

Note: This isn't limited to D2009.

TStringList.IndexOf()has used TStringList.Find on sorted lists for a long time (D6 at least, and if my memory is correct before that)

Anonymous said...

This long ago known problem for RUSSIAN code poage. In ansi version RTL I happened to to write the hook of winapi functions CompareStringA, lstrcmpA, lstrcmpiA. The table of symbols for comparisons I took from ORACLE. Bad that no possibility to adjust this in RTL.

Eric said...

Long time issue indeed.

We've been patching the RTL here for years (redirecting CompareText to a custom implementation) to avoid that problem and its variants .

Q++Studio said...

Another example is the Finnish (Finland) locale (tested under Windows 2000, probably also true for XP), which somehow decides that sorting the strings SVK and SWE will give : SWE followed by SVK. Olivier Beltrami

Lars D said...

'SWE'<'SVK' is also true for Danish locale, but that isn't a problem.

Holger Flick said...

I will be looking into this. Do you happen to have the source code for the testcase or a QC bug report?

Thanks!

Anonymous said...

Comparison uses weight a symbol.
http://unicode.org/reports/tr10/
Any one of parameter of the functions CompareString does not switch off use all of weight symbols.

fabricioaraujo_rj said...

Add Brazilian Portuguese locale to the list. Every time I sort the Dag's list, I get a different result.
Damned Windows.