The Windows API CompareStr() in Windows XP SP3, using Danish locale (and probably most others), thinks that 59A < 59-A < -59-A < 5-9A < 59-A.
TStringList.Find uses binary lookups on a sorted list of strings, which means that for 1024 items in a TStringList, it does not need to make more than 10 string comparisons in order to find the index of a specific string. This is fast, but it requires the list to be sorted in a deterministic way, and it requires CompareStr() to be able to tell, what direction to go to find the string. On a Danish Windows XP SP3, this code will trigger error #1 and error #3:
var
list1: TStringList;
idx1:Integer;
begin
list1 := TStringList.create;
try
list1.Sorted := True;
list1.Add('59A');
list1.Add('-59-A');
list1.Add('5-9A');
list1.Add('59-A');
if list1.IndexOf('5-9A')=-1 then
ShowMessage ('Error #1: IndexOf does not work.');
if list1.Find('5-9A',idx1) then begin
if list1[idx1]<>'5-9A' then
ShowMessage ('Error #2: Find failed and found the wrong string.');
end else
ShowMessage ('Error #3: Find failed because it did not find the string which is present.');
finally
FreeAndNil (list1);
end;
end;
Error #1 is triggered, because .IndexOf uses .Find on sorted lists in Delphi 2009. CodeGear cannot fix this problem easily, because it's actually a bug in Windows. This bug is worse for Delphi 2009, because it optimizes the .IndexOf function by using .Find, for sorted strings.
You can make Windows Explorer demonstrate the same problem. Create 4 files like this:
Date Size Name
14-06-2009 09:00 54 -59-A
14-06-2009 08:58 0 5-9A
14-06-2009 09:01 4 59-A
14-06-2009 09:02 168 59A
Then, show the folder using Windows explorer in Detailed view, and do this:
* Click the column Size, to sort by size
* Click the column Name, to sort by name.
* You can now see that the order is 5-9A, 59A, 59-A, -59-A
* Click the change date column, to sort by change date
* Click the column Name, to sort by name
* The order is now 59A, 59-A, -59-A, 5-9A
I didn't try this on 32-bit Vista or 64-Vista, yet.
14 comments:
Same behaviour under Vista-32/SP1. A DOS windows, using 'dir /on' gives this
59A.txt
-59-A.txt
5-9A.txt
59-A.txt
whereas the 'sort' command gives this:
-59-A.txt
5-9A.txt
59-A.txt
59A.txt
which is correct. So it seems that 'sort' does its own sorting and 'dir' is using Windows.
I thought each locale defines own rules about this. For example 'a' > 'A' normally but in some locales 'a' < 'A'.
It is also not clear to me why is TStringList not working. Since if the same CompareStr is used for sorting and for lookup, the rules are the same and thus the search should also work.
Weird ..
Binary lookup requires that there is a predefined order, so that:
if A < B < C, we can assume, that A < C
But the bug in Windows means that CompareStr does not comply with this requirement. Therefore, the default string comparison algorithm in TStringList, which uses the Windows API, is not suitable for .IndexOf() or .Find() on sorted lists.
Adding the SORT_STRINGSORT flag to the CompareString() call works for this case but might break other cases.
if CompareStr is used for both sorting and IndexOf and it is stable (aka produces the same ordering). It should not matter to the binary search.
I've looked into TStringList and those use the same function for comparison. So the only possible explanations is that CompareStr in these cases is not stable which may mean a system bug indeed.
The discovery of the undesirable tstringlist sorting/searching behavior was made by programmers at Semaphore Corporation (semaphorecorp.com) as a result of working with US Postal Service data files. While building a tstringlist of over 160,000 maximum and minimum apartment numbers in the country, it was found that apartment 59-A appeared to be a special case: it failed to be located by .IndexOf(). Further investigation revealed the strange 59A < 59-A < -59-A < 5-9A < 59-A relation. (The USPS data file uses a leading "-" to indicate a leading zero, so -59-A stands for apartment 059-A.) It turned out 59-A wasn't really special, it just happened that the combination and order of numbers that were input made 59-A the entry that happened to reveal the weakness of CompareString(). If the apartment numbers had been loaded in a different order, a different apartment might have been unlocatable, or they all might even have ended up findable.
The Windows documentation warns that CompareString()'s default behavior is a "word sort" where hyphens and apostrophes are treated special, and that non-identical strings can be "equal" for collation. Unfortunately the documentation doesn't reveal you can end up with circular relations like 59A < 59-A < -59-A < 5-9A < 59-A.
Overall, it would probably have been better for tstringlist's implementation to use the SORT_STRINGSORT option when it (and other Ansi... routines) eventually call CompareString().
Note: This isn't limited to D2009.
TStringList.IndexOf()has used TStringList.Find on sorted lists for a long time (D6 at least, and if my memory is correct before that)
This long ago known problem for RUSSIAN code poage. In ansi version RTL I happened to to write the hook of winapi functions CompareStringA, lstrcmpA, lstrcmpiA. The table of symbols for comparisons I took from ORACLE. Bad that no possibility to adjust this in RTL.
Long time issue indeed.
We've been patching the RTL here for years (redirecting CompareText to a custom implementation) to avoid that problem and its variants .
Another example is the Finnish (Finland) locale (tested under Windows 2000, probably also true for XP), which somehow decides that sorting the strings SVK and SWE will give : SWE followed by SVK. Olivier Beltrami
'SWE'<'SVK' is also true for Danish locale, but that isn't a problem.
I will be looking into this. Do you happen to have the source code for the testcase or a QC bug report?
Thanks!
Comparison uses weight a symbol.
http://unicode.org/reports/tr10/
Any one of parameter of the functions CompareString does not switch off use all of weight symbols.
Add Brazilian Portuguese locale to the list. Every time I sort the Dag's list, I get a different result.
Damned Windows.
Post a Comment