Wednesday 26 March 2008

Don't use TStringList for machine-readable text

TStringList is one of the most used classes in Delphi. It is very convenient for storing strings for the user (TMemo.Lines), storing parameters for components (TIBDatabase.Params), objects and many other items.

However, there are several problems with it. It's slow when sorting data (it uses the Win32 API for comparing strings), but the dangerous part is that sorting and indexing are localized. This means that this code fails on my computer, but works on an American PC:

sl := TStringList.Create;
try
  sl.Sorted := True;
  sl.Add('AA');
  sl.Add('AB');
  // Fails with Danish regional settings, where 'AA' sorts after 'AB'
  Assert(sl.Strings[0] = 'AA');
finally
  sl.Free;
end;

The reason is simple. This is the Danish alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ

By tradition, the last letter Å can also be written AA, and you can see how these two spellings are mixed on the homepage of the city of Århus. The correct sort order in Danish is therefore:

Aachen
Aalto
Berlin
Copenhagen
Dresden
Essen
Frederikshavn
Aabenraa
Aalborg
Aarhus

In the first two words, AA means A and then A. In the last three words, AA means Å, which is the last letter in the alphabet. However, Windows doesn't know when AA means Å and when it means A A, so it always assumes that AA means Å, and always puts AA last.
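The difference between a locale-aware and an ordinal comparison can be seen with two functions from SysUtils. This is a minimal sketch, assuming a Delphi console program: CompareStr compares character codes and does not consult the regional settings, while AnsiCompareStr asks Windows, which is the kind of comparison a sorted TStringList relies on. The program name CompareDemo is made up for the illustration, and the outcome of the second test depends on the machine, so no output is given for it.

```pascal
program CompareDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  // CompareStr compares the character codes one by one and ignores
  // the Windows regional settings: 'AA' is always before 'AB'.
  if CompareStr('AA', 'AB') < 0 then
    Writeln('CompareStr: AA sorts before AB')
  else
    Writeln('CompareStr: AA does not sort before AB');

  // AnsiCompareStr asks Windows, so the answer depends on the locale
  // of the machine that runs the program.
  if AnsiCompareStr('AA', 'AB') < 0 then
    Writeln('AnsiCompareStr: AA sorts before AB')
  else
    Writeln('AnsiCompareStr: AA does not sort before AB');
end.
```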

Let's assume that you want to use a TStringList to store codes in a specific order, such as ATC codes. The first codes are:

A01AA01 Sodium fluoride
A01AA02 Sodium monofluorophosphate
A01AA03 Olaflur
A01AA04 Stannous fluoride
A01AA30 Combinations
A01AA51 Sodium fluoride, combinations
A01AB02 Hydrogen peroxide
A01AB03 Chlorhexidine
A01AB04 Amphotericin B
A01AB05 Polynoxylin

This is how the Danish TStringList (and Windows) sort order arranges a selection of ATC codes:

A01AB04 Amphotericin B
A01AC02 Dexamethasone
A01AA30 Combinations
A02AB03 Aluminium phosphate
A02BA05 Niperotidine
A02AA05 Magnesium silicate

If you want to avoid that, then don't use TStringList.
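One possible alternative, sketched here under assumptions, is to keep the codes in a plain dynamic array and do the sorting and searching yourself with CompareStr, so that the regional settings never come into play. The names TCodeArray, SortOrdinal and FindOrdinal are hypothetical and only serve the illustration; the insertion sort is chosen for brevity, not for speed.

```pascal
program OrdinalSortDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TCodeArray = array of string;  // hypothetical container for the codes

// Sorts the array in place with a simple insertion sort.
// CompareStr is used so the order depends only on character codes.
procedure SortOrdinal(var Codes: TCodeArray);
var
  i, j: Integer;
  Tmp: string;
begin
  for i := 1 to High(Codes) do
  begin
    Tmp := Codes[i];
    j := i - 1;
    // Relies on short-circuit boolean evaluation (the Delphi default)
    while (j >= 0) and (CompareStr(Codes[j], Tmp) > 0) do
    begin
      Codes[j + 1] := Codes[j];
      Dec(j);
    end;
    Codes[j + 1] := Tmp;
  end;
end;

// Binary search in an array that was sorted by SortOrdinal.
// Returns the index of Code, or -1 if it is not present.
function FindOrdinal(const Codes: TCodeArray; const Code: string): Integer;
var
  First, Last, Mid, C: Integer;
begin
  Result := -1;
  First := 0;
  Last := High(Codes);
  while First <= Last do
  begin
    Mid := (First + Last) div 2;
    C := CompareStr(Codes[Mid], Code);
    if C = 0 then
    begin
      Result := Mid;
      Exit;
    end
    else if C < 0 then
      First := Mid + 1
    else
      Last := Mid - 1;
  end;
end;

var
  Codes: TCodeArray;
  i: Integer;
begin
  SetLength(Codes, 4);
  Codes[0] := 'A01AB02';
  Codes[1] := 'A01AA30';
  Codes[2] := 'A01AC02';
  Codes[3] := 'A01AA01';
  SortOrdinal(Codes);
  for i := 0 to High(Codes) do
    Writeln(Codes[i]);
  Writeln(FindOrdinal(Codes, 'A01AB02'));
end.
```

With these four codes the program prints A01AA01, A01AA30, A01AB02 and A01AC02 in that order, followed by the index 2, whatever the regional settings are.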

Tuesday 25 March 2008

Multithreading in Java 7 - oh my god

I just saw this one about the new features in Java 7:

http://www.ibm.com/developerworks/java/library/j-jtp03048.html

First, the MergeSort example doesn't seem to compile. Correct me if I'm wrong; I didn't try it. Secondly, they use MergeSort as an example of how to exploit multiple CPUs for sorting. Java 7 has the nice feature that it can now decide at runtime how many threads should be used to solve a particular problem (see the coInvoke part).

However, there is this tricky constant, SEQUENTIAL_THRESHOLD, which is used to decide whether to enforce sequential processing or not. How do you set this value? Well, you set it at design time, even though the example was meant to show how Java adapts at runtime...

The next thing is that the whole array is passed as a parameter. No matter what programming language you use, this is a bad design. If Java doesn't copy the memory, you may have 2 CPUs looking at the same RAM area. If Java has a runtime optimization that detects that 2 CPUs are looking at the same area, and decides to copy the data, it will copy too much data...

I'm not sure this example would perform better on a 4-CPU machine than on a single-CPU machine with the same CPUs...

The basic problem in all this is that it is extremely hard to find real-world examples of algorithms that can be parallelized in a way that is optimal for any kind of parallel hardware. Good multithreading must be done on a functionality level, not on the algorithm level.

Also, every time we add multithreading to code, we make it more complex. In other words, it comes at a cost. I predict that some of the future performance gains won't come from making algorithms more threaded, but from changing data structures, reducing memory footprint and simple optimizations. As the price of more performance increases, efforts will be spent where the most speed can be gained at the lowest price.

Just imagine how fast Commodore 64 BASIC would run on a modern CPU... and how slow Vista is.

Monday 10 March 2008

Never modify source code on weekends

I just released some code last Saturday. What I didn't notice was that TortoiseSVN inserted a localized date into the source code in a $LastChangedDate:$ text, even though I deliberately use it with an English user interface.

In Danish, there are non-ASCII characters in the weekday names for Saturday (lørdag) and Sunday (søndag), so this turned the source code file into a non-ASCII file, which basically broke it for some users. Today I rereleased the files, and because it's Monday (mandag), the problem is gone :-)
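A check like the following could catch this kind of problem before a release. It is only a sketch under assumptions: the program name AsciiCheck and the function FirstNonAsciiOffset are made up for the illustration, and the program simply scans one file, named on the command line, for bytes above 127.

```pascal
program AsciiCheck;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns the zero-based offset of the first byte above 127 in the
// stream, or -1 if the stream contains only 7-bit ASCII.
function FirstNonAsciiOffset(Stream: TStream): Int64;
var
  Buffer: array[0..4095] of Byte;
  BytesRead, i: Integer;
  Base: Int64;
begin
  Result := -1;
  Base := 0;
  Stream.Position := 0;
  repeat
    BytesRead := Stream.Read(Buffer, SizeOf(Buffer));
    for i := 0 to BytesRead - 1 do
      if Buffer[i] > 127 then
      begin
        Result := Base + i;
        Exit;
      end;
    Inc(Base, BytesRead);
  until BytesRead = 0;
end;

var
  Source: TFileStream;
  Offset: Int64;
begin
  if ParamCount < 1 then
  begin
    Writeln('Usage: AsciiCheck <file>');
    Halt(2);
  end;
  Source := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    Offset := FirstNonAsciiOffset(Source);
  finally
    Source.Free;
  end;
  if Offset >= 0 then
  begin
    Writeln('Non-ASCII byte at offset ', Offset);
    Halt(1);
  end;
  Writeln('File is pure ASCII');
end.
```

The non-zero exit code makes it possible to call the program from a release script and stop when a file is not clean.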

I guess that was another lesson on how not to localize.

Saturday 8 March 2008

Delphi - the green choice for the environment

The internet is starting to use a significant part of the world's electrical power, and increasingly complex algorithms are driving our economy. Some blogs have even started to discuss software engineering and global warming.

As software engineers, our choices have an impact on energy usage. Delphi/Win32 still uses 8-bit character sets, which is faster, uses less hardware and is therefore a greener choice than UTF-16 based platforms like Java or .NET.

How important is this? It's not important at all. An improvement in a specific part of a program is usually irrelevant unless you improve it by at least a factor of 10. What about rewriting your software for low-energy devices? Also not a good choice, because the real environmental problem with computers is producing them, so don't make your customers buy new computers and think you saved the world.

If you want to do something for the environment, remove your focus from technology and focus on how to solve end-user problems well. Much energy is wasted elsewhere because of bad software.