Tuesday 25 March 2008

Multithreading in Java 7 - oh my god

I just saw this one about the new features in Java 7:


First, the MergeSort example doesn't seem to compile. Correct me, if I'm wrong, I didn't try it. Secondly, they use MergeSort as an example of how to exploit multiple CPUs for sorting. Java 7 has the nice feature, that it can now decide at runtime, how many threads should be used to solve a particular problem (see the coInvoke part).

However, there is this tricky constant, SEQUENTIAL_THRESHOLD, which is used to decide whether to enforce sequential processing or not. How do you set this value? Well, you set it at design time, even though the example was meant to show how Java adapts at runtime...

The next thing is that the whole array is passed as parameter. No matter what programming language you use, this is a bad design. If Java doesn't copy the memory, you may have 2 CPUs looking at the same RAM area. If Java has a runtime optimization that detects that 2 CPUs are looking at the same area, and decides to copy the data, it will copy too much data...

I'm not sure this example would perform better on a 4-CPU machine than on a single-CPU machine with the same CPUs...

The basic problem in all this is, that it is extremely hard to find real world examples of parallelization of algorithms that can be optimized to any kind of parallel hardware. Good multithreading must be done on a functionality level, not on the algorithm level.

Also, every time we add multithreading to code, we make it more complex. In other words, it comes at a cost. I predict that some of the future performance gains don't come from making algorithms more threaded, but from changing data structures, reducing memory footprint and simple optimizations. As the price of more performance increases, efforts will be spent where most speed can be gained at the lowest price.

Just imagine how fast Commodore 64 basic would run on a modern CPU... and how slow Vista is.


Hamlet D'Arcy said...

Why must SEQUENTIAL_THRESHOLD by set at design time? When I read the code I assumed it was a value similar to thread pool size in the Executor services... it /could/ be hardcoded, but could just as easily be a run time setting (Spring configuration, as we do) or possibly even calculated at run time using some sort of adaptive approach.

Thanks for looking at this stuff.

Lars D said...

SEQUENTIAL_THRESHOLD is an algorithm-specific parameter. As you suggest, it could be configured for that specific system, but if all algorithms on this low level would have their own settings in the application's preferences setting, there would be quite a number of things to tweak... then again, you could auto-tweak it by measuring the performance of difference values at runtime, but then Java 7 is not a big help.

Anonymous said...

I'm not sure this example would perform better on a 4-CPU machine than on a single-CPU machine with the same CPUs...

Did you read the whole article? The author posted the data for his performance tests, they shown that it performs significantly better when using multiple threads, than when using one thread... Here is the conclusion:

While the results are fairly noisy (biased by a number of factors, including GC activity), you can see that not only are the best results achieved with a pool size equal to the number of cores available (which you would expect, given that the tasks are entirely compute-bound), but that also we've achieved a speedup of 2-3x with four cores compared to one, showing that it is possible to get reasonable parallelism using high-level, portable mechanisms without tuning.

Lars D said...

My comment was about the MergeSort example, to which the article doesn't deliver any benchmarks.

Anonymous said...

While i agree and the "most speed can be gained at the lowest price"

I dislike the Comparison between Vista and Commodore 64 basic.

Features do come at a certain cost. And one of Microsofts biggest costs is legacy/compatibility.

Commodore 64 basic maybe feautre rich and sufficent for its time but for sure didn't support that much legacy Software as Vista have to.

And there we are again at your starting argument:
"most speed can be gained at the lowest price"

If you have so much functionality already coded whats more efficent to create legacy Interfaces to sputtport them? Or Code them new to meet new Design Requirementts...

Lars D said...

When the software was written for Commodore 64, performance was very expensive, so a lot of work was spent at making it possible in small amounts of RAM and ROM. If you look at the assembler code in the ROM, they even saved bytes by making the same code do different things, depending on how it was called.

These kinds of optimizations are far too expensive today, except when creating embedded software or small modules that just have to be extremely fast. That's why most of software today is slow and fast CPUs are necessary to make them work.

However, because CPU speed growth slows, we start to search for performance optimization techniques, again. One technique is to add more CPUs and do multithreading. But the old techniques that were in use 20 years ago, also deliver significant performance increases.

I recently improved the speed of a very complex database algorithm from 60 minutes to 50 milliseconds. I hadn't used most of the techniques since the 1980s, but they surely worked out great.

Anonymous said...

My comment was about the MergeSort example, to which the article doesn't deliver any benchmarks.

I've benchmarked the MergeSort algo in the article, and it does perform better on multiple CPUs than one a single one. I changed a detail from the algo to run my test though, I used a Integer array instead of a primitive array, since Java's Arrays.sort method used for the sequential sort use QuickSort for primitives, but MergeSort for Object arrays.

I have 2 Dual Core CPUs (4 Cores) on my Mac-Pro. To sort an array of 1,000,000 random numbers, with a sequential threshold of 10,000 elements, it was roughly twice as fast with 4 threads, than it was for a single thread.

Lars D said...

I have written a highly optimized Mergesort for Delphi, which is 6-20 times faster than the standard Delphi Quicksort on typical data amounts, sorting the same kind of data. It supports multithreading nicely, and is faster on my Core 2 machine when multithreading is enabled. However, I have disabled the multithreading part of it, because multiple threads may make it significantly slower on some hardware.

Multithreading may make it 1.8 times faster on my computer, but if it makes it 5 times slower on another multi-CPU computer, it doesn't make sense to enable it.

Anyway, mergesort is so efficient, that the time of getting the data probably overshadows the time of sorting it. Even when my mergesort runs 20 times faster than Quicksort, the overall application performance doesn't change, except if the application is very poorly written.

Ciprian Khlud said...

I kinda disagree on comparison of Commodore with Vista too. I agree that most of developments in speed can be gained today by overcome the low-end slowdowns. But on the other side this is why today a lot of software runs better on Java in general than let's say C++. Those low level optimizations do not get much from the high level compilers compared with Java/.NET case. Today's .NET implementation runs in the same performance envelope with Delphi class in most cases, mostly on Generics/Linq code.
My point was that simply Commodore does too few today to compare with Vista. As you will apply the same algorithm, with the same inputs in a .NET or C++ implementations you will get better performance. I do have screen resolution of 2 megapixels that do not compare with a 0.1 MPixels, 3D accelerated (if you enable Aero) that have smooth requirements. Also, most of UI is double-buffered instead of direct screen access.