I wrote the text below on another blog about parallelism, 4 months ago. It is a hypothetical example, but based on a true story, with a very good point. I guess we all know the feeling of clicking the start button in Windows after booting, and nothing happens for several seconds...
"You want data to be sorted. You may start out with a standard sort algorithm. Suddenly you find, that it is slow. You find out, that you can increase the speed 5 times by replacing quicksort with mergesort. Then, you find out, that you can increase the speed 10 times more by using an implementation, that knows how to use the CPU cache well. Then, you realize, that Mergesort can use more CPUs, so you use more CPUs. Suddenly, your application is 100 times slower, because the writes to memory from one CPU is flushing the cache from the other CPU. This didn't happen on your development system, because the dual-CPU system was made in a different way... so you realize that you need to make the parallel programming differently.
Finally, you find that you need to separate the lists, that you merge in your mergesort algorithm, better. After a while, your algorithm works well on your development system and on the target system.
Finally satisfied, you put the parallel mergesort algorithm into all places where it makes sense. After a short while, the test department comes back, and reports that the application runs slower when your multithreaded sort algorithm is used. The reason is, that it now uses multiple CPU caches, leaving less cache memory for other parts of the application, slowing down these parts more, than the speed benefit of using multiple CPUs for sorting.
You are finally asked to remove the parallel processing from your mergesort in order to increase the speed of the system."