Monday 2 June 2008

Multi-core CPUs are the result of many years of parallelization

Many blogs are currently discussing whether multi-core CPUs are really necessary, or whether other approaches would be possible.

They seem to forget that parallelization has been going on for many years. It's not just the separation of the GPU from the CPU, and the math coprocessor: we also got DMA (offloading memory transfers from the CPU so that it can do other things), more bits per operation (4 bit, 8 bit, 16 bit, 32 bit, 64 bit), hyperthreading, parallelized access to RAM, CPU instruction pipelines with instruction reordering to keep all parts of the CPU working at the same time, etc. Even hard disks have become parallelized, featuring tagged command queueing that makes parallel operations faster. And every time you parallelize something, it usually means more transistors or more latency. Latency kills performance, which means that the solutions that get implemented are usually performance trade-offs.

It is many years since an integer multiplication took 76 clock cycles. Yes, it did. I don't know exactly how fast CPUs are today, but my guess is that it doesn't take more than 1 clock cycle now. A floating point division was at one time very fast, except that you needed to move the numbers to the math coprocessor before it could execute the division: increased speed, but increased latency.

When you compare performance between 1998 and 2008, you will notice that parallelizable operations like graphics, sound, and processing large amounts of data have improved a lot in speed. If the GPU can take work off the CPU, the speed increase is huge. However, some things have not improved as much. It still takes about 10 ms for a hard disk to move the head, and if you have a 2008 dual-CPU machine where both CPUs write to the same RAM area, performance is usually slower than on a 1998 single-CPU machine.

Most of the "easy" optimizations, those that did not require cooperation from programmers, are now fully exploited, and we need the cooperation of programmers to reach the next levels of performance. There are three options: not exploiting parallelism, doing it yourself, or using a platform that delivers parallelism to you without great effort. Java is trying to do its part, and so will many other frameworks. But big improvements require programmers who understand parallelism and can create it.
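As a rough illustration of the first and third options, here is a minimal Python sketch. Python is picked only for brevity, and `count_primes`, `run_sequential` and `run_with_executor` are made-up names for this example, not part of any framework mentioned above. The standard `concurrent.futures` module plays the role of the "platform". One assumption to be aware of: in CPython, threads generally do not speed up CPU-bound work because of the global interpreter lock, so for real gains you would probably pass `ProcessPoolExecutor` as `executor_cls`.

```python
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound toy task (hypothetical): count primes below `limit`."""
    count = 0
    for n in range(2, limit):
        # Trial division by every candidate up to sqrt(n).
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_sequential(limits):
    # Option 1: no parallelism, one task after another.
    return [count_primes(limit) for limit in limits]


def run_with_executor(limits, executor_cls=ThreadPoolExecutor, workers=4):
    # Option 3: hand the tasks to a library-provided pool and let it
    # decide how to schedule them on the available workers.
    with executor_cls(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(count_primes, limits))
```

Both functions return the same results for the same input; the only thing that changes is who schedules the work. The second option, doing it yourself, would mean creating and synchronizing the threads or processes by hand.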

Is this significantly different from 1998? No. Good programmers have always created faster applications than less good programmers.

2 comments:

Lars Fosdal said...

I think the "rediscovery" of multi-CPU on the desktop happened because chip makers had a hard time scaling up single-CPU performance, and with desktop operating systems evolving into multitasking / multithreading systems, it was a logical step to provide multiple CPUs to split these tasks across.

Add the hypervisors and virtualized environments, and we are back to the old mainframes - but now on our desktop.

Things seem to move in circles, don't they...

As you say, the next step is improving language support for writing parallel software, and educating ourselves in how to efficiently solve our software challenges in an environment where we have n CPUs available to do the work.

Anonymous said...

Many-core CPUs come for two reasons: we're getting close to some physical (thus unbreakable) limits, and we now want CPUs that do not consume too much energy. The thing is: energy consumption increases with CPU clock frequency, but not in a linear way. Basically, if a CPU with a clock frequency of 100 units consumes 100 units of energy, the same CPU with a clock frequency of 115 units consumes 170 units of energy. That is 70% more energy, but it only goes 15% faster.

How to significantly increase performance without making energy consumption explode? Intel's answer is: duplicate the CPU cores. Of course, if a 100-unit-clock CPU consumes 100, two of these CPUs consume 200. But the overclocking rule above also applies to underclocking: if a 100-unit-clock CPU consumes 100, the same CPU at an 87-unit clock only consumes about 59 (1/1.15≈0.87 and 1/1.70≈0.59). And two 87-unit-clock CPUs only consume about 118. So by underclocking a bit and doubling the number of cores, you can get an up-to-74% faster CPU demanding only about 18% more energy.
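The arithmetic can be checked with a few lines of Python. This is only a sketch of the rule of thumb given in this comment (15% more clock costs 70% more energy, and the same ratio in reverse when underclocking), not a real power model, and `underclocked_multicore` is a made-up name. Unrounded, two underclocked cores come out at roughly 174 units of combined clock for roughly 118 units of energy.

```python
CLOCK_STEP = 1.15   # assumed rule of thumb: one step is 15% more clock...
ENERGY_STEP = 1.70  # ...and costs 70% more energy


def underclocked_multicore(cores, base_clock=100.0, base_energy=100.0):
    """Apply one underclocking step to each core and total the results.

    Returns (clock_per_core, combined_clock, combined_energy).
    The combined clock is an upper bound on speed: it assumes the
    workload parallelizes perfectly across all cores.
    """
    clock_per_core = base_clock / CLOCK_STEP
    energy_per_core = base_energy / ENERGY_STEP
    return (clock_per_core, cores * clock_per_core, cores * energy_per_core)


clock, combined_clock, combined_energy = underclocked_multicore(cores=2)
print(f"clock per core:  {clock:.1f}")
print(f"combined clock:  {combined_clock:.1f}")
print(f"combined energy: {combined_energy:.1f}")
```

The "up to" matters: the combined clock only translates into speed if the program can actually keep both cores busy.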

At Intel they know a lot about parallelization. They only forgot that most programmers do not really know how to efficiently and safely parallelize their programs (simply because no one taught them to!). And the most used programming languages are rather bad at expressing parallelism (but this might be changing rather quickly!). The bad news is that this push for ever-lower energy consumption is probably unavoidable in the (even near) future, so programmers will have but one choice...