Thursday, 8 October 2009

Parallel programming requires multiple techniques, not just one

There seems to be a huge search out there for the holy grail of parallelism. Some want functional programming to be the solution, others look elsewhere. However, the scale of the problem is often ignored: parallelism is introduced on a huge number of levels, each with different solutions:

On the bit level, we can handle multiple bits at the same time. A CPU can handle 8, 16, 32 or 64 bits in one step. The more bits that are handled, the more parallelism we have. However, you cannot just use 1024-bit arithmetic and get more speed; there's a limit.
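To make the bit level concrete, here is a minimal sketch of four 8-bit additions done with one set of 32-bit word operations (often called SWAR, "SIMD within a register"). The helper names `pack`, `unpack` and `packed_add` are made up for this illustration, and Python integers are not fixed-width, so this only shows the idea, not real hardware behaviour.

```python
LANE_HIGH = 0x80808080  # top bit of each 8-bit lane
LANE_LOW = 0x7F7F7F7F   # low 7 bits of each 8-bit lane


def pack(lanes):
    """Pack four values (0..255) into one 32-bit word, lane 0 in the lowest byte."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add(a, b):
    """Add four 8-bit lanes at once; each lane wraps around modulo 256."""
    # Adding only the low 7 bits of every lane means a carry can reach
    # bit 7 of its own lane but can never spill into the next lane.
    low = (a & LANE_LOW) + (b & LANE_LOW)
    # Bit 7 of each lane is a7 XOR b7 XOR (carry out of bit 6).
    return low ^ ((a ^ b) & LANE_HIGH)


if __name__ == "__main__":
    x = pack([1, 2, 3, 4])
    y = pack([10, 20, 30, 40])
    print(unpack(packed_add(x, y)))  # [11, 22, 33, 44]
```

The same trick with a 64-bit word would handle eight lanes per step, which is the sense in which wider words give more parallelism.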

On the instruction processing level, pipelines make it possible to start a new instruction before the previous one has finished, so the time between two completed instructions is shorter than the time it takes to execute one full instruction. The CPU simply divides instructions into multiple parts and executes parts of different instructions in parallel, so that fewer parts of the CPU sit idle. However, this obviously has a limit - and I guess most readers know the tradeoff between pipeline depth and speed in games.
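The pipeline limit can be sketched with a textbook-style timing model. Everything here is an assumption made for illustration: the function name `pipelined_time`, the fixed per-stage latch overhead, and the idea that every mispredicted branch throws away `stages - 1` cycles. Real CPUs are far more complicated.

```python
def pipelined_time(n, stages, work=10.0, latch=0.5, flushes=0):
    """Rough time to run n instructions on an idealized pipeline.

    work    -- time one instruction needs in total (hypothetical unit)
    latch   -- fixed overhead added to every stage (hypothetical)
    flushes -- number of pipeline flushes, e.g. mispredicted branches
    """
    cycle = work / stages + latch      # deeper pipeline, shorter cycle
    fill = stages                      # cycles until the first result
    steady = n - 1                     # then one result per cycle
    penalty = flushes * (stages - 1)   # a flush costs more when deeper
    return (fill + steady + penalty) * cycle


if __name__ == "__main__":
    for depth in (1, 5, 20, 40):
        print(depth, pipelined_time(1000, depth, flushes=50))
```

With these made-up numbers, going from 5 to 20 stages helps, but 40 stages is slower again because each flush wastes more cycles and the latch overhead stops shrinking.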

On the CPU level, we can have multiple cores, each with its own caches etc. This makes it possible to execute multiple threads at the same time, although the cores usually access the same main RAM, and this limits parallelism... don't expect much additional performance after 8-16 cores on the same main RAM.
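Why shared RAM caps the speedup can be shown with a very crude model: each core wants a certain amount of memory bandwidth, and the RAM can only deliver so much in total. The function name and the numbers below are hypothetical, and the model ignores caches entirely, so treat it as a sketch of the argument rather than a prediction.

```python
def shared_ram_speedup(cores, ram_bandwidth=32.0, per_core_demand=4.0):
    """Upper bound on speedup when all cores share one main RAM.

    ram_bandwidth   -- total bandwidth of the RAM (hypothetical, GB/s)
    per_core_demand -- bandwidth one busy core needs (hypothetical, GB/s)
    """
    # Beyond this many cores, the RAM is saturated and extra cores just wait.
    bandwidth_cap = ram_bandwidth / per_core_demand
    return min(float(cores), bandwidth_cap)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8, 16, 32):
        print(cores, shared_ram_speedup(cores))
```

In this example the curve is flat from 8 cores onwards; a workload that stays in cache would have a much lower `per_core_demand` and scale further.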

On the machine level, we can use NUMA architectures. Each CPU has its own RAM, but can access the other CPUs' RAM with reduced performance. If we want massive parallelism, we have to give up the idea that all CPUs can access the same RAM at the same speed. There is a performance hit when the CPUs need to exchange lots of data, so this cannot improve the speed of everything.

On the network level, we can connect CPUs that cannot look into each other's RAM. This can be a worldwide network, but it introduces even more performance hits when exchanging data.
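The cost of exchanging data can be estimated in the same back-of-the-envelope way. This sketch assumes the work splits perfectly across nodes, that each node sends a fixed number of messages, and that communication does not overlap with computation; the function name and all parameters are hypothetical.

```python
def distributed_time(work_s, nodes, messages, msg_bytes, latency_s, bytes_per_s):
    """Estimated wall-clock time for a job spread over several machines.

    work_s      -- total compute time on a single machine, in seconds
    messages    -- messages each node has to send during the job
    latency_s   -- fixed delay per message, in seconds
    bytes_per_s -- network bandwidth available to each node
    """
    compute = work_s / nodes
    per_message = latency_s + msg_bytes / bytes_per_s
    return compute + messages * per_message


if __name__ == "__main__":
    for nodes in (1, 10, 100):
        t = distributed_time(100.0, nodes, 1000, 1_000_000, 0.001, 100_000_000)
        print(nodes, round(t, 3))
```

With these made-up numbers, the communication part stays at about 11 seconds no matter how many nodes are added, so 100 nodes end up only around 8 times faster than the 100 seconds a single machine would need without any network traffic.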

The main focus right now seems to be on the "2-16 core on one main RAM" level. This is not fully solved using functional programming or similar techniques. The NUMA level is completely out of focus because we don't run common operating systems that allow a multithreaded application to be distributed across several CPUs without a common main RAM.

So, when searching for a good solution to the multi-core level, always remember the other levels. It's the combination of all levels that decides the final performance.
