With Java and .NET, garbage collection (GC) has won a lot of followers. GC makes it possible to avoid some of the memory leaks that unskilled programmers often introduce, and at the same time it suits today's typical hardware configurations very well: when memory is needed, you don't need to search for a free block; it's just grabbed.
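The "just grabbed" allocation works because a compacting collector keeps free space in one contiguous region, so allocating is little more than advancing a pointer (bump-pointer allocation). Below is a minimal sketch in C. It assumes a single fixed-size arena, a 16-byte alignment and no actual collection. `Arena`, `arena_init` and `arena_alloc` are hypothetical names, not the JVM's or the CLR's real allocator.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_ALIGN 16  /* assumed alignment for every allocation */

/* One contiguous free region, as a compacting collector would leave
   it. The caller supplies a buffer aligned to ARENA_ALIGN. */
typedef struct {
    unsigned char *base;  /* start of the arena */
    size_t size;          /* capacity in bytes */
    size_t top;           /* offset of the first free byte */
} Arena;

void arena_init(Arena *a, void *buffer, size_t size) {
    assert(buffer != NULL || size == 0);
    a->base = buffer;
    a->size = size;
    a->top = 0;
}

/* Bump-pointer allocation: no search for a free block, just round the
   request up to ARENA_ALIGN and advance `top`. Returns NULL when the
   arena is exhausted; a real collector would run a GC cycle here. */
void *arena_alloc(Arena *a, size_t n) {
    size_t rounded = (n + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
    if (rounded < n || rounded > a->size - a->top)
        return NULL;  /* overflow or not enough room left */
    void *p = a->base + a->top;
    a->top += rounded;
    return p;
}
```

Contrast this with a malloc-style free list, where each request may have to walk a list of free blocks to find one that fits.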
However, since the current hardware configuration is doomed, the question is: what will the future of memory management look like? With 128 CPUs, don't expect all of them to access the same RAM location equally fast.
This introduces a new concept: memory pointers that have CPU preferences. Like every other mechanism we have, this one will be tweaked, too: somebody will want to allocate a specific amount of memory with a specific CPU preference, and it will be implemented.
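A pointer with a CPU preference could be modelled as an allocation record that remembers which node (the RAM attached to a CPU) was requested and which node actually supplied the memory. The sketch below is hypothetical throughout: `NodeTable`, `NodeAlloc`, `node_reserve`, `node_release` and the `strict` flag are invented for illustration, and only the bookkeeping is done, with no real memory handed out. It assumes a fixed capacity per node and shows one possible policy for a full node: a strict request fails, while a non-strict one falls back to the node with the most free memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4  /* assumed number of CPUs/nodes */

/* Bytes each CPU's local RAM (its node) can still hand out. */
typedef struct {
    size_t free_bytes[MAX_NODES];
} NodeTable;

/* A "pointer with a CPU preference": the node the caller asked for
   and the node that actually supplied the memory (-1 on failure). */
typedef struct {
    size_t size;
    int preferred;
    int actual;
} NodeAlloc;

/* Reserve `size` bytes on `preferred`. If `strict` is true, fail when
   that node is full; otherwise fall back to the node with the most
   free memory that can satisfy the request. */
NodeAlloc node_reserve(NodeTable *t, size_t size, int preferred, bool strict) {
    NodeAlloc r = { size, preferred, -1 };
    if (preferred < 0 || preferred >= MAX_NODES)
        return r;
    if (t->free_bytes[preferred] >= size) {
        r.actual = preferred;
    } else if (!strict) {
        int best = -1;
        for (int n = 0; n < MAX_NODES; n++)
            if (t->free_bytes[n] >= size &&
                (best < 0 || t->free_bytes[n] > t->free_bytes[best]))
                best = n;
        r.actual = best;
    }
    if (r.actual >= 0)
        t->free_bytes[r.actual] -= size;
    return r;
}

/* Return the bytes to whichever node actually supplied them. */
void node_release(NodeTable *t, NodeAlloc a) {
    assert(a.actual < MAX_NODES);
    if (a.actual >= 0)
        t->free_bytes[a.actual] += a.size;
}
```

Recording both `preferred` and `actual` is what would let a profiler later show where a fallback placed memory somewhere other than where the programmer asked.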
For instance, suppose one CPU (A) holds a lot of encrypted data that must be decrypted before it is written serially to I/O. If another CPU (B) writes to I/O, CPU B will most likely have the I/O buffer in its RAM. To reduce RAM usage, the decrypting CPUs (C) would ideally save their data directly into CPU B's RAM. This can be done in several ways: they can save parts of it in their own RAM and then copy that to CPU B, or they can pipe it directly to CPU B, which then saves it locally.
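The two options can be compared with a single-threaded model in C. Here `scratch` stands in for a decrypting CPU's own RAM and `b_buf` for CPU B's RAM. The XOR "decryption", the even split of the input across workers and all function names are illustrative assumptions, and no real memory placement happens. The point is only the count: staging writes every byte twice, while piping writes it once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder "decryption": XOR with a one-byte key. Illustrative
   only; this is not a real cipher. */
static void decrypt_chunk(const unsigned char *in, unsigned char *out,
                          size_t n, unsigned char key) {
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(in[i] ^ key);
}

/* Size of the slice each of `workers` decrypting CPUs handles. */
static size_t slice_len(size_t len, int workers) {
    assert(workers > 0);
    return (len + (size_t)workers - 1) / (size_t)workers;
}

/* Option 1: each worker decrypts into `scratch` (its own RAM), then
   copies the slice into `b_buf` (CPU B's RAM). `scratch` must hold at
   least one slice. Returns the number of bytes written to memory. */
size_t stage_then_copy(const unsigned char *in, unsigned char *b_buf,
                       size_t len, int workers, unsigned char key,
                       unsigned char *scratch) {
    size_t chunk = slice_len(len, workers);
    size_t writes = 0;
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = len - off < chunk ? len - off : chunk;
        decrypt_chunk(in + off, scratch, n, key); /* write to own RAM */
        memcpy(b_buf + off, scratch, n);          /* write to B's RAM */
        writes += 2 * n;
    }
    return writes;
}

/* Option 2: each worker decrypts its slice straight into `b_buf`,
   i.e. the data is piped to CPU B. Returns bytes written. */
size_t pipe_direct(const unsigned char *in, unsigned char *b_buf,
                   size_t len, int workers, unsigned char key) {
    size_t chunk = slice_len(len, workers);
    size_t writes = 0;
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = len - off < chunk ? len - off : chunk;
        decrypt_chunk(in + off, b_buf + off, n, key);
        writes += n;
    }
    return writes;
}
```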
The piping mechanism is already implemented in hardware in several CPU architectures today: if CPU C accesses the RAM of CPU B, it writes to that RAM through CPU B, transparently to the programmer. To achieve this, the destination RAM must be allocated with a preference for CPU B. If CPU C needs to allocate memory in CPU B's RAM, we have several problems:
1) Who makes sure that we don't allocate too much of CPU B's RAM? And if it happens, what should fail?
2) How does CPU C specify that the RAM should be allocated for CPU B? Using a thread ID? That may require the ability to lock a thread to a CPU.
3) How do we debug and profile this?
4) Will inter-CPU pipes be packetized, and if so, what will the packet size be?
5) Will inter-CPU pipes be a fixed compromise between latency and bandwidth, or will we, as programmers, need to specify parameters for tweaking them?
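Questions 4 and 5 come down to a packet-size knob. The hypothetical pipe below buffers bytes on the sending side and forwards a packet whenever `packet_size` bytes have accumulated. The `PacketPipe` type and the `pp_*` functions are invented for illustration, and the receiver's buffer is assumed to be large enough. Small packets get data to the receiver sooner but need more transfers; large packets need fewer transfers but leave bytes waiting longer unless the writer calls `pp_flush` explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PP_MAX_PACKET 256  /* assumed upper bound on packet size */

/* Hypothetical packetized pipe from one CPU to another. */
typedef struct {
    size_t packet_size;                  /* the latency/bandwidth knob */
    unsigned char pending[PP_MAX_PACKET];
    size_t pending_len;                  /* bytes not yet sent */
    unsigned char *dest;                 /* receiver's RAM (CPU B) */
    size_t dest_len;                     /* bytes delivered so far */
    size_t packets_sent;                 /* one transfer per packet */
} PacketPipe;

void pp_init(PacketPipe *p, size_t packet_size, unsigned char *dest) {
    assert(packet_size > 0 && packet_size <= PP_MAX_PACKET);
    p->packet_size = packet_size;
    p->pending_len = 0;
    p->dest = dest;
    p->dest_len = 0;
    p->packets_sent = 0;
}

/* Deliver whatever is pending as one (possibly short) packet. */
void pp_flush(PacketPipe *p) {
    if (p->pending_len == 0)
        return;
    memcpy(p->dest + p->dest_len, p->pending, p->pending_len);
    p->dest_len += p->pending_len;
    p->pending_len = 0;
    p->packets_sent++;
}

/* Queue bytes; a packet is sent each time the buffer fills up. */
void pp_write(PacketPipe *p, const unsigned char *data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        p->pending[p->pending_len++] = data[i];
        if (p->pending_len == p->packet_size)
            pp_flush(p);
    }
}
```

Writing 100 bytes through an 8-byte pipe takes 12 transfers and leaves 4 bytes waiting; a 64-byte pipe takes a single transfer but leaves 36 bytes waiting until the next flush.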
I am quite sure that there is plenty of research going on in these areas, but from a commercial programmer's point of view, the mission is clear: we need debugging tools and programming language support. It must be very, very easy to specify how RAM is used, who owns it, its CPU preferences, its lifetime, etc. Since more and more RAM is used for caching, we also need support for cache memory allocations that can be profiled and deallocated by the OS. We need to be able to use all the available RAM for caching, cleverly split between all processes.
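One way to picture OS-reclaimable cache memory: processes register their cache blocks, touch them when they use them, and accept that the OS may invalidate any block under memory pressure. The sketch assumes a fixed-size registry and a global least-recently-used policy across all processes as the "clever split". `CacheRegistry`, `cache_register`, `cache_touch` and `cache_reclaim` are hypothetical, and only the accounting is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENTRIES 16  /* assumed registry capacity */

/* One cache block that the OS is allowed to take back. */
typedef struct {
    int owner;          /* owning process id (illustrative) */
    size_t size;
    unsigned last_use;  /* logical clock tick of the last use */
    bool valid;         /* false once reclaimed; owner must rebuild */
} CacheEntry;

typedef struct {
    CacheEntry e[MAX_ENTRIES];
    int count;
    size_t used;        /* bytes currently held as cache */
    unsigned clock;
} CacheRegistry;

/* Register a cache block; returns its handle, or -1 if full. */
int cache_register(CacheRegistry *r, int owner, size_t size) {
    if (r->count == MAX_ENTRIES)
        return -1;
    r->e[r->count] = (CacheEntry){ owner, size, ++r->clock, true };
    r->used += size;
    return r->count++;
}

/* Mark a block as used; returns false if the OS already reclaimed it. */
bool cache_touch(CacheRegistry *r, int h) {
    assert(h >= 0 && h < r->count);
    if (!r->e[h].valid)
        return false;
    r->e[h].last_use = ++r->clock;
    return true;
}

/* Called by the "OS" under memory pressure: reclaim the least recently
   used blocks, regardless of owner, until usage fits `budget`.
   Returns the number of bytes freed. */
size_t cache_reclaim(CacheRegistry *r, size_t budget) {
    size_t freed = 0;
    while (r->used > budget) {
        int victim = -1;
        for (int i = 0; i < r->count; i++)
            if (r->e[i].valid &&
                (victim < 0 || r->e[i].last_use < r->e[victim].last_use))
                victim = i;
        if (victim < 0)
            break;  /* nothing left to reclaim */
        r->e[victim].valid = false;
        r->used -= r->e[victim].size;
        freed += r->e[victim].size;
    }
    return freed;
}
```

Because `cache_touch` reports reclaimed blocks, the owning process learns that it must rebuild the data instead of reading stale memory, and the registry itself is the natural place to hook in profiling.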
We need to put the programmer back in charge of memory management, and it needs to be easy.