It is a common misunderstanding that you can make a process optimal by optimizing its subprocesses.
A good analogy is trying to find the highest hill by walking around in terrain with 100-metre visibility. If you always walk upwards, you may end up on a small local hill. If you accept walking downwards for a while, you may find a bigger hill. There are many good mathematical books on this topic.
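As a toy illustration of getting stuck on a local hill, here is a minimal C sketch of pure hill climbing; the landscape function and all numbers are made up purely for illustration:

    #include <stdio.h>

    /* Toy landscape with two hills: a small one near x=20 and a higher
       one near x=70 (function and numbers are purely illustrative). */
    static double height(int x)
    {
        double a = 40.0 - (x - 20) * (x - 20) * 0.2;   /* local hill  */
        double b = 90.0 - (x - 70) * (x - 70) * 0.1;   /* higher hill */
        return a > b ? a : b;
    }

    /* Pure hill climbing: only ever step upwards. */
    static int climb(int x)
    {
        for (;;) {
            if (height(x + 1) > height(x))      x++;
            else if (height(x - 1) > height(x)) x--;
            else return x;  /* stuck on whatever hill we started near */
        }
    }

    int main(void)
    {
        printf("start at 10 -> top at %d\n", climb(10));  /* small local hill */
        printf("start at 60 -> top at %d\n", climb(60));  /* higher hill */
        return 0;
    }

Where you start determines which hill you end up on, which is exactly why only ever walking upwards is not enough.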
When dealing with programmers, it is a bit more complex. However, every time your organization reorganizes itself, you probably experience exactly this: initially higher costs, but significantly lower costs in the long run. One reason is that information is distributed.
In programming, the environment changes constantly: your organization changes, your customers change, the technology changes. It is like having the hills change height all the time. That makes it important which hill you choose to stand on. Pick one of the highest hills, but not one too far from the other hills. You will find such a hill easily if you change hills frequently.
This is what a continuous improvement process is all about. It adds extra costs, but it ensures long-term competitiveness.
Using Delphi can be described as standing on a hill in a "Delphi" group of hills. Using Java or .NET can be described in the same way; Erlang is a very small group of hills. One of the good things about Delphi is that every time Borland/CodeGear adds a new hill, it is close to the previous hills and not too far from the Microsoft hills. This keeps the cost of continuous improvement down.
Wednesday, 25 June 2008
Mary Poppendieck: 100% Scrum is not 100% Agile
Yesterday, we had a workshop with Mary at the IT University in Copenhagen. In case you don't know Mary, she is one of the most important people in Agile and Lean software development today. The workshop was arranged by BestBrains.
I can strongly recommend following Mary's advice. She has real-life experience with software development and with management, and she makes many good points. It was also nice to hear her confirm that being 100% Scrum is not being 100% Agile. I totally agree with that.
Thursday, 5 June 2008
Use more than 4GB RAM with Win32
If you need to allocate more than 2GB of RAM, some of it will probably be used for caching: undo levels, data retrieved via the network, calculation results etc. The Windows API actually has a feature that may help you use as much RAM as possible, even more RAM than Win32 itself can address, if you are running on a 64-bit operating system. The trick is to use the CreateFile function with the flags FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_DELETE_ON_CLOSE. Adding FILE_FLAG_RANDOM_ACCESS can also be beneficial.
From the Win32 documentation: "Specifying the FILE_ATTRIBUTE_TEMPORARY attribute causes file systems to avoid writing data back to mass storage if sufficient cache memory is available, because an application deletes a temporary file after a handle is closed. In that case, the system can entirely avoid writing the data. Although it doesn't directly control data caching in the same way as the previously mentioned flags, the FILE_ATTRIBUTE_TEMPORARY attribute does tell the system to hold as much as possible in the system cache without writing and therefore may be of concern for certain applications."
FILE_FLAG_DELETE_ON_CLOSE is useful to ensure that the file no longer exists on the hard disk when the application has stopped or has been aborted. The Win32 file APIs support large files, so there is basically no limit to the amount of data that can be stored this way.
Compared to AWE, this method works on all Win32 platforms, does not use more RAM than makes overall sense for the PC's current use, and does not require the application to run with the "Lock Pages in Memory" privilege.
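A minimal C sketch of the technique; the file path, buffer size and lack of error handling are purely illustrative:

    #include <windows.h>

    int main(void)
    {
        /* Open a temporary file that the system keeps in the file system
           cache as long as possible and deletes automatically when the
           last handle is closed. */
        HANDLE h = CreateFileA(
            "C:\\Temp\\bigcache.tmp",          /* illustrative path */
            GENERIC_READ | GENERIC_WRITE,
            0,                                  /* no sharing */
            NULL,
            CREATE_ALWAYS,
            FILE_ATTRIBUTE_TEMPORARY |
            FILE_FLAG_DELETE_ON_CLOSE |
            FILE_FLAG_RANDOM_ACCESS,
            NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* Write data; with enough free RAM the cache keeps it in memory
           and may never touch the disk at all. */
        char buffer[65536] = {0};
        DWORD written, read;
        WriteFile(h, buffer, sizeof(buffer), &written, NULL);

        /* Seek and read it back like any other file. */
        SetFilePointer(h, 0, NULL, FILE_BEGIN);
        ReadFile(h, buffer, sizeof(buffer), &read, NULL);

        CloseHandle(h);  /* file is deleted here because of DELETE_ON_CLOSE */
        return 0;
    }

The application simply sees a file; how much of it actually lives in RAM is left to the operating system's cache manager.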
Monday, 2 June 2008
Multi-core CPUs are the result of many years of parallelization
Many blogs are currently discussing whether multi-core CPUs are really necessary, or whether other approaches would be possible.
They seem to forget that parallelization has been going on for many years. It is not just the separation of the GPU and the math coprocessor from the CPU: we also got DMA (offloading memory transfers from the CPU so that it can do other things in the meantime), more bits per operation (4-bit, 8-bit, 16-bit, 32-bit, 64-bit), hyperthreading, parallel access to RAM, and CPU instruction pipelines with instruction reordering that try to keep all parts of the CPU busy at the same time. Even hard disks have become parallelized, featuring tagged command queueing that makes parallel operations faster. And every time you parallelize something, it usually means more transistors or more latency. Latency kills performance, which means that the solutions that get implemented are usually performance trade-offs.
It is many years ago that an integer multiplication took 76 clock cycles. Yes, it did. I don't know exactly how fast CPUs are today, but my guess is that it takes no more than one clock cycle now. Doing a floating-point division was at one time very fast - except that you first needed to move the numbers to the math coprocessor before it could execute the division: increased speed, but also increased latency.
When you compare performance between 1998 and 2008, you will notice that parallelizable operations like graphics, sound, and processing of huge amounts of data have improved a lot in speed. If the GPU can offload the CPU, the speed increase is huge. However, some things have not improved as much: it still takes around 10 ms for a hard disk to move its head, and if you have a 2008 dual-CPU machine where both CPUs write to the same RAM area, performance is usually slower than on a 1998 single-CPU machine.
Most of the "easy" optimizations, that did not involve cooperation with programmers, are now fully exploited, and now we need the cooperation of programmers to achieve the next levels of performance. There are three options: Not exploiting parallelism, doing it yourself, using a platform that delivers parallelism to you without great efforts. Java is trying to do its part, and so will many other frameworks. But big improvements require programmers that understand parallelism, and can create it.
Is this significantly different from 1998? No. Good programmers have always created faster applications than less good programmers.