Compas Pascal: March 2009

Thursday 26 March 2009

Wikipedia's effect on software marketing

Recently I saw a presentation from a very skilled Microsoft guy about Cloud computing. He was really good at presenting the message, telling about the cloud, about Microsoft products etc., but his terminology definitions definitely differed from mine.

Immediately, I looked up cloud computing on wikipedia, which corresponded very well with my own perception of the concept. Several years ago, I was in charge of a hosting center with a cloud computing system. It was basically just a hosting system, where multiple customers shared a PC, but it was able to move customers from one computer to another, when more power was necessary. It was nowhere as smart as the systems we have today, but to the customer, it was a dynamically scalable resource provided as a service over the internet. Some may say that cloud computing requires the ability to scale above what one server can deliver, but seriously, that's just a service specification, and for many users it is irrelevant.

If you look into existing corporate data centers, you can also see other kinds of "Cloud systems" - for instance, big VMware servers running a large number of virtual servers. To the departments, that buy these servers, and access them via TCP/IP, it is a style of computing that is dynamically scalable, virtualized and provided as a service over the internet protocol. The users/customers don't need to have knowledge of, expertise in, or control over the technology infrastructure in this system, that supports them. The VMware systems can move a virtual server from one physical server to another, moving the IP address etc. without interrupting online services and without closing TCP connections.

Does the term cloud computing describe the VMware system? Maybe. Or maybe not. But Google Engine or Windows Azure is not a revolution - to some people it's not even new stuff. 10 years ago, Microsoft may have hijacked the term and changed it to its own favor. Today, they can bend it, but not as much as before. Online, updated definitions of terminology, like wikipedia, make this a lot more difficult. I am sure that the definition of cloud computing will change in the near future, to specifically include systems in your own data center, but the new thing is, that the definition will be changed by nerds, using a common reference in wikipedia. It will not be changed by a single company's marketing department.

Sunday 22 March 2009

Use RAM disk for many small temporary files

Many systems use small temporary files to exchange information between two applications. There can be many reasons to do so: To support network file shares, to have a queue system that is not tied to any kind of special software, or simply to keep specifications simple. There are many well known examples of this, including mail systems like sendmail and postfix.

The problem is, that most Windows computers use the NTFS file system, which supports journaling. This means that every file that is created, actually activates the physical harddisk. On a system with a load, this can cause serious latency, which slows down any system. Unfortunately, it is not possible to turn off journaling for a single directory.

The solution? Install a RAM disk. It may take a part of your system's RAM, but it's surely extremely fast at creating and deleting files. You can get a RAM disk here. If you want to see performance numbers, see this page (use Google Translate for English version).

Thursday 12 March 2009

Mark D. Hill on Amdahl's law in the Multicore Era

This is a really cool video if you're interested in multi-core CPU architectures, performance, parallel programming or just want to know what Google is doing these days.

Wednesday 11 March 2009

UTF-8 automatic detection

If you have ever worked with an environment that mixed utf-8 and the 8-bit default character set in Windows, you may have run into the desire to autodetect utf-8 text. This is actually very easy, because there are a lot of illegal byte sequences in utf-8, which usually appear in other character sets.

For instance, a Danish word like "Øl" is encoded $D8 $6C using the Danish local character set. However, $D8 (binary 11011000) indicates that it is the start of a 2-byte sequence, where the next byte is in the range $80-$BF, which it is not. In other words, even this tiny 2-byte text can be clearly identified as not being utf-8.

Originally, the main method to autodetect utf-8 is to see if the byte sequence conforms to the utf-8 method of indicating the number of bytes in a character:

* The range $80-$BF is used for bytes which are not the first in a character
* The range $C0-$DF is used for 2-byte characters
* The range $E0-$EF is used for 3-byte characters
* The range $F0-$F7 is used for 4-byte characters
* The range $F8-$FB is used for 5-byte characters
* The range $FC-$FD is used for 6-byte characters
etc.

However, there are more mechanisms that you can use:

* 5-byte and 6-byte characters are not used, even though they would be technically possible. If you experience a valid 5-byte or 6-byte combination, which is usually unlikely, you can and may detect it as being an invalid sequence.
* It is incorrect to use more bytes than necessary. For instance, if you want to encode the character 'A' (codepoint 65=$41), it is ok to encode it using 1 byte ($41) but not ok to use 2 bytes ($C0 $41).
* If your application knows that some unicode values cannot be generated by the creator of the text, you can make an application-specific exclusion of these values, too.

One of the things that makes this autodetection so beautiful, is that it almost works with 100% accuracy. In Denmark, we use the Windows-1252, which is related to Ansi, ISO-8859-1 and ISO-8859-15:

* Byte values in the $00-$7F range can be detected as either ansi or utf-8, and it doesn't matter, because these two character encoding standards are identical for these values.
* The $80-$BF range contains a lot of strange symbols not used inside words, and the $C0-$F6 range contains almost only letters. In other words, in order to have an ansi text with a valid non-ascii utf-8 byte sequence, you would need to have a strange letter followed by the right number of symbols.
* $C0-$DF range: The most likely would be a 2-byte sequence, that starts at the end of an uppercase word with an Æ or an Ø, followed by a sign like "™", something like "TRÆ™". The last two bytes would be encoded $C6 $99 in ANSI, which is a valid utf-8 combination with the unicode value $0199. However, this is a "Latin small letter k with hook", which in most Danish applications is not a valid character. This way, this text can be rejected as being utf-8 with great certainty.
* $E0-$F7 range: Here it gets very unlikely to get the right utf-8 byte sequences, but even if it happens, the encoded value would probably end up being regarded as illegal in the specific application. Remember, many applications only accept text, that can be converted to the local 8-bit character set, because it is integrated with other systems or needs to be able to save all files in 8-bit character set text files or databases.

Tuesday 10 March 2009

Avoid deadlocks in Firebird for concurrent updates

This is a small tip that can be used when you want to have separate processes to write to the same records in the database, without deadlocks. Example:

b:=StartTransaction;
UpdateDataInTable (b);
Commit (b);

If two processes try to do this at the same time, the second process will detect a possible deadlock on the update, and will wait until the first process completes its transaction. When the first transaction commits, the second raises a deadlock exception. Fortunately, there is a simple solution:

a:=StartTransaction;
UpdateDummyData (a);
b:=StartTransaction;
UpdateDataInTable (b);
Commit (b);
Rollback (a);

If two processes do this at the same time, the a transaction will make the second process wait until the first one completes. Because a is rolled back, this is purely a wait, and will not throw any exception. This way, the two b transactions will not be active at the same time.

Sunday 8 March 2009

Latency is increasing

Once internet connection latencies were above 100ms. ISDN brought it down (for me) to about 30ms, and ADSL got it below 10ms. Now, ADSL2 is becoming more widely deployed, introducing 20ms latency, and more and more homes are being equipped with 3G/wifi gateways, eliminating the copper wire, and bringing latency above 100ms again. I guess it is obvious, that this seriously impacts how to create applications.

Even worse, I have seen 10Gigabit/sec WAN connections with 10-20ms latency, which replace local <1ms ethernet networks when datacenters are consolidated. This seriously impacts the number of network requests your application can do.

This also affects user experience in a negative way - so why is latency increasing? My guess is, that new technologies and increased demand for skilled IT workers make low latency have a cost. The solution is to specify latency requirements, and to put numbers on the costs of high latency.

Saturday 7 March 2009

The coming bloat explosion

I see a future of bloat:

* Cloud systems and virtualization on very powerful servers can be exploited to shorten the time to market, at the cost of bloat.

* Windows 7 may finally make Windows XP users upgrade, many to 64-bit machines, enabling a significant growth in amount of RAM in an average PC. This will make it more acceptable that software uses more RAM.

During the last 28 years, my experience has been, that user interfaces have had similar response time all the way. Sometimes a bit slower, sometimes a bit faster. I guess the amount of bloat always adapts itself, to a compromise between user requirements and development costs. There are currently plenty of mechanisms that favor increased bloat:

* Many multicore/multitasking techniques involve more layers, more complexity, more redundant/copied/cached data, more bloat.

* The world of programming gets more complex, and young programmers increasingly use very complex building blocks without fully realizing how they work.

* One of the most common ways to exploit multi-core CPUs, is to do work in a background thread.

Basically, there will be more bloat, because it pays. However, it also means that the amount of optimization, that can be done to code, will increase. One of the few mechanisms to reduce bloat is battery power, because increased battery lifetime is a sales parameter that justifies increased development costs. The problem here is, how do you create a mobile platform, where the 3rd party applications use less battery? Applications like JoikuSpot can quickly drain a good battery. It seems that Apple has done a good job with the iPhone, where you can use a large number of applications easily without losing all your battery power. However, Apple's methods also have drawbacks, as described in this post.

I wouldn't wonder if we will see power consumption measurement for applications built into mobile platforms in the future. Imagine a message like "This application has used 10% of your battery power during the last 15 minutes. If it continues to run, your phone will be out of power in about 60 minutes. Do you want to terminate this application?"

Thursday 5 March 2009

Environment-adapted software development

Normally, when you create software, there are 4 parameters that you can specify:

* Functionality
* Price
* Deadline
* Quality

In a competitive market, the customer can only specify three of these parameters. If the functionality is defined, the deadline is defined, the quality is defined, then the price can go up sharply, and I guess everybody knows examples of this.

As Microsoft themselves described once, the trick is to be flexible, the official advice was to cut functionality if the deadline approaches more quickly than you expected. However, you get much better planning if you start by realizing that you need flexibility, before signing a contract. Maybe you even have a fixed budget and just want to maximize the value of the product?

One of the custom projects, that we did recently for a customer, was to create a piece of software where the specs were a bit unclear, but the purpose was clear. Quality was well defined, deadline was defined, too, budget was defined, but functionality was not. The result was an innovative process: We created new ideas, and the end result was better than expected.