Tuesday 27 November 2007

CodeRage impressions

Having participated in a few sessions at CodeGear's online CodeRage conference, I can only say that this is a very good way to run conferences. There are lots of advantages over a traditional conference, and the costs are much lower. You can chat with other attendees during a session and ask questions, you can leave the session without disturbing anyone if it gets too boring, and you can do other work if the topic is easy for you. On the downside, you don't get those extra days in a geographically remote location and you don't get the beers afterwards, but I guess we'll see this evolve a lot in the coming years. Unfortunately, there's also the downside that CodeRage was scheduled for U.S. time, meaning that it's after business hours in Denmark.

However, I can only recommend signing up for and participating in this kind of conference.

Wednesday 21 November 2007

Date and Time in programming

Sometimes I wonder if non-programmers know how complicated time is. Here's a summary of the basics:
  • There are 23, 24 or 25 hours in a day (daylight saving time change days)
  • An hour has 60 minutes.
  • A minute has 60, 61 or 62 seconds (leap seconds).
  • A day has 1380, 1440 or 1500 minutes.
  • A week has 7 days.
  • A day has 82800, 86400, 86401, 86402 or 90000 seconds.
  • A month has 28, 29, 30 or 31 days. In some systems, a month is standardized to be 30 days.
  • A year has 365 or 366 days, but in some systems, it's standardized to be 360 days.
  • A year has 52 or 53 weeks.
  • Even though we have an ISO standard for weeks, end users don't agree on the starting weekday for a week.
  • Some dates don't exist, and for historical dates, the offset between different geographical regions was not a matter of hours, but of days. The Russian October Revolution actually happened in November according to most European calendars of that time, but it was October in Russia.
  • This is all about the Christian calendar. There are other calendars out there...
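As a minimal sketch of the first point in the list, the length of a local calendar day can be measured by converting both midnights to UTC before subtracting. The function name `hours_in_local_day` is made up for this illustration, and the example assumes Python 3.9 or later with time zone data installed.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # needs system tz data (or the tzdata package)


def hours_in_local_day(day: date, zone_name: str) -> float:
    """Elapsed hours between local midnight on `day` and the next local midnight."""
    zone = ZoneInfo(zone_name)
    start = datetime.combine(day, time.min, tzinfo=zone)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=zone)
    # Convert to UTC first: subtracting two datetimes that share a tzinfo
    # compares wall-clock readings and would always report 24 hours.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed.total_seconds() / 3600


# Spring change day, autumn change day and an ordinary day in Denmark.
for day in (date(2007, 3, 25), date(2007, 10, 28), date(2007, 11, 21)):
    print(day, hours_in_local_day(day, "Europe/Copenhagen"))
```

For these three dates the function should report 23, 25 and 24 hours, which is why a constant like "24 hours per day" is only safe when the program never works with local calendar days.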
Then there's local time and UTC time:
  • Local time deviates from UTC time by a number of hours, which can be fractional in some rare cases.
  • For all practical purposes, GMT is the same as UTC, and GMT is not the local time of London (London uses GMT+1 in the summer).
  • UTC time offset is a function that takes the UTC time stamp and the geographical location as parameters, and the UTC time offset is often historically different for two different cities in the same country.
  • A time zone can be specified using a GMT offset, just like time stamps.
  • A time zone can include several regions with different daylight saving rules.
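The idea that the UTC offset is a function of both the instant and the location can be sketched like this. The helper name `utc_offset_hours` is an assumption for this example, and the lookup relies on the tz database shipped with the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # needs system tz data (or the tzdata package)


def utc_offset_hours(instant: datetime, location: str) -> float:
    """UTC offset, in hours, that applies at `instant` in the given tz database location."""
    local = instant.astimezone(ZoneInfo(location))
    return local.utcoffset().total_seconds() / 3600


winter = datetime(2007, 1, 15, 12, 0, tzinfo=timezone.utc)
summer = datetime(2007, 7, 15, 12, 0, tzinfo=timezone.utc)

print(utc_offset_hours(winter, "Europe/Copenhagen"))  # winter offset in Denmark
print(utc_offset_hours(summer, "Europe/Copenhagen"))  # summer offset in Denmark
print(utc_offset_hours(summer, "Europe/London"))      # London is not on GMT in summer
print(utc_offset_hours(summer, "Asia/Kolkata"))       # a fractional offset
```

Neither argument can be dropped: the same location gives different offsets at different instants, and the same instant gives different offsets at different locations.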
This wouldn't be so complicated if we didn't have to make calculations based on this. Timestamps are usually stored these ways:
  • A floating point number indicating the number of days since a specific date at midnight
  • An integer or floating point number indicating the number of seconds since a specific date at midnight
  • Year, Month, Day, Hour, Minutes, Seconds as separate values
Variations of these may occur, for instance, a time stamp may be an integer number of milliseconds or even microseconds, instead of seconds, but it's still the same idea.
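To make the three storage styles concrete, here is a small sketch that expresses one UTC instant in each of them. The function names are invented for this example. The day-count epoch of 30 December 1899 is an assumption borrowed from Delphi's TDateTime; other systems pick other epochs.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumed day-count epoch: 30 December 1899, as used by Delphi's TDateTime.
DAY_COUNT_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)


def to_day_count(instant: datetime) -> float:
    """Floating point number of days since the day-count epoch."""
    return (instant - DAY_COUNT_EPOCH) / timedelta(days=1)


def to_unix_seconds(instant: datetime) -> int:
    """Whole seconds since the Unix epoch (fractions are discarded)."""
    return (instant - UNIX_EPOCH) // timedelta(seconds=1)


def to_fields(instant: datetime) -> tuple:
    """Year, month, day, hour, minute, second as separate values, in UTC."""
    utc = instant.astimezone(timezone.utc)
    return (utc.year, utc.month, utc.day, utc.hour, utc.minute, utc.second)


noon = datetime(2007, 11, 21, 12, 0, 0, tzinfo=timezone.utc)
print(to_day_count(noon))
print(to_unix_seconds(noon))
print(to_fields(noon))
```

All three values describe the same instant, but only the field-based form is directly readable, and only the integer form is free of rounding.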

UTC time offsets are usually specified this way:
  • Number of hours difference between the local time and UTC, at the time of the timestamp
  • Geographic location (like 'Europe/Copenhagen')
  • Time zone (like 'CET')
Daylight saving is usually handled these ways:
  • The internal clock of the PC uses local time, and changes. If you use virtualization or dual-boot, you may risk that it changes twice, giving incorrect time.
  • The internal clock of the PC uses UTC time, and does not change.
  • Daylight saving time is usually handled using locally stored information, which may get outdated, so that the computer actually miscalculates the time by one hour.
Leap seconds are usually handled these ways:
  • They're implemented nicely, and the software needs to know about them.
  • They're not implemented, so the software doesn't need to know about them, but instead, the clock is adjusted, and the software needs to be capable of handling a clock that doesn't move for 1-2 seconds.
Clock precision:
  • Most PCs today have some kind of clock synchronization over the internet, which yields a sub-second precision. However, don't count on your clock to be 1ms precise.
  • PCs often have their clocks adjusted, so you need to make sure that your software can survive a clock that moves backwards.
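One common way to survive clock adjustments when measuring durations is to avoid the wall clock entirely. The sketch below uses Python's `time.monotonic()`, which is documented never to go backwards; the helper name `measure_elapsed_seconds` is made up for this example.

```python
import time


def measure_elapsed_seconds(task) -> float:
    """Run `task` and return how long it took, using a clock that cannot go backwards."""
    start = time.monotonic()
    task()
    return time.monotonic() - start


elapsed = measure_elapsed_seconds(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

A monotonic reading has no meaning as a date, so it only helps for intervals and timeouts. Timestamps that are stored or shown to users still come from the adjustable clock.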
Now, how do we calculate age? If you were born on February 29th 1980, and an election for parliament is held on February 28th 1998, are you allowed to vote? Probably not. What if something has to be done on a day that may not be later than your birthday? Then February 28th would be the last day. So you cannot use a single GetBirthDay(BirthDate,Age) function for these two cases, since the two problems result in two different dates.
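A minimal sketch of the two different answers, assuming the rule is simply "round forward" in one case and "round backward" in the other. Both function names are hypothetical, and the actual legal rule depends on the jurisdiction.

```python
from datetime import date


def birthday_on_or_after(birth: date, age: int) -> date:
    """First date on which the person has certainly reached `age` (rounds forward)."""
    year = birth.year + age
    try:
        return birth.replace(year=year)
    except ValueError:  # 29 February in a year that is not a leap year
        return date(year, 3, 1)


def birthday_on_or_before(birth: date, age: int) -> date:
    """Last date that is not later than the birthday in that year (rounds backward)."""
    year = birth.year + age
    try:
        return birth.replace(year=year)
    except ValueError:  # 29 February in a year that is not a leap year
        return date(year, 2, 28)


born = date(1980, 2, 29)
print(birthday_on_or_after(born, 18))   # eligibility: 1 March 1998
print(birthday_on_or_before(born, 18))  # deadline: 28 February 1998
```

The two functions agree for every birth date except 29 February, which is exactly why the difference tends to go unnoticed in testing.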

What about statistics? Here you have a lot of other problems:
  • Total numbers per month don't compare well for February, which changes its length every 4 years.
  • Numbers per day for a month usually don't make sense either, because the number of weekend days in a month varies, and numbers often depend on weekdays.
Other programming problems
  • Comparing. If you have two timestamps from different sources, you need to define many things to be able to say timestamp1=timestamp2.
  • Round-off errors often mean that timestamp1+timeinterval-timeinterval<>timestamp1. When you have deadlines, it can become very tricky to decide whether a deadline has been reached or not.
  • Uniqueness of timestamps: Some programmers want to use timestamps as primary key. Some database systems even support that, but what happens when you transport these timestamps to other parts of your software, will they still be unique?
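The round-off problem is easy to reproduce when timestamps are floating point day counts. The sketch below uses deliberately small, artificial values (0.1 and 0.2 days), and the function name is invented for this example. It also shows that integer-based arithmetic, here Python's `datetime` and `timedelta`, does not have the problem.

```python
from datetime import datetime, timedelta


def float_round_trip_is_exact(timestamp_days: float, interval_days: float) -> bool:
    """True if adding and then subtracting the interval gives back the same float."""
    return (timestamp_days + interval_days) - interval_days == timestamp_days


# Floating point day counts: the round trip can land on a neighbouring value.
print(float_round_trip_is_exact(0.1, 0.2))  # False

# datetime and timedelta count whole microseconds, so the round trip is exact.
stamp = datetime(2007, 11, 21, 12, 0, 0)
interval = timedelta(hours=4, minutes=48)
print(stamp + interval - interval == stamp)  # True
```

If floating point timestamps cannot be avoided, comparing with an explicit tolerance, or rounding to a defined resolution before comparing, is a common workaround.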
The worst problem is the specification of timezones and daylight saving time:
  • Many people don't understand the difference between GMT offsets for time zones and time stamps. Example: In Denmark, which is located in the GMT+1 time zone, we use GMT+2 time stamps during the summer.
  • Terms like "CET" have many meanings, depending on who you ask. It can be the time in Germany in winter, it can be the current time in Germany, and it can describe the time zone which includes France and Germany, which did not have the same daylight saving time rules 30 years ago. In case of the "CET time zone", a historical timestamp may be useless without knowing if it applies to a location in France or Germany.
Is that it? No. Here's the absolutely biggest problem: When programmers give names to variables and functions, they don't precisely describe what they do. In Delphi, there is a constant named SysUtils.MinsPerDay. It is defined as MinsPerDay=24*60. Does that make sense? For some programs, yes. For others, definitely not.

In Windows, the standard functions to convert between local time and UTC time do not handle historical timestamps well, but documentation isn't good at explaining that. On Windows, you should always make days 24 hours and 86400 seconds, and always use local time, unless you really know what you're doing.

Here are some examples of bad time related functionality:
  • DateTime.IsDaylightSavingTime Method - The text says "Indicates whether this instance of DateTime is within the Daylight Saving Time range for the current time zone.", but what if there are two different daylight saving time ranges for the time zone? The problem is that the documentation equates a time zone with a daylight saving time rule set.
  • GetDynamicTimeZoneInformation - The information returned by this function can be invalid just after it returns. This function basically doesn't make sense. It should have taken a time stamp as a parameter.
  • TimeZone.GetUtcOffset Method - It calculates the UTC offset from the local time. However, once a year, the same local time repeats itself for one hour, with two different UTC offsets. So this method doesn't make sense.
There are many more examples out there.
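The repeated hour can be demonstrated with Python's `fold` attribute, which exists precisely to tell the two occurrences apart. This is a sketch with an invented helper name, `candidate_utc_offsets`, and it assumes Python 3.9 or later with time zone data installed.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # needs system tz data (or the tzdata package)


def candidate_utc_offsets(wall_time: datetime, location: str) -> list:
    """Distinct UTC offsets that a naive wall-clock reading can map to at `location`."""
    zone = ZoneInfo(location)
    offsets = {
        wall_time.replace(tzinfo=zone, fold=fold).utcoffset() for fold in (0, 1)
    }
    # Note: a wall-clock time skipped in spring also yields two entries;
    # this sketch does not tell the two situations apart.
    return sorted(offsets)


ambiguous = datetime(2007, 10, 28, 2, 30)  # occurs twice in Denmark
ordinary = datetime(2007, 11, 21, 2, 30)   # occurs once

print(candidate_utc_offsets(ambiguous, "Europe/Copenhagen"))
print(candidate_utc_offsets(ordinary, "Europe/Copenhagen"))
```

An API that accepts only the local time has no way to express which of the two offsets the caller means, which is the objection raised against TimeZone.GetUtcOffset above.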

Linux works very differently - it counts seconds everywhere and uses geographic locations for GMT offset calculations. This works extremely well, and Linux only converts to year/month/day representations when interacting with the user. However, even though Linux can support leap seconds, most apps probably won't work well if you enable them.

This blog post doesn't cover all the kinds of trouble we programmers face. There's much more. My advice is to try to keep things simple and prepare for the worst.

Friday 16 November 2007

How to get girls into programming

The Sun tools team have blogged about the perils of abstraction. They say something like "we need to stop thinking abstract everything".

I don't know where they got this from - not all programmers try to abstract everything. Some programmers hate abstraction, and love the detail, and no, they are not unintelligent. In fact, some of these guys and girls can be brilliant programmers, creating much more user friendly applications that users love.

As every psychologist will know, humans' brains are not wired the same way. We have strong preferences for ways of thinking, and the same information is not handled the same way in different brains. If you could have two identical people with different brain wirings but the same knowledge, and you put them into exactly the same situation, they would extract different knowledge from that situation.

The masterminds behind software architecture often favor abstract thinking over details. They are good at spotting abstract information, creating abstract knowledge from experiences etc., but they usually don't put much value into minor details, like "it looks ugly" or "that's not what the customer said". If you put 5 abstract-thinking people together in a team, you will get a result that is abstract and possibly horrible.

If you want a well-designed product, architecture, specs etc., you need to involve people with different brain skills. Psychologists say that the sexes have different brain skills (T/F) that relate to exactly this problem.

I believe the biggest problem in IT is the lack of product quality, and not the lack of girls. But I do believe that these two problems are closely related, and solved using the same management techniques.

Thursday 15 November 2007

Flash RAM instead of harddisks

Flash drives have become larger and less expensive, and it doesn't take a lot of experimenting to find out that a laptop can become faster, quieter and more robust, and get a longer battery life, if you replace your harddisk with flash RAM. And then there's the fact that good quality flash RAM systems easily outlive even very expensive harddisks.

What do flash drives mean for software developers? Here are a couple of consequences:
  • When multiple threads compete for disk access, responsiveness will benefit greatly. This means that background threads that access disks will become more common.
  • Disk space becomes more expensive for a while, favoring apps that don't waste space too much.
  • It becomes less necessary to prefetch small amounts of data from the local disk. For some applications, this can reduce RAM usage.
  • Reduced seek time means that different file formats may become optimal. This includes different ways of indexing, but it may also mean less redundancy in file formats.
Because of the huge benefits of flash RAM, less disk space will be considered acceptable, and this often makes it realistic to have more RAM in the PC than there is flash RAM. That makes it an obvious choice to cache everything on the flash - and it can be cached by the OS file system or by the application.

Friday 9 November 2007

The price of using GUIDs in databases

There has been some discussion about the use of GUIDs lately. A GUID is a 128-bit integer that is picked randomly. That is obviously a good thing if your database needs more than 2^64 ≈ 18×10^18 records, but because it is 128 bits wide, you can also be quite sure that this random number has not already been used somewhere else. The difference between an autoincrementing 128-bit integer and a GUID is that GUID values are always picked randomly.

It makes sense to apply GUIDs when:
  • No specific order is required
  • 128 bit is not considered a waste of space
  • A very small chance of not succeeding to pick a unique number is ok
  • Values cannot be produced in one place, or having no specific order is a feature
Microsoft recommends using GUIDs as primary keys because it enables replication between different databases. When you do that, the chance of conflicts is very small. For instance, two databases with 1 billion records each can be merged easily, and the chance of a primary key conflict is only about 10^9×(10^9/10^38) = 10^-20.
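The estimate above can be reproduced with exact arithmetic. This sketch assumes keys are drawn uniformly at random from the whole key space, and the function name `cross_collision_estimate` is made up for this example. The value it returns is the expected number of matching pairs, which is an upper bound on the probability of any conflict.

```python
from fractions import Fraction


def cross_collision_estimate(records_a: int, records_b: int, key_space: int) -> Fraction:
    """Expected number of keys shared between two sets of uniformly random keys."""
    return Fraction(records_a * records_b, key_space)


# The rounded key space of 10^38 used in the text.
rounded = cross_collision_estimate(10**9, 10**9, 10**38)
# The full 128-bit key space, which is somewhat larger than 10^38.
full = cross_collision_estimate(10**9, 10**9, 2**128)

print(float(rounded))
print(float(full))
```

Real GUID generators do not necessarily use all 128 bits for randomness, so the true figure depends on the generator, but it stays in the same negligible range.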

However, it comes at a price. There is a big chance that two records that are added shortly after each other are related. For instance, if you want to save an invoice, there may be 5 records that describe items on that invoice, which are added as part of the same transaction. If a database server uses autoincrementing integer values as primary keys, and fully or partially sorts records physically by this primary key, these 5 records will probably go into 1 or 2 places on the harddisk. If GUIDs were used, they would be stored in 5 different places on the harddisk. This is one of the reasons why GUID-based databases are usually on servers that have more RAM than they have data - they need to cache everything.

Another price comes when debugging. You need more IQ to debug code than to write code, so it is important that you optimize for debugging. It must be easy to see that the data stored in the database is correct. GUIDs are not always the easiest keys to read, especially not in developer databases, which tend to have very little data and therefore very small numbers in autoincrementing integer fields.

Friday 2 November 2007

Best Practice in Software Development

IBM has a nice page on Best Practice in software development. It's amazing what such a page doesn't list. For instance, UML is the only method listed for design, even though there are alternatives and UML has known caveats.

It also mentions "Keep it simple" and "Information hiding" as some of the most important principles. I totally disagree. I consider "Make complex things easy to use" as the most important principle. It is ok for things to be complex, and it is ok not to hide information, but it is unforgivable to create something that is too complicated for others to use. A software developer should spend most of his/her time on making complex things easier to use for others.

Best Practice methods come with preconditions, and these are absent from the page, too. There are different kinds of software development projects and different kinds of project teams, and they require different methods. There's a huge difference between developing control software for a moon rocket, developing search algorithms and creating user interfaces for database applications. Unfortunately, it seems that most attempts to define Best Practice forget about preconditions.