Friday, 9 November 2007

The price of using GUIDs in databases

There has been some discussion about the use of GUIDs lately. A GUID is a 128-bit integer that is picked randomly, and that is obviously a good thing, if your database needs more than 264=18×1018 records, but because it is 128 bit, you can be quite sure that this random number has not already been used somewhere else. The difference between an autoincrementing 128 bit integer and GUID is, that GUID values are always picked randomly.

It makes sense to apply GUIDs when:
  • No specific order is required
  • 128 bit is not considered a waste of space
  • A very small chance of not succeeding to pick a unique number is ok
  • Values cannot be produced in one place, or having no specific order is a feature
Microsoft recommends to use GUIDs as primary key because it enables replication between different databases. When you do that, the chance of having conflicts is very small - for instance, two databases with each 1 billion records, can have these merged easily, and the chance of primary key conflict is only 109×(109/1038)=10-20.

However, it comes at a price. There is a big chance, that two records, that are added shortly after each other, are related. For instance, if you want to save an invoice, there may be 5 records that describe items on that invoice, which are added as part of the same transaction. If a database server uses autoincrementing integer values as primary key, and fully or partially physically sorts records by this primary key, these 5 records will probably go into 1 or 2 places on the harddisk. If GUIDs were used, they would be stored in 5 different places on the harddisk. This is one of the reasons why GUID-based databases are usually on servers that have more RAM than they have data - they need to cache everything.

Another price comes when debugging. You need more IQ to debug code than to write code, so it is important that you optimize for debugging. It must be easy to see, that the data stored in the database is correct. GUIDs are not always the easiest key to read, especially not in developer databases, that tend to have very little data, and therefore very small numbers in autoincrementing integer fields.


Anonymous said...

Physical arrangement on the disk depends on your clustered index, which doesn't have to be on the primary key. Cluster it by createdate, and records inserted close together in time will be close on the disk, no matter what your primary key is.

This isn't necessarily what you want...if you have a lot of inserts, it can be better to cluster by something else so you don't get a "hotspot." In fact, in that situation a cluster on a guid might even be a good idea.

Generally tho, if there is any column for which you typically extract a range of data, use that for your clustered index.

Anonymous said...

If you don't cluster by GUIDs, but want to look up records by GUIDs, then the database will need to make a separate index for the GUIDs, and in that index, the GUID values will be sorted by GUID values.

If you want to look up 5 different GUID values in a secondary index, the harddisk will need to find these in 5 different places. So even if the actual records are located together, you didn't solve the problem, you just added extra lookups and made everything slower.