Friday, 9 January 2009

Virtualization gives configurable risk profiles

As everybody knows, the risk of failure can be reduced using redundancy, so redundancy must be good. Redundancy is also expensive, but fewer hardware boxes with lots of redundancy built in should mean lower costs and less risk at the same time, right? Here is the problem: if you virtualize 100 servers on one system, and that system fails, you have 100 virtual servers that fail at the same time.
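
To see the trade-off in numbers, here is a back-of-the-envelope sketch in Python. The availability figures are assumptions picked only for illustration: many ordinary hosts fail often but rarely all at once, while one consolidated host fails rarely but always takes everything down.

    # Assumed availability figures, purely for illustration.
    small_host_uptime = 0.99    # one ordinary host is up 99% of the time
    big_host_uptime = 0.999     # one redundant, consolidated host: 99.9%
    services = 100

    # 100 services on 100 independent ordinary hosts:
    p_any_down = 1 - small_host_uptime ** services
    p_all_down = (1 - small_host_uptime) ** services

    # 100 virtual servers consolidated on one big host:
    p_total_outage = 1 - big_host_uptime

    print(f"separate hosts: some service is down {p_any_down:.0%} of the time")
    print(f"separate hosts: all services down    {p_all_down:.0e} of the time")
    print(f"one big host:   all services down    {p_total_outage:.1%} of the time")

Consolidation trades many small, survivable failures for rare total outages, and which profile is better depends on the organization.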

Most organizations can survive the failure of one system. If a postal company cannot bill large customers for packages, but can bill everyone else, they still have cash flow until the problem is solved. But what happens if all systems fail at the same time?

Different organizations need different risk profiles. As software developers, we can use virtualization to support this. If our software supports several departments, we can choose how things should fail. If a storage system fails, should all departments lose access to the system (single point of failure = cheap), or just some of them (more expensive)? If the customer wants everything to be as cheap as possible, just put all the software on one physical server using virtualization. If the customer wants the system to be very stable, use several servers in a way that keeps employees productive even when some servers fail.
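
Virtualization turns this failure profile into a pure deployment decision. A small sketch, with hypothetical department and host names, of how the placement of virtual machines decides which departments survive the loss of one physical host:

    # Two hypothetical placements of department VMs onto physical hosts.
    cheap  = {"sales": "host1", "billing": "host1", "support": "host1"}
    robust = {"sales": "host1", "billing": "host2", "support": "host3"}

    def survivors(placement, failed_host):
        """Departments still running after one physical host fails."""
        return [dept for dept, host in placement.items() if host != failed_host]

    print(survivors(cheap, "host1"))   # [] - every department is down
    print(survivors(robust, "host1"))  # ['billing', 'support'] keep working

The software is identical in both cases; only the deployment changes.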

If multiple physical servers are too expensive, use virtual servers on different host hardware, maybe even in different hosting centers. You can rent 10% of a physical host in 10 different centers, and in the end, the costs are about the same as using one physical host in one center.
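
The average capacity is the same in both cases; what changes is the shape of the failures. A sketch of the arithmetic, again assuming 99% uptime per host and independent failures:

    from math import comb

    uptime, hosts = 0.99, 10   # assumed: ten independent hosts, 99% uptime each

    def p_down(k):
        """Probability that exactly k of the hosts are down (binomial)."""
        return comb(hosts, k) * (1 - uptime) ** k * uptime ** (hosts - k)

    # One dedicated host: a failure costs 100% of capacity, about 1% of the time.
    # Ten 10% slices: a single host failure only costs 10% of capacity.
    print(f"chance of losing 10% of capacity: {p_down(1):.1%}")
    print(f"chance of losing half or more:    {sum(p_down(k) for k in range(5, 11)):.0e}")

For the same money, losing everything at once becomes astronomically unlikely; the price is that small 10% losses happen fairly often.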

The Google and Microsoft cloud systems try to solve this problem, too, but until further notice, virtual servers are the most advanced technology available in most organizations. What we can do as software developers is to design our systems so that they work well both distributed across several hosting centers and on one physical server. Our systems must also work in loosely coupled ways, so that the organization can keep running for a while without a specific subsystem.
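
A common way to get that loose coupling is a local store-and-forward queue: each subsystem accepts work into local storage immediately, and a background job forwards it once the other side is reachable again. A minimal sketch, assuming the remote call raises OSError while the other subsystem is down (the file name and message format are made up for the example):

    import json, os

    OUTBOX = "outbox.jsonl"    # local spool file; survives restarts

    def enqueue(message):
        """Accept work immediately, even while the remote subsystem is down."""
        with open(OUTBOX, "a") as f:
            f.write(json.dumps(message) + "\n")

    def flush(send):
        """Forward spooled messages; stop at the first failure and retry later."""
        if not os.path.exists(OUTBOX):
            return
        with open(OUTBOX) as f:
            pending = [json.loads(line) for line in f]
        sent = 0
        for msg in pending:
            try:
                send(msg)      # remote call; assumed to raise OSError when down
                sent += 1
            except OSError:
                break          # remote side unavailable; keep the rest queued
        with open(OUTBOX, "w") as f:
            for msg in pending[sent:]:
                f.write(json.dumps(msg) + "\n")

With this pattern, a department can keep producing work while a central subsystem is down, and the backlog drains automatically when it comes back.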

4 comments:

Anonymous said...

Imho, the point you brought up is basically a discrete, easy-to-calculate problem.

The actual interesting thing to discuss is here: "The Google and Microsoft cloud systems try to solve this problem, too [...]"

Clouds basically hide the hardware. In fact, a cloud "floats" over multiple appliances, and may reduce the risk of failure even further, as the number of appliances is almost unlimited (when talking about Google, Amazon etc) ...

Lars D said...

Clouds are about automating hosting. The actual improvement depends on the algorithms.

If you want to sell a commercial system to the customer, and you want to give advice on how to deploy it, cloud is normally not an option, and won't be for some time.

Anonymous said...

No matter how hard you try, it all usually comes down to a single point of failure.

Typically at the database server.

Failing that, definitely at the electrical socket.

Avoiding failure isn't a plan, it's a delusion - reducing the chances of failure is a good idea, but still not a plan. What you do when failure happens, now that is a plan.

Lars D said...

I don't agree. If an Exchange server goes down, the local Outlook client still works, and the office user can make appointments, write e-mails, call contacts etc. Changes will not go online until the server is online again, but most users can continue working.

Another example is the use of a Subversion repository. You can still do programming with Delphi on your local hard disk without a connection to the Subversion repository. If the server goes down, you can work. If your local computer goes down, you can check out the repository to another computer and work on that, until your own computer is back online.

If you have 10 geographically distributed departments, you can have 10 local databases and some replication. If one department goes down, the other nine departments may be able to take over some of the load, until everything is online again.
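
As a sketch of that fallback (hypothetical host names, and run_on standing in for whatever database API is used), a client simply prefers its own department's database and falls back to a replicated copy elsewhere:

    # Hypothetical replicas of one department's data, in preference order.
    REPLICAS = ["db.local.example", "db.dept2.example", "db.dept3.example"]

    def query(run_on, sql):
        """Try the local database first, then the other departments' copies."""
        for host in REPLICAS:
            try:
                return run_on(host, sql)   # run_on wraps whatever DB API is used
            except ConnectionError:
                continue                   # this replica is down; try the next
        raise RuntimeError("all replicas are down")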

The customer may choose to host the databases of the 10 geographically distributed departments on one central server, or to host them decentrally. The two can even be combined, so that the offices in West European countries are centralized in one hosting center, whereas the Russian office is hosted locally, so that it can continue to work when its unstable internet connection is down.