Newsletter 74 – Design For Failure

It may sound like a pessimistic view of computing systems, but the fact is that every component in the modern Internet stack has a certain failure rate. Looking at that realistically, and planning for a breakdown so you can manage it better, is essential.

Failures in traditional datacenters

In your own datacenter, or that of your managed hosting provider, sit racks and racks of servers. Typically a proactive system administrator will keep a lot of spare parts around: hard drives, switches, additional servers and so on. Although you don’t need them now, you don’t want to be in the position of having to order new equipment when something fails. That would increase your recovery time dramatically.

Besides keeping extra components lying around, you also typically want to avoid the so-called single point of failure. That means dual power systems, switches, database servers, webservers and so on. We also see RAID as more or less standard in modern servers now, because the loss of a commodity SATA drive is so common. Yet this redundancy makes that loss a non-event. We expect it, and so we design for it.

And while we are prudent enough to perform backups regularly and document the layout of our systems, rarely is the environment in a traditional datacenter completely scripted. Attempts to test backups and restore the database may be common, but a full fire drill to rebuild everything is rarer.

Failure in the Cloud

In the last decade we saw Linux on commodity hardware take over as the Internet platform of choice because of the huge cost differential compared to traditional hardware from vendors such as Sun or HP. The hardware was more likely to fail, but at 1/10th the price you could build in redundancy to cover yourself and still save money.
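
To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. All of the numbers are hypothetical, chosen only for illustration. It also assumes servers fail independently of one another, which shared power, network, or software faults can violate in practice.

```python
# Back-of-the-envelope comparison: one expensive server vs. several cheap ones.
# Every number here is a made-up assumption, not a measurement.

def all_down_probability(p_single: float, count: int) -> float:
    """Chance that `count` servers are all down at once, assuming independent failures."""
    return p_single ** count

traditional_cost = 10_000.0              # hypothetical price of one high-end server
traditional_p_down = 0.01                # hypothetical chance it is down at a given moment

commodity_cost = traditional_cost / 10   # "1/10th the price", as in the text
commodity_p_down = 0.05                  # assumed: five times more likely to be down

servers = 3
cluster_cost = commodity_cost * servers
cluster_p_down = all_down_probability(commodity_p_down, servers)

print(f"traditional:  cost={traditional_cost:.0f}, p(down)={traditional_p_down:.6f}")
print(f"commodity x{servers}: cost={cluster_cost:.0f}, p(all down)={cluster_p_down:.6f}")
```

Under these assumed figures, three commodity servers cost less than the single traditional one and are less likely to all be down at the same moment.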

The latest wave of cloud providers is bringing the same kind of cost savings. But cloud-hosted servers, for instance in Amazon EC2, are much less reliable than the typical rack-mounted servers you might have in your datacenter.

We all agree that planning for disaster recovery is a really good idea, but it sometimes gets pushed aside by other priorities. In the cloud it moves front and center as an absolute necessity. This forces a new, more robust approach: rebuilding your environment with scripts that document and formalize your processes.
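
As a minimal sketch of what a scripted rebuild might look like, the Python below runs a list of named steps in order and times each one, so the same script serves as both documentation and fire drill. The step names and their empty bodies are hypothetical placeholders. A real environment would call its own provisioning, restore, and deployment tooling there.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]

def run_rebuild(steps: List[Step]) -> List[Tuple[str, float]]:
    """Run each rebuild step in order and return (name, seconds taken).

    An exception in any step stops the run, so a broken step shows up
    during a drill rather than during a real outage.
    """
    timings = []
    for name, step in steps:
        started = time.monotonic()
        step()
        timings.append((name, time.monotonic() - started))
    return timings

# Hypothetical placeholder steps; each would wrap real tooling.
def provision_server() -> None:
    pass

def restore_database() -> None:
    pass

def deploy_application() -> None:
    pass

rebuild_plan: List[Step] = [
    ("provision server", provision_server),
    ("restore database", restore_database),
    ("deploy application", deploy_application),
]

results = run_rebuild(rebuild_plan)
for name, seconds in results:
    print(f"{name}: {seconds:.3f}s")
```

Keeping the plan as an ordered list means the rebuild sequence is written down in one place, and the timings give a rough measure of recovery time each time the drill is run.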

This is all a good thing, as hardware failure then becomes an expected occurrence. Failures are a given; it’s how quickly you recover that makes the difference.
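
One common way to put a number on this is the steady-state availability formula, MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The short Python sketch below compares two recovery times for the same failure rate. The hour figures are hypothetical and only there to show the effect.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time between failures
    and mean time to recover (steady-state approximation)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures: the same failure rate with two different recovery times.
mtbf = 500.0            # assumed: one failure roughly every 500 hours
manual_recovery = 24.0  # assumed: a day to order parts and rebuild by hand
scripted_recovery = 0.5 # assumed: half an hour to rebuild from scripts

slow = availability(mtbf, manual_recovery)
fast = availability(mtbf, scripted_recovery)

print(f"manual recovery:   {slow:.4%} available")
print(f"scripted recovery: {fast:.4%} available")
```

With the failure rate held fixed, shortening the recovery time is the only thing that changes, and it alone raises availability.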

Book Review:

Cloud Application Architectures by George Reese
I originally picked up this book expecting a very hands-on guide to cloud deployments, especially on EC2. That is not what this book is, though. It’s actually a very good CTO-targeted book, covering difficult questions like cost comparisons between cloud and traditional datacenter hosting, security implications, disaster recovery, performance and service levels. The book is very readable, and not overly technical.

  • Matt Maldre

    Huh. my comment disappeared.

    • Sean Hull

      Please repost Matt. When did you comment?

      • Matt Maldre

        Just a few moments before the published comment. I haven’t seen that problem happen on disqus before that. Anyhow…

        Here’s approximately what I said: When you tweeted, “How to design for failure” I originally thought you meant graphic design. It’s interesting to compare failure in system design vs graphic design today. As a graphic designer, I can say that perhaps we (graphic designers) have gotten too careful with our designs. What’s that saying? If you don’t ever fail, then you aren’t pushing yourself enough. Of course, in the world of system design, it’s not so much about pushing yourself to experiment to fail, as it’s more about the possibility of the system failing and putting safe measures in place.

  • Sean Hull

    Yep, exactly Matt. It’s all about possibilities & scenarios and trying to mitigate potential damage.