3 Things DevOps Can Learn from Aviation

I recently went on a flight with a pilot friend & two others. With only three passengers, you're up front and center for all the action. Besides getting to chat directly with the pilot, you also see all of the many checks & balances in place. They call it the Warrior aviation checklist. It inspired this post on disaster recovery in the datacenter.

Join 8000 others and follow Sean Hull on twitter @hullsean.

1. Have a real plan B

Looking over the document you can see there are procedures for engine failure in flight! That's right, those small planes can have an engine failure and still glide. So for starters, the technology is designed for failure. But the technology design is not the end of the story.

When you have procedures and processes in place, and training to deal with failure, you’re expecting and planning for it.

Also check out 5 Conversational Ways to Evaluate Great Consultants.

[quote]
Expect & plan for failure. Use technologies designed to fail gracefully, and build your software to do the same. Then put monitoring in place to detect failures & alert you, so you can put your recovery procedures into action quickly.
[/quote]
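
To make "fail gracefully" concrete, here's a minimal sketch in Python: retry a flaky remote call with exponential backoff instead of crashing on the first error. The fetch_report function in the usage note is a hypothetical stand-in for any call that can fail.

[code]
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure to your alerting
            delay = base_delay * (2 ** attempt)
            print("Attempt %d failed (%s); retrying in %.1fs" % (attempt + 1, exc, delay))
            time.sleep(delay)

# Usage -- fetch_report is a hypothetical flaky remote call:
# report = with_retries(fetch_report)
[/code]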

2. Use checklists

Checklists are an important part of good process. We use them for code deploys. We can use them for disaster recovery too. But how to create them?

Firedrills! That's right, run through the entire process of disaster recovery with your team & document it. As you do so, you'll be creating the master checklist for real disaster recovery. Keep it in a safe place, or better yet multiple places, so you'll have it when you really need it.
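
Here's a toy checklist runner in Python, a sketch of how a firedrill can double as checklist creation. The steps below are hypothetical examples; substitute the items from your own runbook.

[code]
# Hypothetical example steps -- replace with your own runbook items.
CHECKLIST = [
    "Verify the latest database snapshot exists and is restorable",
    "Provision a replacement server from your base image",
    "Restore the database from the snapshot",
    "Point the application config at the restored database",
    "Run smoke tests against the rebuilt stack",
]

def run_firedrill(steps):
    """Walk the team through each step, requiring explicit confirmation."""
    for i, step in enumerate(steps, 1):
        answer = input("Step %d: %s -- done? [y/n] " % (i, step))
        if answer.strip().lower() != "y":
            print("Stopped at step %d. Document what blocked you." % i)
            return False
    print("Firedrill complete. Fold anything you learned back into the checklist.")
    return True

if __name__ == "__main__":
    run_firedrill(CHECKLIST)
[/code]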

Read this: Scalability Tips & Greatest Hits.

3. Trust your instruments

Modern infrastructures include tens or hundreds of servers. With cloud instances so simple to spin up, we add new services and servers daily. What to do?

You obviously need to use the machines to monitor the machines. Have a good monitoring system like Nagios, and metrics collection such as New Relic, to help you stay ahead of failure.

Set up the right monitoring automation, and then trust the instruments!
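
At bottom, an instrument like this is just an automated check run on a schedule. Here's a minimal health-check sketch in Python; the URL is a placeholder, and in practice a tool like Nagios runs checks like this for you and handles the alerting.

[code]
import urllib.request

def check_http(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers DNS failures, timeouts, connection refused
        return False

if __name__ == "__main__":
    # Placeholder URL -- point this at your own health endpoint.
    if not check_http("https://example.com/health"):
        print("ALERT: health check failed -- trust the instrument and investigate.")
[/code]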

Related: Real disaster recovery lessons from Sandy.

Get some in your inbox: Exclusive monthly Scalable Startups. We share tips and special content. Here's a sample:

Amazon Web Services – What is it and why is it important?

Amazon Web Services is a division of Amazon the bookseller, but this part of the business is devoted solely to infrastructure and internet servers. These are the building blocks of data centers, the workhorses of the internet. AWS's offering of cloud computing solutions allows a business to set up, or "spinup" in the jargon of cloud computing, new compute resources at will. Need a small single-CPU 32-bit Ubuntu server with two 20GB disks attached? It's one command and 30 seconds away!
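
Today that "one command" is typically an API call. Here's a minimal sketch using boto3, the AWS SDK for Python; the AMI ID is a placeholder, and t2.micro stands in for the small single-CPU server described above.

[code]
import boto3

# Connect to EC2 in one region; credentials come from your environment.
ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one small Ubuntu server with two 20GB EBS volumes attached.
# The AMI ID is a placeholder -- look up a real Ubuntu image for your region.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {"DeviceName": "/dev/sdb", "Ebs": {"VolumeSize": 20}},
        {"DeviceName": "/dev/sdc", "Ebs": {"VolumeSize": 20}},
    ],
)
print("Launched:", instances[0].id)
[/code]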

As we discussed previously, infrastructure provisioning has evolved dramatically over the past fifteen years, from something that took time and cost a lot to the fast, automatic process it is today with cloud computing. This has also brought a dramatic culture shift in the way systems administration is done: from a fairly manual process of physical machines and software configuration, where setting up new services took weeks, to a scriptable and automatable process that takes seconds.

This new realm of cloud computing infrastructure and provisioning is called Infrastructure as a Service, or IaaS, and Amazon Web Services is one of the largest providers of such compute resources. They're not the only ones, of course. Others include:

  • Rackspace Cloud
  • Joyent
  • GoGrid
  • Terremark
  • 3Tera
  • IBM
  • Microsoft
  • Enomaly
  • AT&T

Cloud computing is still in its infancy, but it is growing quickly. Amazon themselves had a major data center outage in April, which we discussed in detail. It sent some hot internet startups into a tailspin!

More discussion of Amazon Web Services on Quora – Sean Hull

Introduction to EC2 Cloud Deployments

Cloud computing holds a lot of promise, but there are also a lot of speed bumps along the way.

In this six-part series we're going to cover a lot of ground. We don't intend this series to be an overly technical, nuts-and-bolts how-to. Rather, we will discuss high-level issues and answer questions that come up for CTOs, business managers, and startup CEOs.

Some of the tantalizing issues we’ll address include:

  • How do I make sure my application is built for the cloud with scalability baked into the architecture?
  • I know disk performance is crucial for my database tier.  How do I get the best disk performance with Amazon Web Services & EC2?
  • How do I keep my AWS passwords, keys & certificates secure?
  • Should I be doing offsite backups as well, or are snapshots enough?
  • Cloud providers such as Amazon seem to have poor SLAs (service level agreements).  How do I mitigate this using availability zones & regions?
  • Cloud hosting environments like Amazon's provide no perimeter security. How do I use security groups to ensure my setup is robust and bulletproof? (See the sketch after this list.)
  • Cloud deployments change the entire procurement process, handing a lot of control over to the web operations team.  How do I ensure that finance and ops are working together, and a ceiling budget is set and implemented?
  • Reliability of Amazon EC2 servers is much lower than that of traditional hosted servers. Failure is inevitable. How do we use this fact to our advantage, forcing discipline in the deployment and disaster recovery processes? How do I make sure my processes are scripted & firedrill tested?
  • Snapshot backups and other data stored in S3 are somewhat less secure than I’d like.  Should I use encryption to protect this data?  When and where should I use encrypted filesystems to protect my more sensitive data?
  • How can I best use availability zones and regions to geographically disperse my data and increase availability?
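
As a taste of the security groups question above, here's a minimal sketch using boto3, the AWS SDK for Python. The group name and admin IP are hypothetical placeholders: HTTPS is open to the world, SSH is restricted to one admin address, and everything not explicitly allowed is denied by default.

[code]
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a group that admits HTTPS from anywhere and SSH from one admin
# address; traffic not explicitly allowed is denied by default.
sg = ec2.create_security_group(
    GroupName="web-tier",
    Description="Web tier: HTTPS open, SSH restricted",
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},  # hypothetical admin IP
    ],
)
[/code]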

As we publish each of the individual articles in this series we’ll link them to the titles below.  So check back soon!

  • Building Highly Scalable Web Applications for the Cloud
  • Managing Security in Amazon Web Services
  • MySQL Databases in the Cloud – Best Practices
  • Backup and Recovery in the Cloud – A Checklist
  • Cloud Deployments – Disciplined Infrastructure
  • Cloud Computing Use Cases

Newsletter 74 – Design For Failure

It may sound like a pessimistic view of computing systems, but the fact is that all of the components that make up the modern internet stack have a certain failure rate. So looking at that realistically, and planning for breakdowns so you can manage them better, is essential.

Failures in traditional datacenters

In your own datacenter, or that of your managed hosting provider, sit racks and racks of servers. Typically a proactive system administrator will keep a lot of spare parts around: hard drives, switches, additional servers, etc. Although you don't need them now, you don't want to have to order new equipment when something fails. That would increase your recovery time dramatically.

Besides keeping extra components lying around, you also typically want to avoid the so-called single point of failure: dual power systems, switches, database servers, webservers, etc. We also see RAID as more or less standard now in all modern servers, since the loss of a commodity SATA drive is so common. Yet this redundancy makes such a loss a non-event. We expect it, and so design for it.

And while we are prudent enough to perform backups regularly and document the layout of our systems, rarely is the environment in a traditional datacenter completely scripted. Although testing backups and restoring the database may be common, a full fire drill to rebuild everything is rarer.

Failure in the Cloud

In the last decade we saw Linux on commodity hardware take over as the internet platform of choice, because of the huge cost differential compared to traditional hardware from the likes of Sun or HP. The commodity hardware was more likely to fail, but being 1/10th the price meant you could build in redundancy to cover yourself and still save money.

The latest wave of cloud providers brings the same type of cost savings. But cloud hosted servers, for instance in Amazon EC2, are much less reliable than the typical rack-mounted servers you might have in your datacenter.

Planning for disaster recovery, we all agree, is a really good idea, but sometimes it gets pushed aside by other priorities. In the cloud it moves front and center as an absolute necessity. This forces a new, more robust approach: rebuilding your environment with scripts, documenting and formalizing your processes.
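
A hedged sketch of what "rebuilding with scripts" can look like, again using boto3, the AWS SDK for Python. The AMI, snapshot ID, and instance type are hypothetical placeholders: launch a replacement server, restore a data volume from your latest snapshot, and attach it.

[code]
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Launch a replacement instance from a known-good image.
#    The AMI ID is a placeholder -- use your own baked image.
run = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = run["Instances"][0]["InstanceId"]

# 2. Restore the data volume from the latest snapshot (placeholder ID).
vol = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",
    AvailabilityZone="us-east-1a",
)

# 3. Wait for both to be ready, then attach the restored volume.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(VolumeId=vol["VolumeId"], InstanceId=instance_id, Device="/dev/sdf")
[/code]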

This is all a good thing, as hardware failure becomes an expected occurrence. Failures are a given; it's how quickly you recover that makes the difference.

Book Review:

Cloud Application Architectures by George Reese
I originally picked up this book expecting a very hands-on guide to cloud deployments, especially on EC2. That is not what this book is, though. It's actually a very good CTO-targeted book, covering difficult questions like cost comparisons between cloud and traditional datacenter hosting, security implications, disaster recovery, performance, and service levels. The book is very readable, and not overly technical.