AirBNB didn't have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s cloud infrastructure. AirBNB was one of the biggest, but Heroku, Reddit, Minecraft, Flipboard and Coursera went down with it. It’s not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is a zone, and there are many in each of their service regions including US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud.

Today one of those datacenters in the Northern Virginia region had a failure. What does this mean? Essentially firms like AirBNB that hosted their applications ONLY in Northern Virginia experienced outages.

As it turns out, Amazon has a service level agreement of 99.95% availability. We’ve long since said goodbye to the five nines. HA is overrated.
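To put that number in perspective, here is a rough back-of-the-envelope calculation of how much downtime each target allows. It assumes the percentage is measured over a full 365-day year, which is an assumption for illustration and not a statement of how Amazon’s SLA is actually measured.

```python
# Allowed downtime implied by an availability percentage.
# Assumption: availability is measured over a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


sla = downtime_hours_per_year(99.95)          # the SLA figure quoted above
five_nines = downtime_hours_per_year(99.999)  # the classic "five nines"

print(f"99.95%  -> {sla:.2f} hours/year")
print(f"99.999% -> {five_nines * 60:.2f} minutes/year")
```

Under that assumption, 99.95% leaves room for a bit more than four hours of downtime a year, while five nines allows only about five minutes.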

2. Use Redundancy

Although there are lots of pieces and components to a web infrastructure, two big ones are webservers and database servers. Turns out AirBNB could make both of these tiers redundant. How do we do it?

On the database side, you can use Amazon’s Multi-AZ deployments or, alternatively, read replicas. Each has different service characteristics, so you’ll have to evaluate your application to figure out what will work for you.

Then there is the option to host MySQL or Percona directly on Amazon servers yourself and use replication.

[quote]Using redundant components like placing webservers and databases in multiple regions, AirBNB could avoid an Amazon outage like Monday’s that affected only Northern Virginia.[/quote]
When do I want RDS versus MySQL? Here are some use cases for RDS versus roll-your-own MySQL.

Now that you’re using multiple zones and regions for your database, the hard work is done. Webservers can be hosted in different regions easily, and they don’t require complicated replication to do it.
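As a minimal sketch of what that redundancy buys you, the snippet below picks the first healthy endpoint from a list spread across regions. The hostnames, the `pick_endpoint` helper and the simulated health check are all hypothetical; a real setup would lean on DNS failover or a load balancer rather than application code like this.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical endpoints, one per region. The hostnames are made up.
REPLICAS: List[Tuple[str, str]] = [
    ("us-east-1", "db.us-east-1.example.com"),
    ("us-west-2", "db.us-west-2.example.com"),
    ("eu-west-1", "db.eu-west-1.example.com"),
]


def pick_endpoint(
    replicas: List[Tuple[str, str]],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first host that passes the health check, or None."""
    for _region, host in replicas:
        if is_healthy(host):
            return host
    return None


def us_east_down(host: str) -> bool:
    """Simulated health check: everything in us-east-1 is failing."""
    return "us-east-1" not in host


print(pick_endpoint(REPLICAS, us_east_down))
```

With only the first entry in the list, the same outage would leave the application with nowhere to go.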

3. Have a browsing only mode

Another step AirBNB can take to be resilient is to build a browsing only mode into their application. Often we hear about this option for performing maintenance without downtime. But it’s even more valuable during a situation like this. In a real outage you don’t have control over how long it lasts or WHEN it happens. So a browsing only mode can provide real insurance.

For a site like AirBNB this would mean the entire website was up and operating. Customers could browse and view listings; only when they went to book a room would they encounter an error. This would affect a very small segment of their customers, and make for a much less painful PR problem.

Facebook has experienced intermittent outages of its service. People hardly notice because they’ll often only see a message when they are trying to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

[quote]A browsing only mode can make a big difference, keeping most of the site up even when transactions or publishing are blocked.[/quote]

Drupal, an open source CMS that powers sites like Adweek.com, TheHollywoodReporter.com, and Economist.com, uses this technology. It supports a browsing only mode out of the box. An Amazon outage like this one would only stop editors from publishing new stories temporarily. A huge win for sites that get 50 to 100 million (with an m) pageviews per month.
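Here is a minimal sketch of a browsing only mode, assuming a single switch that every write path checks. The `Site` class, its methods and the sample listing are hypothetical and only illustrate the idea; they are not how AirBNB, Facebook or Drupal implement it.

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the site is browse-only."""


class Site:
    def __init__(self) -> None:
        self.read_only = False                   # the browsing only switch
        self.listings = {1: "Loft in Brooklyn"}  # hypothetical sample data
        self.bookings = []

    def view_listing(self, listing_id: int) -> str:
        # Reads keep working no matter what state the switch is in.
        return self.listings[listing_id]

    def book(self, listing_id: int, guest: str) -> None:
        # Writes are refused with a friendly error while in browse-only mode.
        if self.read_only:
            raise ReadOnlyModeError("Booking is temporarily unavailable.")
        self.bookings.append((listing_id, guest))


site = Site()
site.read_only = True  # flip the switch when the primary database is down

print(site.view_listing(1))
try:
    site.book(1, "alice")
except ReadOnlyModeError as err:
    print(err)
```

Browsing carries on as normal, and only the customer trying to book sees a message.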

4. Web Applications need Feature Flags

Feature flags give you an on/off switch. Build them into heavy duty parts of your site, and you can disable those in an emergency. Host components in multiple availability zones for extra peace of mind.

One of our all-time most popular posts, 5 Things Toxic to Scalability, included some in-depth discussion of feature flags.
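A feature flag can be as simple as a dictionary lookup wrapped around the expensive code path. The sketch below assumes an in-memory `FLAGS` dict and a made-up `search_suggestions` feature; in practice the flags would live in a config store you can change without a deploy.

```python
# Hypothetical flags. In production these would come from a config store.
FLAGS = {"search_suggestions": True, "photo_upload": True}


def is_enabled(name: str) -> bool:
    """Unknown flags default to off, which is the safe choice."""
    return FLAGS.get(name, False)


def render_search_page(query: str) -> list:
    page = [f"results for {query!r}"]
    # The heavy duty part of the page is only built when its flag is on.
    if is_enabled("search_suggestions"):
        page.append("suggestions")
    return page


print(render_search_page("paris"))   # normal operation
FLAGS["search_suggestions"] = False  # emergency: switch the feature off
print(render_search_page("paris"))   # the page still renders, minus the extra
```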

5. Consider Netflix’s Simian Army

Netflix takes a very progressive approach to availability. They bake redundancy and automation right into all of their infrastructure. Then they run an app called the Chaos Monkey which essentially causes outages, randomly. If resilience from constantly falling and getting back up can’t make you stronger, I don’t know what can!

Take a look at the Netflix blog for details on intentional load & stress testing.
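The idea behind that kind of testing can be shown with a toy simulation. This is not Netflix’s Chaos Monkey and it touches no real servers. The fleet, the instance names and the `service_is_up` check are assumptions made up for the example.

```python
import random


def terminate_random_instance(fleet: set, rng: random.Random) -> str:
    """Remove one instance chosen at random and return its name."""
    victim = rng.choice(sorted(fleet))
    fleet.remove(victim)
    return victim


def service_is_up(fleet: set) -> bool:
    """In this toy model the service is up while any instance survives."""
    return len(fleet) > 0


rng = random.Random(42)                  # seeded so runs are repeatable
fleet = {"web-1a", "web-1b", "web-1c"}   # hypothetical instances in three zones

victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim}, service up: {service_is_up(fleet)}")
```

A fleet of one would fail this test on the first run, which is exactly the kind of weakness you want to find before a real outage does.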

6. Use multiple cloud providers

If all of the above isn’t enough for you, you can take it further and do as George Reese of enstratus recommends: use multiple cloud providers. Not being beholden to one company could help in more situations than just this type of service disruption, too.

Basic EC2 Best Practices mean building redundancy into your infrastructure. Multiple cloud providers simply take that one step further.


  • Igor Stravinsky

    What is “reduncandy”? Sounds tasty!

  • mrbungle

    Proofread much?
    Drupal, an open source CMS system that powers sites like Adweek.com, thehollywoodreporter.com *** more examples *** uses this technology.
    A huge win to sites that get 50 to 100 million with-an-m pageviews per month.

    • hullsean

      Thx for catching that. Will update that shortly.

  • http://twitter.com/horovits Dotan Horovits

    Thanks Sean for the great post. I’ve been following the AWS outages myself. 18 months ago, during the first major outage in the US-EAST-1 region, I analyzed the posts and data shared by the various websites that were affected by the outage, and more importantly by the ones which survived the outage, and extracted useful proven patterns and best practices (backed by the actual case studies) that one can implement to protect his system from such outages.
    There are quite a few similarities to what you mentioned in your post. Amazing how ever more relevant these patterns are, and how much they are still not common among cloud-based architectures, even after all the outages (5 major outages in the past 18 months in the US-EAST-1 region alone!)
    I’ve been using these in my architectures and can testify that it’s simple and it works!
    http://horovits.wordpress.com/2011/05/16/retrospect-on-recent-aws-outage-and-resilient-cloud-based-architecture/

    • hullsean

      Thx for the comment Dotan. That is a great post. When this recent AWS outage was first being reported, I quickly found your blog post. Good stuff.

  • http://www.paulgraydon.co.uk Twirrim

    Might be worth updating the post, the multi-az redundancy provided by RDS was directly impacted, from all reports I’ve seen it completely failed to do its job. People have been paying extra to get what seems to have ended up being no discernable benefit!

    • hullsean

      Thx for the comments Twirrim. RDS is a mixed bag, and isn’t the solution for everybody. It *seems*, though I don’t know for sure, that Multi-AZ uses something like DRBD under the hood, whereas read replicas use MySQL replication. So the latter is resilient enough for alternate regions, while Multi-AZ, being a distributed block device, needs to be closer in proximity, otherwise the latency would kill you during 2-phase commit.

      Often I recommend customers go with roll-your-own Percona, as then you have control over things. Keep it simple with lots of MySQL statement-based replication slaves and webservers in alternate regions, combine that with a browsing only mode, and you should be able to withstand an outage like Amazon had.

  • adamsb6

    In addition to RDS, the Elastic Load Balancers also got themselves into a bad state that would require some manual intervention.

    In our case, we saw that our ELB had started replying that it had 3 IPs routing to the app servers we host in two availability zones, none of which rely on EBS. Two of those IPs were fine, but one was just accepting connections and timing them out. This effectively meant that 1/3 of all requests were timing out.

    In hindsight, we could have replicated the DNS portion of the ELB manually, setting up a new A record with the two good IPs, and then switching our CNAME to the ELB to point to the new A record instead. That would have been susceptible to the ELB suddenly switching up its IPs, but we could have provided a better user experience until that happened.

    • hullsean

      All good points @adamsb6. Thx for the comments.

  • http://www.QuoteCloudComputing.com/ QuoteCloudComputing

    Great post. I like option #6 the best. Too many companies rely on one cloud provider. To achieve maximum up time, you need to use several providers….bottom line. Thanks for the great info!

    • hullsean

      Yep. Agreed. Not always as simple to architect, but for world class availability it’s crucial.
