Tag Archives: disasterrecovery

AirBNB didn't have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s Cloud infrastructure. AirBNB was one of the biggest, but also Heroku, Reddit, Minecraft, Flipboard & Coursera down with it. Its not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is a zone, and there are many in each of their service regions including US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud.

Today one of those datacenters in the Northern Virginia region had a failure. What does this mean? Essentially firms like AirBNB that hosted their applications ONLY in Northern Virginia experienced outages.

As it turns out, Amazon has a service level agreement of 99.95% availability. We’ve long since said goodbye to the five nines. HA is overrated.

2. Use Redundancy

Although there are lots of pieces and components to a web infrastructure, two big ones are webservers and database servers. Turns out AirBNB could make both of these tiers redundant. How do we do it?

On the database side, you can use Amazon’s multi-az or alternately read-replicas. Each have different service characteristics so you’ll have to evaluate your application to figure out what will work for you.

Then there is the option to host mysql or Percona directly on Amazon servers yourself and use replication.

[quote]Using redundant components like placing webservers and databases in multiple regions, AirBNB could avoid an Amazon outage like Monday’s that affected only Northern Virginia.[/quote]
When do I want RDS versus mysql? Here are some use cases for RDS versus roll your own MySQL.

Now that you’re using multiple zones and regions for your database the hard work is completed. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

3. Have a browsing only mode

Another step AirBNB can take to be resilient is to build a browsing only mode into their application. Often we hear about this option for performing maintenance without downtime. But it’s even more valuable during a situation like this. In a real outage you don’t have control over how long it lasts or WHEN it happens. So a browsing only mode can provide real insurance.

For a site like AirBNB this would mean the entire website was up and operating. Customers could browse and view listings, only when they went to book a room would the encounter an error. This would be a very small segment of their customers, and a much less painful PR problem.

Facebook has experience intermittent outages of it’s service. People hardly notice because they’ll often only see a message when they are trying to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

[quote]A browsing only mode can make a big difference, keeping most of the site up even when transactions or publish are blocked.

Drupal, an open source CMS system that powers sites like Adweek.com, TheHollywoodReporter.com, and Economist.com uses this technology. It supports a browsing only mode out of the box. An amazon outage like this one would only stop editors from publishing new stories temporarily. A huge win to sites that get 50 to 100 million with-an-m pageviews per month.

4. Web Applications need Feature Flags

Feature flags give you an on/off switch. Build them into heavy duty parts of your site, and you can disable those in an emergency. Host components multiple availability zones for extra peace of mind.

One of our all time most popular posts 5 Things Toxic to Scalability included some indepth discussion of feature flags.

5. Consider Netflix’s Simian Army

Netflix takes a very progressive approach to availability. They bake redundancy and automation right into all of their infrastructure. Then they run an app called the Chaos Monkey which essentially causes outages, randomly. If resilience from constantly falling and getting back up can’t make you stronger, I don’t know what can!

Take a look at the Netflix blog for details on intentional load & stress testing.

6. Use multiple cloud providers

If all of the above isn’t enough for you, taking it further you’d do as George Reese of enstratus recommends and use multiple cloud providers. Not being beholden to one company could help in more situations than just these type of service disruptions too.

Basic EC2 Best Practices mean building redundancy into your infrastructure. Multiple cloud providers simply take that one step further.

Read this far? Grab our newsletter on scalability and startups!

Road War Story – Hacking Inflight Solutions


The 2am phone call

Last summer I got my call from the president at 2am.  Actually it was my former boss at Hollywood Reporter.  I had worked there three months previous, and they had since hired an outsourced DBA solution.  Big outsource, big chops.  And big fail.



12 hours to liftoff

I was scrambling to pack my luggage to go on summer vacation.  I was bound for SF at the moment and my flight was leaving in the morning.  I was trying to wrap up loose ends and my former boss was entreating me – “Can you help us?  Our replication setup has just melted down.  We need you to cleanup the mess.”

The so-called pain point

After a few more early am Skype calls and chats, the team retired for the night and I finished packing my bags.   I snuck in an hour of sleep then headed straight for the airport.  Once through airport security, I bust out my laptop and start logging into the servers.

Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine differences.  Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both databases.  (For the curious, here is how) I find countless records different between the two databases.

Now my flight is boarding, so I pack up the laptop and find my seat.  As soon as the seat belt lights flash off, I’m flipping open my macbook at getting inflight wifi working.  Through the flight I’m on SKYPE with the team, with command line terminals open to the servers.  Discuss, debug, troubleshoot – rinse, repeat.

From there I write up a report and explain to the team & CTO the problem.  Syncing that many different records is too risky.  We’d have to review all the statements one-by-one.  I’d rather rebuild replication from scratch.

From there the CTO gives the go ahead, and with the help of Percona’s xtrabackup to do online hotbackups, we are able to fix replication without downtime. Amen to that!

Now with our primary MySQL database and secondary read-only one back online, things calm down a lot.  Traffic returns to a smooth predictable 2 million pageviews per day.  That’s smooth and predictable on a site that gets 50 million a month!   The database loads are calm and steady, as our all of our nerves.   In the coming days we continue to monitor the situation, and write up lengthly root cause analysis of the situation.

Freelancers & Consultants take note

To my recent Consulting 101 article I would add the following bullets:

  • Responsiveness is crucial
  • Be there when a client needs you, and your value goes up.  Be reliable, and loyal to those you’ve worked with.

  • Be an integral part of your team
  • Everyone knows eachother virtual or in real life, and are comfortable with the parts they play.  A team that can work together is crucial, whether it’s all fulltime folks, some consultants, some outsourced or wherever they may be.  Each has a role to play, and communication and team work brings it all together.

  • Have laptop will travel
  • I never turn down a job.  There will be plenty of time for vacations and rest when the dust settles.

  • Don’t break things
  • If there is any doubt in your mind, test, and test again.  Always err on the side of caution.  Check thrice and cut once!  If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with options to be sure.  And most importantly, if you don’t login to the systems you’re working on regularly, you better make damn sure you’re on the right box, flipping the right switch, and moving the right dials.  With modern internet infrastructure, there are a hundred ways to push the wrong red button!

    CTOs and Directors of Operations take note

  • Small & Nimble wins the day
  • I’ve used this value proposition before when speaking to prospects.  You can hire a big firm, and be a small fish to them.  Small fish means you’re gonna get less attention.  OR you can hire a small firm or contractor.  Then you’ll be a big fish to him or her.  Guess what?  If you’re their big fish, they’re gonna pay extra attention to every move they make, and ensure things don’t break.  They can’t afford mistakes, not to their reputation or their bottom line.  Not like the big boys can.

  • Choose passionate, yet conservative & risk averse operations folks
  • In developers you’re building technology, features, and forging ahead into new solutions.  The role is more to create waves, and break barriers.  How can we enable new business processes and so forth?

    In hiring operations personnel you want stability.  Look for individuals who are more risk averse.  This conservative streak is a countering force.  Ops teams are tasked with that job of bringing a steady state to your business services.  They don’t want to wake up at 2am in the morning.