Dropbox is currently experiencing a *major* outage. See the Dropbox status page for updates.
Join 32,000 others and follow Sean Hull on twitter @hullsean.
I’ve written about outages a lot before. Are these types of major failures avoidable? Can we build better, with redundant services so everything doesn’t fall over at once?
Here’s my take.
1. Browse only mode
The first thing Dropbox can do to be more resilient is to build a browsing only mode into the application. Often we hear about this option for performing maintenance without downtime. But it’s even more important during a real outage like the one Dropbox is currently experiencing.
Not if but *when* it happens, you don’t have control over how long it lasts. So browsing only can provide you with real insurance.
For a site like Dropbox it would mean that the entire website is still up and operating. Customers can browse their documents, view listings of files & download those files. However they would not be able to *add* or change files during the outage. Thus only a very small segment of customers is interrupted, and it becomes a much smaller PR problem to manage.
Facebook has experienced outages of service. People hardly notice because they’ll often only see a message when they try to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.
A browsing only mode can make a big difference, keeping most of the site up even when transactions or publishing are blocked.
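As a rough illustration of the idea, here is a minimal sketch of a browse only switch. The `FileStore` class, its `read_only` flag and `ReadOnlyError` are hypothetical names invented for this example; this is not how Dropbox or Facebook actually implement it. Reads keep working while writes are refused with a clear error.

```python
from dataclasses import dataclass, field


class ReadOnlyError(Exception):
    """Raised when a write is attempted while browse only mode is on."""


@dataclass
class FileStore:
    # Hypothetical in-memory stand-in for a file service.
    files: dict = field(default_factory=dict)
    read_only: bool = False  # operators flip this during an outage

    def list_files(self):
        # Reads are always allowed.
        return sorted(self.files)

    def download(self, name):
        # Reads are always allowed.
        return self.files[name]

    def upload(self, name, data):
        # Writes are refused while browse only mode is on.
        if self.read_only:
            raise ReadOnlyError("uploads are temporarily disabled")
        self.files[name] = data
```

In a real application the flag would more likely live in shared configuration, so that every webserver sees the change at the same time.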
Drupal is an open source platform that powers big publishing sites like Adweek, hollywoodreporter.com & economist.com. It supports a browsing only mode out of the box. An outage like this one would only stop editors from publishing new stories temporarily. It would be a huge win to sites that get 50 to 100 million with-an-m visitors per month.
2. Redundancy
There are lots of components to a web infrastructure. Two big ones are webservers & databases. Turns out Dropbox could make both tiers redundant. How do we do it?
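On the database side, you can take advantage of Amazon’s RDS & either read-replicas or Multi-AZ. Each has different service characteristics: read-replicas are asynchronously updated copies that can serve read traffic, while Multi-AZ keeps a synchronous standby in another availability zone for automatic failover. You’ll need to evaluate your app to figure out what works best.
You can also host MySQL, Percona or MariaDB directly on Amazon instances yourself & then use replication.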
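To make the replication idea concrete, here is a small sketch of how an application might route queries once replicas exist. The `Database` and `Router` classes are hypothetical stand-ins, not a real driver or an RDS API. Writes go to the primary, and reads try the replicas first, falling back to the primary only if no replica answers.

```python
import random


class Database:
    # Hypothetical stand-in for a database endpoint; not a real driver.
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def execute(self, sql):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"{self.name}: {sql}"


class Router:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def write(self, sql):
        # Writes always go to the primary.
        return self.primary.execute(sql)

    def read(self, sql):
        # Try replicas in random order, then fall back to the primary.
        shuffled = random.sample(self.replicas, len(self.replicas))
        for db in shuffled + [self.primary]:
            try:
                return db.execute(sql)
            except ConnectionError:
                continue
        raise ConnectionError("no database endpoint is reachable")
```

With this arrangement a failed primary blocks writes, but reads still succeed from a replica, which is exactly the situation a browsing only mode is meant to handle.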
By using redundant components, such as webservers and databases placed in multiple regions, Dropbox could avoid a major outage like the one it is experiencing this weekend.
Once you’re using multiple zones & regions for your database, the hard work is done. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.
3. Feature flags
On/off switches are something we’re all familiar with. We have them in the fuse box in our house or apartment. And you’ll also find a larger main shutoff in the basement.
Individual on/off switches are valuable because they allow us to disable inessential features. We can build them into heavier parts of a website, allowing us to shut down features in an emergency. Host components in multiple availability zones for extra peace of mind.
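Here is a minimal sketch of what such switches could look like in code. The flag names (`thumbnails`, `sharing`) and the `render_file_page` function are made up for illustration. The point is only that each heavy feature checks its switch before doing any work.

```python
class FeatureFlags:
    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        # Unknown flags are treated as off, the safe default.
        return self._flags.get(name, False)

    def disable(self, name):
        self._flags[name] = False

    def enable(self, name):
        self._flags[name] = True


def render_file_page(flags, filename):
    # The core listing always renders; extras depend on their switch.
    parts = [f"File: {filename}"]
    if flags.is_enabled("thumbnails"):
        parts.append("[thumbnail]")
    if flags.is_enabled("sharing"):
        parts.append("[share button]")
    return " ".join(parts)
```

During an emergency an operator would call something like `flags.disable("thumbnails")`, and the page keeps rendering without the expensive part.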
4. Simian armies
Netflix has taken a more progressive & proactive approach to outages. They introduce their own! Yes, that’s right: they bake redundancy & automation right into all of their infrastructure, then have a loose cannon piece of software called Chaos Monkey that periodically kills servers. Did I hear that right? Yep, it actually knocks components offline, to actively test the system for resiliency.
Take a look at the Netflix blog for details on intentional load & stress testing.
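The principle can be tried out on a toy model. The sketch below is not Netflix’s Chaos Monkey and calls no cloud API; the `fleet` dictionary and both function names are assumptions made for this example. It marks one random instance as down and then checks whether the service as a whole is still up.

```python
import random


def kill_random_instance(fleet, rng):
    """Mark one randomly chosen running instance as down; return its name."""
    running = [name for name, up in fleet.items() if up]
    if not running:
        return None
    victim = rng.choice(running)
    fleet[victim] = False
    return victim


def service_is_up(fleet):
    # The service survives as long as at least one instance is running.
    return any(fleet.values())
```

Running `kill_random_instance` repeatedly against a fleet of one shows immediately why a single server is a single point of failure, while a fleet of three survives the first two kills.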
5. Multiple clouds
If all these suggestions aren’t enough for you, you could take it further and do what George Reese of enstratus recommends: use multiple cloud providers. Not being dependent on one company could help in many situations, not just the ones described here.
Basic Amazon EC2 best practices require building redundancy into your infrastructure. Virtual servers & on-demand components are even less reliable than the commodity hardware we’re familiar with. Because of that, we must use Amazon’s automation to insure us against expected failure.