Tag Archives: resilience

How can startups learn from the Dyn DNS outage?

storm coming

As most have heard by now, last Friday saw a serious DDOS attack against one of the major US DNS providers, Dyn.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

DNS being such a critical dependency, this affected many businesses across the board. We’re talking twitter, etsy, github, Airbnb & Reddit to name just a few. In fact Amazon Web Services itself was severely affected. And with so many companies hosting on the Amazon cloud, it’s no wonder this took down so much of the internet.

1. What happened?

According to Brian Krebs, a Mirai botnet was responsible for the attack. What’s even scarier, those requests originated for IOT devices. You know, baby monitors, webcams & DVRs. You’ve secured those right? 🙂

Brian has posted a list of IOT device makers that have backdoors & default passwords and are involved. Interesting indeed.

Also: Is a dangerous anti-ops movement gaining momentum?

2. What can be done?

Companies like Dyn & Cloudflare among others spend plenty of energy & engineering resources studying attacks like this one, and figuring out how to reduce risk exposure.

But what about your startup in particular? How can we learn from these types of outages? There are a number of ways that I outline below.

Also: How do we lock down systems from disgruntled engineers?

3. What are your dependencies?

After an outage like the Dyn one, it’s an opportunity to survey your systems. Take stock of what technologies, software & services you rely on. This is something your ops team can & likely wants to do.

What components does your stack rely on? Which versions are hardest to upgrade? What hardware or services do you rely on? Which APIs do you call out to? Which steps or processes are still manual?

Related: The myth of five nines

4. Put your eggs in many baskets

Awareness around your dependencies, helps you see where you may need to build in redundancy. Can you setup a second cloud provider for DR? Can you use an alternate API to get data, when your primary is out? For which dependencies are your hands tied? Where are your weaknesses?

Read: Is AWS too complex for small dev teams?

5. Don’t assume five nines

The gold standard in technology & startup land has been 5 nines availability. This is the SLA we’re expected to shoot for. I’ve argued before (see: myth of five nines) that it’s rarely ever achieved. Outages like this one, bringing hours long downtime, kill hour 5 nines promise for years. That’s because 5 nines means only 5 ½ minutes downtime per year!

Better to be realistic that outages can & will happen, manage & mitigate, and be realistic with your team & your customers.

Also: Is AWS a patient that needs constant medication?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Why Dropbox didn’t have to fail

dropbox outage dec 2015

Dropbox is currently experiencing a *major* outage. See the dropbox status page to get an update.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

I’ve written about outages a lot before. Are these types of major failures avoidable? Can we build better, with redundant services so everything doesn’t fall over at once?

Here’s my take.

1. Browse only mode

The first thing Dropbox can do to be more resilient is to build a browsing only mode into the application. Often we hear about this option for performaing maintenance without downtime. But it’s even more important during a real outage like Dropbox is currently experiencing.

Not if but *when* it happens, you don’t have control over how long it lasts. So browsing only can provide you with real insurance.

For a site like Dropbox it would mean that the entire website is still up and operating. Customers can browse their documents, view listings of files & download those files. However they would not be able to *add* or change files during the outage. Thus only a very small segment of customers is interrupted, and it becomes a much smaller PR problem to manage.

Facebook has experienced outages of service. People hardly notice because they’ll often only see a message when they try to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

A browsing only mode can make a big difference, keeping most of the site up even when transactions or publish are blocked.

Drupal is an open source platform that powers big publishing sites like Adweek, hollywoodreporter.com & economist.com. It supports a browsing only mode out of the box. An outage like this one would only stop editors from publishing new stories temporarily. It would be a huge win to sites that get 50 to 100 million with-an-m visitors per month.

Also: Is Amazon too big to fail

2. Redundancy

There are lots of components to a web infrastructure. Two big ones are webservers & databases. Turns out Dropbox could make both tiers redundant. How do we do it?

On the database side, you can take advantage of Amazon’s RDS & either read-replicas or Multi-AZ. Each have different service characteristics, so you’ll need to evaluate your app to figure out what works best.

You can also host MySQL, Percona or Mariadb direclty on Amazon instances yourself & then use replication.


Using redundant components like placing webservers and databases in multiple regions, Dropbox could avoid a major outage like they’re experiencing this weekend.

Wondering about MySQL versus RDS? Here are some uses cases.

Now that you’re using multiple zones & regions for your database the hard work is completed. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

Related: Are SQL databases dead?

3. Feature flags

On/off switches are something we’re all familiar with. We have them in the fuse box in our house or apartment. And you’ll also find a bigger larger shutoff in the basement.

Individual on/off switches are valuable because they allow us to disable inessential features. We can build them into heavier parts of a website, allowing us to shutdown features in an emergency. Host components in multiple availability zones for extra piece of mind.

Read: 5 Things toxic to scalability

4. Simian armies

Netflix has taken a more progressive & proactive approach to outages. They introduce their own! Yes that’s right they bake redundancy & automation right into all of their infrastructure, then have a loose canon piece of software called Chaos Monkey that periodically kills servers. Did I hear that right? Yep it actually nocks components offline, to actively test the system for resiliency.

Take a look at the Netflix blog for details on intentional load & stress testing.

Also: When hosting data on Amazon turns bloodsport

5. Multiple clouds

If all these suggestions aren’t enough for you, taking it further you could do what George Reese of enstratus recommends and use multiple cloud providers. Not being dependant on one company could help in many situations, not just the ones described here.

Basic Amazon EC2 best practices require building redundancy into your infrastructure. Virtual servers & on-demand components are even less reliable than commodity hardware we’re familiar with. Because of that, we must use Amazon’s automation to insure us against expected failure.

Also: Why I like Etsy’s site performance report

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters