As most have heard by now, last Friday saw a serious DDoS attack against Dyn, one of the major US DNS providers.
Join 32,000 others and follow Sean Hull on twitter @hullsean.
Because DNS is such a critical dependency, the attack affected businesses across the board. We’re talking Twitter, Etsy, GitHub, Airbnb & Reddit, to name just a few. In fact, Amazon Web Services itself was severely affected. And with so many companies hosting on the Amazon cloud, it’s no wonder this took down so much of the internet.
1. What happened?
According to Brian Krebs, a Mirai botnet was responsible for the attack. What’s even scarier, those requests originated from IoT devices. You know, baby monitors, webcams & DVRs. You’ve secured those, right? 🙂
Brian has posted a list of IoT device makers whose products have backdoors & default passwords and were involved in the attack. Interesting indeed.
2. What can be done?
Companies like Dyn & Cloudflare among others spend plenty of energy & engineering resources studying attacks like this one, and figuring out how to reduce risk exposure.
But what about your startup in particular? How can we learn from these types of outages? There are a number of ways that I outline below.
3. What are your dependencies?
An outage like the Dyn one is an opportunity to survey your systems. Take stock of the technologies, software & services you rely on. This is something your ops team can do, and likely wants to do.
What components does your stack rely on? Which versions are hardest to upgrade? What hardware or services do you rely on? Which APIs do you call out to? Which steps or processes are still manual?
Related: The myth of five nines
4. Put your eggs in many baskets
Awareness of your dependencies helps you see where you may need to build in redundancy. Can you set up a second cloud provider for DR? Can you use an alternate API to get data when your primary is out? For which dependencies are your hands tied? Where are your weaknesses?
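To make the alternate API idea concrete, here is a minimal sketch of trying a primary data source and falling back to a secondary one. The names `fetch_with_fallback`, `primary_api` and `alternate_api` are hypothetical stand-ins, not part of any real library, and the primary is hard-coded to fail so the fallback path is visible.

```python
from typing import Callable, Sequence


def fetch_with_fallback(sources: Sequence[Callable[[], str]]) -> str:
    """Try each source in order and return the first successful result."""
    errors = []
    for source in sources:
        try:
            return source()
        except Exception as exc:  # real code should catch narrower exceptions
            errors.append(exc)
    # Every source failed, so surface all the errors together
    raise RuntimeError(f"all {len(errors)} sources failed: {errors}")


def primary_api() -> str:
    # Hypothetical stand-in that simulates the primary provider being down
    raise ConnectionError("primary provider unreachable")


def alternate_api() -> str:
    # Hypothetical stand-in for a second provider that is still up
    return "data from alternate provider"


result = fetch_with_fallback([primary_api, alternate_api])
print(result)
```

In practice you would probably also want timeouts on each call, so a slow primary doesn’t hold up the switch to the alternate.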
5. Don’t assume five nines
The gold standard in technology & startup land has been 5 nines availability. This is the SLA we’re expected to shoot for. I’ve argued before (see: myth of five nines) that it’s rarely ever achieved. Outages like this one, bringing hours-long downtime, kill your 5 nines promise for years. That’s because 5 nines means only about 5 ¼ minutes of downtime per year!
Better to accept that outages can & will happen, manage & mitigate them, and be realistic with your team & your customers.
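The downtime budget behind an availability target is simple arithmetic, and it’s worth running the numbers yourself. The sketch below assumes a 365-day year, and the two-hour outage is a made-up figure for illustration, not the measured length of the Dyn incident.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def years_of_budget_consumed(outage_minutes: float,
                             availability_percent: float) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability_percent)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per year")

# Hypothetical two-hour outage measured against a 5 nines target
print(f"{years_of_budget_consumed(120, 99.999):.1f} years of budget used")
```

At 5 nines the budget works out to roughly 5.26 minutes a year, so a single two-hour outage uses up more than two decades of it.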