Tag Archives: fail

How can startups learn from the Dyn DNS outage?

storm coming

As most have heard by now, last Friday saw a serious DDoS attack against one of the major US DNS providers, Dyn.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

DNS being such a critical dependency, this affected many businesses across the board. We’re talking Twitter, Etsy, GitHub, Airbnb & Reddit, to name just a few. In fact Amazon Web Services itself was severely affected. And with so many companies hosting on the Amazon cloud, it’s no wonder this took down so much of the internet.

1. What happened?

According to Brian Krebs, a Mirai botnet was responsible for the attack. What’s even scarier, those requests originated from IoT devices. You know, baby monitors, webcams & DVRs. You’ve secured those, right? 🙂

Brian has posted a list of IoT device makers whose products ship with backdoors & default passwords and were involved in the attack. Interesting indeed.

Also: Is a dangerous anti-ops movement gaining momentum?

2. What can be done?

Companies like Dyn & Cloudflare among others spend plenty of energy & engineering resources studying attacks like this one, and figuring out how to reduce risk exposure.

But what about your startup in particular? How can we learn from these types of outages? There are a number of ways that I outline below.

Also: How do we lock down systems from disgruntled engineers?

3. What are your dependencies?

An outage like the Dyn one is an opportunity to survey your systems. Take stock of the technologies, software & services you rely on. This is something your ops team can & likely wants to do.

What components does your stack rely on? Which versions are hardest to upgrade? What hardware or services do you rely on? Which APIs do you call out to? Which steps or processes are still manual?

Related: The myth of five nines

4. Put your eggs in many baskets

Awareness of your dependencies helps you see where you may need to build in redundancy. Can you set up a second cloud provider for DR? Can you use an alternate API to get data when your primary is out? For which dependencies are your hands tied? Where are your weaknesses?
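
To make the alternate-API idea concrete, here’s a minimal Python sketch of a failover fetch. The endpoints are hypothetical placeholders; the pattern of trying an independent second provider applies to any external data dependency (assumes the widely used requests library):

```python
import requests  # third-party HTTP client, assumed installed

# Hypothetical endpoints: a primary data API plus an independent fallback provider.
ENDPOINTS = [
    "https://api.primary-provider.example.com/v1/data",
    "https://api.fallback-provider.example.net/v1/data",
]

def fetch_data():
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, timeout=3)  # fail fast so the fallback gets its turn
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # this provider is down or slow; try the next one
    raise RuntimeError("all providers unreachable -- page the on-call engineer")
```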

Read: Is AWS too complex for small dev teams?

5. Don’t assume five nines

The gold standard in technology & startup land has been 5 nines availability. This is the SLA we’re expected to shoot for. I’ve argued before (see: myth of five nines) that it’s rarely ever achieved. Outages like this one, bringing hours-long downtime, kill your 5 nines promise for years. That’s because 5 nines means only 5 ½ minutes of downtime per year!
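
The arithmetic is easy to verify with a few lines of Python:

```python
# Downtime budget implied by each availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines in (3, 4, 5):
    availability = 1 - 10 ** -nines           # e.g. 0.99999 for five nines
    budget = MINUTES_PER_YEAR * 10 ** -nines  # allowed downtime per year
    print(f"{availability:.5f} availability -> {budget:.2f} minutes down/year")

# Five nines allows about 5.26 minutes per year, so a single
# hours-long outage blows the budget for decades.
```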

Better to accept that outages can & will happen, manage & mitigate them, and be honest with your team & your customers.

Also: Is AWS a patient that needs constant medication?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

How 1and1 failed me

1and1 fail

I manage this blog myself. Not just the content, but also the technology it runs on. The systems & servers are from a hosting company called 1and1.com. And recently I had some serious problems.

Join 31,000 others and follow Sean Hull on twitter @hullsean.

The publishing platform, WordPress, was a few versions out of date. Because of that, some vulnerabilities surfaced.

1. Malware from Odessa

While my eyes were on content, some Russian hackers managed to scan my server &, thanks to the older version of WordPress, found a way to install malware onto the box. This would be invisible to most users, but was nevertheless dangerous. As a domain with a fifteen-year history, it has some credibility among the algorithms & search engines. There’s some trust there.

Google identified the malware and emailed me about it. That was the first alert I received, in mid-August. It came a few days before I left for vacation, but given the severity, I jumped on the problem right away.

Also: Why I say Always be publishing

2. Heading off a lockout

I ordered up a new server from 1and1.com to rebuild. I then set to work moving over content, and completely reinstalled the latest version of WordPress.

Since the malware files had been hidden inside the old theme, I eliminated that whole directory & all its files, and configured the blog with the newest WordPress theme.

Around that time I got some communication from 1and1. As it turns out, they had been notified by Google as well. Makes sense.

Given the shortage of time and my imminent vacation, I quickly called 1and1. As always, their support team was there & easy to reach. This felt reassuring. I explained the issue, how it occurred, and all the details of how the server & publishing system had been rebuilt from the ground up.

This was around August 24th. As I had received emails about a potential lockout, I was reassured by the support specialist that the problem had been resolved to their satisfaction.

Read: Do managers underestimate operational cost?

3. Vacation implosion

I happily left for vacation, knowing that all my hard work had paid off.

Meanwhile, around August 25th, 1and1.com sent me further emails asking for “additional details”. Apparently the “I’m going on vacation” note had not made it to their security division. Another day went by, and since they received no reply from me, the server was locked!

Locked means completely unreachable. Totally offline. No bueno! That’s certainly frustrating, but websites do go down. What happened next was worse.

Since I use Mailchimp to host my newsletter, I write it well in advance each month. Just like clockwork, the emails go out to my 1100 subscribers on September 1st. Many of those are opened & hundreds click on the link. And there they are, faced with a blank browser screen. Nothing. Zilch! Offline!

Also: Why I use Airbnb chat even when texting is easier

4. The aftermath

As I return to connectivity, I begin sifting through my emails. I receive quite a few from friends & colleagues explaining that they couldn’t view my newsletter. I immediately remember my conversation with 1and1, their assurances that the server wouldn’t be locked, and that all was well. I’m thinking, “I bet that server got locked anyway.” Damn it, I’m angry.

Taking a deep breath, I call up 1and1 and get on the line with a support tech. Being careful not to show my frustration, I explain the situation again. I also explain that my server has been down for two weeks, and that it was offline at the key moment when my newsletter went out.

The tech is able to reach out to the security department & explain things again. Without any additional changes to my server or technical configuration, they are then able to unlock it. Sad proof of a bureaucratic mixup if there ever was one.

Also: Is Amazon too big to fail?

5. Reflections on complexity

For me this example illustrates the complexity in modern systems. As the internet gets more & more complex, some argue that we are building a sort of house of cards. So many moving parts, so many vendors, so many layers of software & so many pieces to patch & update.

As things get more complex, there are more cracks for hackers to exploit. And patching those up becomes ever more daunting.

Related: Are we fast approaching cloud-mageddon?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

When fat fingers take down your business

apple sad mac fail

Join 14,000 others and follow Sean Hull on twitter @hullsean.

GitHub goes nuclear

I was flipping through Reddit last night and hit this crazy story: strange pushes on GitHub. For those who don’t know, GitHub houses source code. It’s version control for the software world. Lots of projects use it to keep track of change management.

Jenkins is a continuous integration platform. Someone working on the project accidentally did a force push up to the server, overwriting not only their own work, but that of hundreds of other plugins unrelated to their project.

This is like doing a demolition to put up a new building and taking down all the buildings on your block and the next. Not very neighborly, to say the least. At the time of this writing, they’re still doing cleanup, digging through the rubble.
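
For teams hosting their own repositories, the server can simply refuse this kind of push. A minimal sketch, assuming a bare repository at a hypothetical path (hosted services expose similar branch protection settings):

```python
import subprocess

REPO = "/srv/git/plugins.git"  # hypothetical bare repository on the server

# receive.denyNonFastForwards rejects any push that rewrites published history;
# receive.denyDeletes blocks deleting branches outright.
for key in ("receive.denyNonFastForwards", "receive.denyDeletes"):
    subprocess.run(["git", "-C", REPO, "config", key, "true"], check=True)
```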

Read: Why DynamoDB can increase availability

How to kill a database

I worked at a startup a few years back that had an interesting business model. Users would sit and watch videos, and get paid for their time. Watch the video, note the code, enter the code, earn cash. Somehow the advertisers had found a way to make this work.

The whole infrastructure ran on Amazon EC2 servers and was managed by Rightscale. Well, it was actually managed by a West Coast outsourcing shop whose specialty was managing deployments on Rightscale.

The site kept its information in a MySQL database. They had various scripts to spin up slaves, remaster, switch roles and so forth. Of course MySQL can be finicky and is prone to throwing surprises your way from time to time.

One time this automation failed in a big way, switching production customers over to a database that was still far from finished rebuilding. Since their automation didn’t perform checksums to bulletproof the setup, it couldn’t know that all the data hadn’t finished moving!
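
Here’s a sketch of the kind of safety check that was missing: compare table checksums on master and slave before switching traffic over. Hosts, credentials, and table names are hypothetical, and pymysql is just one of several MySQL drivers; in production you’d also quiesce writes first or use a replication-aware tool like pt-table-checksum:

```python
import pymysql  # assumed MySQL driver; any DB-API connector works similarly

def table_checksum(host, table):
    # CHECKSUM TABLE walks the table and returns a single checksum of its rows.
    conn = pymysql.connect(host=host, user="verify", password="secret", database="app")
    try:
        with conn.cursor() as cur:
            cur.execute(f"CHECKSUM TABLE {table}")
            return cur.fetchone()[1]
    finally:
        conn.close()

# Refuse to promote the slave until its data actually matches the master.
for tbl in ("users", "videos", "payouts"):
    if table_checksum("master.internal", tbl) != table_checksum("slave.internal", tbl):
        raise SystemExit(f"checksum mismatch on {tbl} -- do not fail over!")
```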

Customers sure did notice, though, when the site fell over. Yes, this was a failure of automation. But not of the Rightscale platform; rather, of the outsourcing firm managing the process, checking the pieces & components and ensuring the computer systems did their thing to completion. Huge fail!

Read: Why devops talent is in short supply

Your website will fail

Sites big and small fail. Hopefully these stories illustrate that fact. I’ve said over and over that perfect availability is a pipe dream.

At the end of the day, the difference between the successful sites and the sloppy ones isn’t failure versus perfection. It’s *how* they fail, and how they get back up on their feet. What kind of disaster recovery planning did they do, like many firms in NYC did before and after Sandy?

Also: Why startups need both devs and ops for scalability

Reducing failure

So instead of thinking about eliminating failure, let’s think about *reducing* how often it happens, and when it does, reducing the fallout. One thing you can do is sign up for Scalable Startups, where we share tips once a month on the topic. Meanwhile, try to put these best practices into play.

1. Test your DR plan by running real life fire drills
2. Use more than one hosting provider, data center or cloud provider
3. Give each op or end user the least privileges they need to do their job (see the sketch after this list)
4. Embrace a culture of caution in operations
5. Check, double check and triple check those fat fingers!
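
To make item 3 concrete, here’s a minimal sketch of least privilege at the database layer. The account name, network range, and schema are hypothetical (and CREATE USER ... IF NOT EXISTS needs MySQL 5.7+); the point is that the web tier gets only the DML verbs it needs, with no DROP, ALTER, or GRANT OPTION:

```python
import pymysql  # assumed MySQL driver, connecting as an admin account

STATEMENTS = [
    "CREATE USER IF NOT EXISTS 'webapp'@'10.0.%' IDENTIFIED BY 'changeme'",
    "GRANT SELECT, INSERT, UPDATE, DELETE ON app.* TO 'webapp'@'10.0.%'",
]

conn = pymysql.connect(host="db.internal", user="admin", password="secret")
try:
    with conn.cursor() as cur:
        for stmt in STATEMENTS:
            cur.execute(stmt)  # a fat-fingered app account now can't drop tables
finally:
    conn.close()
```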

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

Road War Story – Hacking Inflight Solutions


The 2am phone call

Last summer I got my call from the president at 2am.  Actually it was my former boss at Hollywood Reporter.  I had worked there until three months earlier, and they had since hired an outsourced DBA solution.  Big outsource, big chops.  And big fail.



12 hours to liftoff

I was scrambling to pack my luggage for summer vacation.  I was bound for SF, and my flight was leaving in the morning.  I was trying to wrap up loose ends while my former boss was entreating me: “Can you help us?  Our replication setup has just melted down.  We need you to clean up the mess.”

The so-called pain point

After a few more early-morning Skype calls and chats, the team retired for the night and I finished packing my bags.   I snuck in an hour of sleep, then headed straight for the airport.  Once through airport security, I bust out my laptop and start logging into the servers.

Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine the differences.  Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both.  (For the curious, here is how.)  I find countless records that differ between the two databases.
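
For a flavor of what that scan looks like, here’s a sketch wrapped in Python for repeatability. The DSN values and schema name are placeholders; pt-table-checksum connects to the master, and its checksum queries replicate down to the slave for comparison:

```python
import subprocess

# Hypothetical master DSN; pt-table-checksum finds slaves through replication.
cmd = [
    "pt-table-checksum",
    "--replicate=percona.checksums",  # store per-chunk checksums in this table
    "--databases=thr",                # hypothetical schema: limit the scan
    "h=master.example.com,u=checksum,p=secret",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)  # the DIFFS column shows which tables don't match
# A non-zero exit status signals differences or errors during the scan.
```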

Now my flight is boarding, so I pack up the laptop and find my seat.  As soon as the seat belt lights flash off, I’m flipping open my MacBook and getting inflight wifi working.  Through the flight I’m on Skype with the team, with command-line terminals open to the servers.  Discuss, debug, troubleshoot; rinse, repeat.

From there I write up a report and explain the problem to the team & CTO.  Syncing that many differing records is too risky; we’d have to review the statements one by one.  I’d rather rebuild replication from scratch.

From there the CTO gives the go-ahead, and with the help of Percona’s xtrabackup to do online hot backups, we are able to fix replication without downtime.  Amen to that!
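
For the curious, the rebuild follows roughly this shape. A minimal sketch with hypothetical paths and credentials, using recent xtrabackup syntax (older releases wrapped the same steps in innobackupex); the real procedure also involves copying the prepared backup to the slave and restarting MySQL there:

```python
import subprocess

BACKUP_DIR = "/backups/base"  # hypothetical target directory

# 1. Hot backup: InnoDB data is copied while the master keeps serving traffic.
subprocess.run(["xtrabackup", "--backup", f"--target-dir={BACKUP_DIR}",
                "--user=backup", "--password=secret"], check=True)

# 2. Prepare: apply the copied redo log so the data files are consistent.
subprocess.run(["xtrabackup", "--prepare", f"--target-dir={BACKUP_DIR}"], check=True)

# 3. xtrabackup records the master's binlog coordinates at backup time;
#    point the slave at them with CHANGE MASTER TO.
with open(f"{BACKUP_DIR}/xtrabackup_binlog_info") as f:
    log_file, log_pos = f.read().split()[:2]
print(f"CHANGE MASTER TO MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};")
```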

Now with our primary MySQL database and secondary read-only one back online, things calm down a lot.  Traffic returns to a smooth, predictable 2 million pageviews per day.  That’s smooth and predictable on a site that gets 50 million a month!   The database loads are calm and steady, as are all of our nerves.   In the coming days we continue to monitor the situation and write up a lengthy root cause analysis.

Freelancers & Consultants take note

To my recent Consulting 101 article I would add the following bullets:

  • Responsiveness is crucial.  Be there when a client needs you, and your value goes up.  Be reliable, and loyal to those you’ve worked with.

  • Be an integral part of the team.  Everyone knows each other, virtually or in real life, and is comfortable with the parts they play.  A team that can work together is crucial, whether it’s all fulltime folks, some consultants, some outsourced, or wherever they may be.  Each has a role to play, and communication and teamwork bring it all together.

  • Have laptop, will travel.  I never turn down a job.  There will be plenty of time for vacations and rest when the dust settles.

  • Don’t break things.  If there is any doubt in your mind, test, and test again.  Always err on the side of caution.  Check thrice and cut once!  If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with options to be sure.  And most importantly, if you don’t log into the systems you’re working on regularly, you’d better make damn sure you’re on the right box, flipping the right switch, and moving the right dials.  With modern internet infrastructure, there are a hundred ways to push the wrong red button!

CTOs and Directors of Operations take note

  • Small & nimble wins the day.  I’ve used this value proposition before when speaking to prospects.  You can hire a big firm and be a small fish to them.  Small fish means you’re gonna get less attention.  Or you can hire a small firm or contractor; then you’ll be a big fish to him or her.  Guess what?  If you’re their big fish, they’re gonna pay extra attention to every move they make and ensure things don’t break.  They can’t afford mistakes, not to their reputation or their bottom line.  Not like the big boys can.

  • Choose passionate, yet conservative & risk-averse operations folks.  With developers you’re building technology and features, forging ahead into new solutions.  That role is about making waves and breaking barriers: how can we enable new business processes, and so forth?  In hiring operations personnel, you want stability.  Look for individuals who are more risk averse.  This conservative streak is a countering force.  Ops teams are tasked with bringing a steady state to your business services.  They don’t want to wake up at 2am.