Tag Archives: availability

When fat fingers take down your business

apple sad mac fail

Join 14,000 others and follow Sean Hull on twitter @hullsean.

Github goes nuclear

I was flipping through reddit last night, and hit this crazy story. strange pushes on GitHub. For those who don’t know, github houses source code. It’s version control for the software world. Lots of projects use it, to keep track of change management.

Jenkins is a continuous integration platform. Someone working on the project accidentally did a force push up to the server. They overwrote not only their own work, but the work of hundreds of other plugins unrelated to his own project.

This is like doing a demolition to put up a new building, and taking down all the buildings on your block and the next. Not very neighborly, to say the list. They’re still at the time of this writing, doing cleanup, and digging through the rubble.

Read: Why DynamoDB can increase availability

How to kill a database

I worked a startup a few years back that had an interesting business model. Users would sit and watch videos, and get paid for their time. Watch the video, note the code, enter the code, earn cash. Somehow the advertisers had found a way to make this work.

The whole infrastructure ran on Amazon EC2 servers, and was managed by Rightscale. Well it was actually managed by an west coast outsourcing shop, whose specialty was managing deployments on Righscale.

The site kept it’s information in a MySQL database. They had various scripts to spinup slaves, remaster, switch roles and so forth. Of course MySQL can be finicky and is prone to throwing surprises your way from time to time.

One time this automation failed in a big way, switching over production customers to a database that took way way way too long to rebuild. As their automation didn’t perform checksums to bulletproof the setup it couldn’t know that all the data wasn’t finished moving!

Customers sure did notice though when the site fell over. Yes this was a failure of automation. But not of the Rightscale platform, but of the outsourcing firm managing the process, checking the pieces and components and ensuring the computer systems did their thing to completion. Huge fail!

Read: Why devops talent is in short supply

Your website will fail

Sites big and small fail. Hopefully these stories illustrate that fact. I’ve said over and over why perfect availability is a pipe dream.

At the end of the day, the difference between the successful sites and the sloppy ones isn’t failure and perfection. It’s *how* they fail, and how they get back up on their feet. What type of planning did they do for disaster recovery like many firms in NYC did before and after Sandy.

Also: Why startups need both devs and ops for scalability

Reducing failure

So instead of thinking about eliminating failure, let’s think about *reducing* it from happening, and when it does, reducing the fallout. One thing you can do is signup for scalable startups where we share tips once a month on the topic. Meanwhile try to put these best practices into play.

1. Test your DR plan by running real life fire drills
2. Use more than one hosting provider, data center or cloud provider
3. Give each op or end user the least privileges they need to do their job
4. Embrace a culture of caution in operations
5. Check, double check and triple check those fat fingers!

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

The four-letter-word dividing Dev and Ops

devops divide

What’s that word?  RISK

Operations teams are tasked with stability and uptime. That means working against change, limiting or slowing it down where possible.

Developers are tasked with features and delivering business solutions. For that an ORM layer seems appealing for example. It speeds up & simplifies coding. At the same time it eliminates database drudgery.

For ops who are tasked with uptime, an ORM is a force against scalability. I’ve outlined five things toxic to scalability. They work against performance.

The question remains – do devops folks solve the problem?

Consider the banking crisis

Bankers are tasked with making money for their shareholders. To do this they innovate with financial products. Though you may argue they are unscrupulous at times, capitalism and shareholder value drive them to find profit.

Meanwhile the government’s job is to provide a level playing field.  They enact rules, regulate and provide oversight and auditing. As with operations, this is a conservative role, that avoids risk, and seeks stability, growth and avoidance of recessions and depressions.

These tradeoffs exist in many disciplines. The trick is how we find the balance.

There is an equally interesting question of decoupling in internet architectures. I’ll write a future piece on similar parallels I see in the economy at large.