
When fat fingers take down your business


Join 14,000 others and follow Sean Hull on Twitter @hullsean.

GitHub goes nuclear

I was flipping through reddit last night and hit this crazy story: strange pushes on GitHub. For those who don't know, GitHub houses source code. It's version control for the software world. Lots of projects use it to keep track of change management.

Jenkins is a continuous integration platform. Someone working on the project accidentally did a force push up to the server, overwriting not only their own work but the work of hundreds of other plugins unrelated to their own project.

This is like doing a demolition to put up a new building, and taking down all the buildings on your block and the next. Not very neighborly, to say the least. At the time of this writing, they're still doing cleanup and digging through the rubble.
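There are straightforward server-side guards against this kind of accident. Git's receive.denyNonFastForwards setting is the one-line fix; for illustration, here's a minimal sketch of a pre-receive hook in Python that performs the same check by hand. This is a generic sketch, not how GitHub or the Jenkins project actually configure their repos.

```python
#!/usr/bin/env python3
# Sketch of a server-side pre-receive hook that rejects forced pushes.
# Git feeds the hook one "old-sha new-sha refname" line per updated ref.
import subprocess
import sys

ZERO = "0" * 40

for line in sys.stdin:
    old, new, ref = line.split()
    if old == ZERO or new == ZERO:
        continue  # branch creation or deletion; no history to rewrite
    # A fast-forward push means the old tip is an ancestor of the new tip.
    check = subprocess.run(["git", "merge-base", "--is-ancestor", old, new])
    if check.returncode != 0:
        sys.stderr.write("rejected: push to %s would rewrite history\n" % ref)
        sys.exit(1)
```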

Read: Why DynamoDB can increase availability

How to kill a database

I worked at a startup a few years back that had an interesting business model. Users would sit and watch videos, and get paid for their time. Watch the video, note the code, enter the code, earn cash. Somehow the advertisers had found a way to make this work.

The whole infrastructure ran on Amazon EC2 servers, and was managed by RightScale. Well, it was actually managed by a west coast outsourcing shop, whose specialty was managing deployments on RightScale.

The site kept its information in a MySQL database. They had various scripts to spin up slaves, remaster, switch roles and so forth. Of course MySQL can be finicky and is prone to throwing surprises your way from time to time.

One time this automation failed in a big way, switching production customers over to a database that took way, way too long to rebuild. Since their automation didn't perform checksums to bulletproof the setup, it couldn't know that all the data hadn't finished moving!

Customers sure did notice when the site fell over. Yes, this was a failure of automation. But not of the RightScale platform – rather of the outsourcing firm managing the process, whose job was to check the pieces and components and ensure the computer systems did their thing to completion. Huge fail!
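This is exactly the kind of check the automation should have made before flipping traffic. As a rough sketch of the idea – the hostnames, credentials and pymysql library are my assumptions, and a truly bulletproof setup would add table checksums (think pt-table-checksum) – a promotion script should refuse to proceed until the replica has fully caught up:

```python
import pymysql

def safe_to_promote(replica_host, user, password):
    """Refuse to fail over to a replica that hasn't finished catching up."""
    conn = pymysql.connect(host=replica_host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if status is None:
        return False  # not configured as a replica at all
    threads_ok = (status["Slave_IO_Running"] == "Yes"
                  and status["Slave_SQL_Running"] == "Yes")
    lag = status["Seconds_Behind_Master"]
    # NULL lag means replication is broken; anything > 0 means data still moving.
    return threads_ok and lag is not None and int(lag) == 0
```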

Read: Why devops talent is in short supply

Your website will fail

Sites big and small fail. Hopefully these stories illustrate that fact. I’ve said over and over why perfect availability is a pipe dream.

At the end of the day, the difference between the successful sites and the sloppy ones isn't failure and perfection. It's *how* they fail, and how they get back up on their feet. What type of planning did they do for disaster recovery, as many firms in NYC did before and after Sandy?

Also: Why startups need both devs and ops for scalability

Reducing failure

So instead of thinking about eliminating failure, let's think about *reducing* how often it happens, and when it does, reducing the fallout. One thing you can do is sign up for Scalable Startups, where we share tips once a month on the topic. Meanwhile, try to put these best practices into play.

1. Test your DR plan by running real-life fire drills (see the sketch after this list)
2. Use more than one hosting provider, data center or cloud provider
3. Give each op or end user the least privileges they need to do their job
4. Embrace a culture of caution in operations
5. Check, double check and triple check those fat fingers!
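To make item 1 concrete, a fire drill doesn't have to be elaborate. Here's a rough sketch: restore last night's dump onto a scratch server on a schedule and run a sanity check, so you learn the backup is broken before an outage teaches you. The paths, hostnames and query are stand-ins for your own.

```python
import subprocess

def fire_drill(dump="/backups/latest.sql.gz", host="scratch-db", db="drill"):
    """Prove the backup restores: load it somewhere harmless, then sanity-check."""
    subprocess.run("gunzip -c %s | mysql -h %s %s" % (dump, host, db),
                   shell=True, check=True)
    # If the dump silently lost data, a basic row count will often catch it.
    out = subprocess.run(
        ["mysql", "-h", host, "-N", "-e", "SELECT COUNT(*) FROM users", db],
        check=True, stdout=subprocess.PIPE)
    assert int(out.stdout) > 0, "restore produced an empty users table!"
```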

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don't work with recruiters

Root Cause Analysis – What is it and why is it important?

Root Cause Analysis is the means to identify the ultimate source and cause of an outage. When an outage causes serious downtime of a website, organizations are typically in crisis mode. Urgency of resolution sometimes pushes aside due process, change management and general caution. Root Cause Analysis attempts, as much as possible, to preserve logfiles, configurations, and the current state of systems for later analysis.

With traditional physical servers, hardware failure, operator error, or a security breach can cause outages. Since you're dealing with one physical machine, resolving that issue necessarily means moving around the things that broke. So caution and later analysis must be balanced against immediate problem resolution.

Another silver lining in cloud hosted solutions is around root cause analysis. If a server was breached, for example, that server can immediately be shut down while maintaining its current state as a disk or EBS snapshot. A new server can then be fired up from an AMI image, your server rebuilt from scripts or a template, and you're back up and running. Save the snapshot for later analysis.
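With current AWS tooling that whole sequence is a handful of API calls. Here's a rough sketch using boto3 (the instance, volume and AMI IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def preserve_and_replace(instance_id, volume_id, ami_id):
    """Freeze the evidence, isolate the box, bring up a clean replacement."""
    # Snapshot the suspect disk so the breached state survives for forensics.
    snap = ec2.create_snapshot(VolumeId=volume_id,
                               Description="forensics: " + instance_id)
    # Stop (don't terminate) the compromised instance.
    ec2.stop_instances(InstanceIds=[instance_id])
    # Rebuild from a known-good image; scripts or templates take it from there.
    ec2.run_instances(ImageId=ami_id, InstanceType="t3.micro",
                      MinCount=1, MaxCount=1)
    return snap["SnapshotId"]
```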

This can be used for analysis of operator-error-related outages as well. Hardware failures are more expected and common in cloud hosted environments, so this should – and really must – push adoption of best practices around infrastructure: having scripts at hand that rebuild everything from bare metal.

More discussion of root cause analysis by Sean Hull on Quora.

Offsite Backups – What are they and why are they important?

Backups are obviously an important part of any managed infrastructure deployment.  Computing systems are inherently fallible, through operator error or hardware failure.  Existing systems must be backed up, from configurations, software and media files, to the backend data store.

In a managed hosting environment or cloud hosting environment, it is convenient to use various filesystem snapshot technologies to perform backups of entire disk volumes in one go.  These are powerful, fast, reliable, and easy to execute.  In Amazon EC2 for example these EBS snapshots are stored on S3.  But what happens if your data center goes down – through network outage or power failure?  Or further what happens if S3 goes offline?  Similar failures can affect traditional managed hosting facilities as well.

This is where offsite backups come in handy. You would then be able to rebuild your application stack and infrastructure even with your entire production environment offline. That's peace of mind! Offsite backups can come in many different flavors:

  • mysqldump of the entire database, performed daily and copied to an alternate hosting facility
  • semi-synchronous replication slave in an alternate datacenter or region
  • DRBD setup – a distributed filesystem upon which your database runs
  • replicated copy of the version control repository – housing software, documentation & configurations

Offsite backups can also be coupled with a frequent sync of the binlog files (transaction logs). These, in combination with your full database dump, allow you to perform point-in-time recovery to the exact moment the outage began, further reducing potential data loss.
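Here's a minimal sketch of the first flavor: a daily mysqldump, compressed and shipped to storage in another region. The database name, bucket and regions are assumptions; syncing the binlogs would be a similar loop over the log directory.

```python
import datetime
import gzip
import subprocess

import boto3

def offsite_dump(db="production", bucket="example-offsite-backups"):
    """Daily full dump, compressed and shipped to S3 in an alternate region."""
    dump = subprocess.run(["mysqldump", "--single-transaction", db],
                          check=True, stdout=subprocess.PIPE).stdout
    key = "mysql/%s-%s.sql.gz" % (db, datetime.date.today().isoformat())
    s3 = boto3.client("s3", region_name="us-west-2")  # NOT the production region
    s3.put_object(Bucket=bucket, Key=key, Body=gzip.compress(dump))
    return key
```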

Offsite Backups – What are they – discussed on Quora by Sean Hull

Infrastructure Provisioning – What is it and why is it important?

In the old days…

You would have a closet in your startup company with a rack of computers.  Provisioning involved:

  1. Decide on your architectural direction – what, where & how
  2. Order the new hardware
  3. Wait weeks for the packages to arrive
  4. Set up the hardware, wire things together, power up
  5. Discover some component is missing or has failed, and order a replacement
  6. Wait longer…
  7. Finally get all the pieces set up
  8. Configure software components and go

Along came some industrious folks who realized power and data to your physical location weren't reliable. So datacenters sprang up. With data centers, most of the above steps didn't change, except that between steps 3 & 4 you would send your engineers out to the datacenter location. Trips back and forth ate up time and energy.

Then along came managed hosting. Managed hosting saved companies a lot of headaches, wasted man-hours, and other resources. It let your company spend more time on what it does well – running the business – and less on managing hardware and infrastructure. Provisioning now became:

  1. Decide on architecture direction
  2. Call hosting provider and talk to sales person
  3. Wait a day or two
  4. Set up & configure software components and go

Obviously this new state of affairs improved infrastructure provisioning dramatically. It simplified the process and sped it up as well. What's more, a managed hosting provider could keep spare parts and standard components on hand in much greater volume than a small firm. That's a big plus. This evolution continued because it was a win-win for everyone. The only downside was when engineers made mistakes and finger pointing began. But despite all of that, a managed hosting provider that does only hosting can do it better and more reliably than you can yourself.

So where are we in the present day? We are all either doing, or looking at, cloud provisioning of infrastructure. What's cloud provisioning? It is a complete paradigm shift, but along the same trajectory as what we've described above. Now all the waiting is gone. No waiting for a sales team or the ordering process – that's automatic. No waiting for engineers to set up the servers – they're already set up. They are allocated by your software and scripts. Even the setup and configuration of software components, the operating system and the services that run on the server – all automatic.
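To make "allocated by your software and scripts" concrete, here's a sketch using boto3: one API call provisions the server, and a user-data script configures the operating system and services on first boot, no humans involved. The AMI ID and packages are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Runs on first boot: the OS-and-services setup that used to take engineers days.
user_data = """#!/bin/bash
yum -y install httpd
systemctl enable --now httpd
"""

resp = ec2.run_instances(
    ImageId="ami-12345678",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1, MaxCount=1,
    UserData=user_data,
)
print(resp["Instances"][0]["InstanceId"])
```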

This is such a dramatic shift that we are still feeling the effects of it. Traditional operations teams have little experience with this arrangement, and perhaps little trust in virtual servers. Business units are also not used to handing the trigger on infrastructure spending over to ops teams or to scripts and software.

However, huge economic pressures, along with new operational flexibility, continue to push firms toward this new model. Gartner predicts this trend will only continue. The advantages of cloud infrastructure provisioning include:

  1. Metered payment – no huge outlay of cash for new infrastructure
  2. Infrastructure as a service – scripted components automate & reduce manual processes
  3. Devops – manage infrastructure like code, with version control and reproducibility
  4. Take unused capacity offline easily & save on those costs
  5. Disaster recovery is free – reuse the same scripts to rebuild standard components
  6. Easily meet seasonal traffic requirements – spin up additional servers instantly

On Quora Sean Hull asks – What is infrastructure provisioning and why is it important?

Business Continuity Planning – What is it and why is it important?

BCP, or BCRP if you also want to include "resiliency" in the acronym, basically outlines planning for the worst. In the old days you had a filing cabinet with documents; for example, a central government office might house birth certificates or titles and deeds. Perhaps a copy of the documents is regularly created and kept offsite in case of fire.

In the digital era, more and more companies have most of their assets in digital form. So the lifespan of those digital copies, and the contents thereof, must be protected. Since running computing systems and data centers always involves operational risk, planning for the worst is the essence of business continuity planning.

For example you may have redundant switches, routers and networking cables, but what if the whole data center goes down?  Do you have offsite backups that you’re creating regularly?  When was the last time those backups were tested with a full application restore and fire drill?

Discussed by Sean Hull on Quora – What is BCP and why is it important?

Zero Downtime – What is it and why is it important?

For most large web applications, uptime is of foremost importance. Any outage can be seen by customers as a frustration, or an opportunity to move to a competitor. What's more, for a site that also includes e-commerce, it can mean real lost sales.

Zero downtime describes a site without service interruption. To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure. If you're using cloud hosting, are you redundant across alternate availability zones and regions? Are you using geographically distributed load balancing? Do you have multiple clustered databases on the backend, and multiple load-balanced webservers?

All of these requirements will increase uptime, but may not bring you close to zero downtime. For that you'll need thorough testing. The solution is to pull the trigger on sections of your infrastructure and prove that the site fails over quickly without a noticeable outage. The ultimate test is the outage itself.

Sean Hull on Quora: What is zero downtime and why is it important?

Backups – What are they and why are they important?

Backups are obviously a crucial component in any enterprise application.  Modern internet components are prone to failure, and backups keep your bases covered.  Here’s what you should consider:

  1. Is your database backed up, including object structures, data, stored procedures, grants, and logins?
  2. Is your webserver doc-root backed up?
  3. Is your application source code in version control and backed up?
  4. Are your server configurations backed up?  Relevant config files might include those for apache, mysql, memcache, php, email (postfix or qmail), tomcat, Java solr or any other software your application requires.
  5. Are your cron or supporting scripts and jobs backed up?
  6. Have you tested all of these components and your overall documentation with a fire drill?  This is the proof that you’ve really covered all the angles.

If you do your backups right, you should be able to restore without a problem.
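As a small illustration of items 2, 4 and 5, a nightly job can sweep the doc-root, config files and supporting scripts into a single dated archive. The paths here are examples; substitute your own stack's.

```python
import datetime
import subprocess

PATHS = ["/var/www/html",                      # webserver doc-root
         "/etc/httpd", "/etc/my.cnf", "/etc/php.ini",
         "/etc/cron.d", "/usr/local/scripts"]  # configs & supporting scripts

def archive_configs(dest="/backups"):
    """One dated tarball covering checklist items 2, 4 and 5."""
    tarball = "%s/configs-%s.tar.gz" % (dest, datetime.date.today().isoformat())
    subprocess.run(["tar", "czf", tarball] + PATHS, check=True)
    return tarball
```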

Sean Hull asks on Quora – What are backups and why are they important?

What is disaster recovery and why is it important?

Disaster recovery involves the anticipation of a major business outage, and the contingency planning to avoid losses in revenue, customers or sales.

All of the technology components that make up your enterprise applications should be carefully considered against loss.  What happens if this database server disappears?  Do we have all the data backed up somewhere?  Have we tested that backup to restore it?  How long does it take to restore?  Can we reconnect the application to said database?  What if the network goes down?  How about if the whole datacenter goes out?

Planning for disaster recovery is important whether you're hosted in-house or with a hosting provider. Consider Amazon's EC2 outage in April, when various availability zones went out. Had affected customers backed up their databases properly – with offsite & tested copies – and kept other components such as webserver document roots and software configurations as well, they would have been able to rebuild their entire infrastructure in an alternate availability zone or region. Remember, it was only a small component of Amazon Web Services that was out.

Sean Hull asks on Quora: Disaster Recovery – What is it and why is it important?

iHeavy Insights 79 – Plumbing the Interwebs

I meet new people all the time. It's a way of life in New York. One of the first questions new people ask each other is "What do you do?". It begins to sound like a cliche after a while, but it can also provide endless fascinating discussions, as there are so many people with different professions in New York. Some choose a titled answer: "I'm an investment banker", "I'm an emcee", "I'm an executive recruiter". I find that "Web Scalability Consultant" or "Web Operations Expert" only leaves confused looks.

A Plumber By Another Name

The solution of course is to tell a good story. Stories illustrate what titles and crusty vernacular cannot. I've used analogies to surgeons or mechanics; of course they all operate on something people can relate to, right in front of them – people or vehicles we use every day. With the internet, though, there is a huge hidden infrastructure that most people don't see every day. They may vaguely know it's there, but it's still hidden out of sight.

That's why I think plumbing provides such an apt visual. As it turns out, the internet is built with countless data pipes both large and small, coming into your home or lying across the bottom of the Atlantic. These pipes plug into routers – high speed traffic lights and traffic cops. Ultimately they feed into datacenters, huge rooms filled with racks of computers holding your website's crown jewels. Therein live the images and status updates from your Facebook profile, the banking transactions from your personal bank account or credit card, your Netflix movie stream, and the email you sent via Gmail. Even your instant messaging stream, and the data from your favorite iPhone app, are all stored and retrieved from here.

Amazon Outage

The recent Amazon outage has been high profile enough that a lot of folks who don't follow the latest trends in web operations, devops, and datacenter automation still heard about the event. Turns out it's had a silver lining for Amazon, because now everyone is scrutinizing how many sites actually rely on this goliath of a hosting provider.

As it turns out, the root of the Amazon outage was indeed a plumbing problem. Amazon has shown rather high transparency, publishing intimate details of the problem and its resolution. Read more.

A misconfigured network cascaded through the system, creating countless failures. If you imagine water repairs being done in a large New York City building, they often ask tenants to turn off their water so it won't all come on at the same time when service is restored. Similarly intricate problems complicated the Amazon effort, slowing down attempts to restore everything after the incident. I wrote at length about the outage; if you're interested, read more.

BOOK REVIEW:  Game-Based Marketing by Zicherman & Linder

There are so many new books coming out all the time, it’s tough to sift and find the good ones.  Anyone with a website as their storefront, whether they are a product company or a services company, can gain from reading this book.

From leaderboards to frequent flyer programs, badges and more this book is full of real-world examples where game-based principles are put into action.  On the internet where attention is a rarer and rarer commodity, these concepts will surely make a big difference to your business.

Amazon book link – Game Based Marketing

Amazon EC2 Outage – Failures, Lessons and Cloud Deployments

Now that we’ve had a chance to take a deep breath after last week’s AWS outage, I’ll offer some comments of my own.  Hopefully just enough time has passed to begin to have a broader view, and put events in perspective.
Despite what some reports may have announced, Amazon wasn't down; rather, a small part of Amazon Web Services went down. A failure? Yes. Beyond their service level agreement of 99.95%? Yes also. Survivable? Yes to this last question too.

Learning From Failure

The business management conversation du jour is all about learning from failure, rather than trying to avoid it. Harvard Business Review's April issue headlined with "The Failure Issue – How to Understand It, Learn From It, and Recover From It". The Economist's April 16th issue had some similarly interesting pieces, one by Schumpeter, "Fail often, fail well", and another in the April 23rd issue, "Lessons from Deepwater Horizon and Fukushima". With all this talk of failure there is surely one takeaway: complex systems will fail, and it is in the anticipation of that failure that we gain the most. Let's stop howling and look at how to handle these situations intelligently.

How Do You Rebuild A Website?

In the cloud you will likely need two things: (a) scripts to rebuild all the components in your architecture – spin up servers, fetch source code, fetch software and configuration files, configure load balancers and mount your database – and, more importantly, (b) a database backup from which you can rebuild your current dataset.
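For (b), the restore side of the offsite backups discussed earlier might look like this sketch (the bucket, prefix and database name are assumptions): fetch the newest dump and load it into the freshly rebuilt database server.

```python
import subprocess

import boto3

def restore_latest(bucket="example-offsite-backups", prefix="mysql/",
                   db="production"):
    """Fetch the newest offsite dump and load it into the rebuilt database."""
    s3 = boto3.client("s3", region_name="us-west-2")
    # Assumes at least one backup exists under the prefix.
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
    latest = max(objects, key=lambda o: o["LastModified"])["Key"]
    s3.download_file(bucket, latest, "/tmp/restore.sql.gz")
    subprocess.run("gunzip -c /tmp/restore.sql.gz | mysql %s" % db,
                   shell=True, check=True)
```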

Want to stick with EC2? Build out your infrastructure in an alternate availability zone or region and you're back up and running in hours. Or better yet, have an alternate cloud provider on hand to handle these rare outages. The choice is yours.

Mitigate risk? Yes indeed, failure is more common in the cloud, but recovery is also easier. Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!

Want to see an extreme example of how this can play in your favor? Read Jeff Atwood's discussion of the so-called Chaos Monkey, a component whose sole job is to randomly kill off servers in the Netflix environment. Now that type of gunslinging will surely keep everyone on their toes! Here's a Wired article that discusses Chaos Monkey.
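In spirit, the whole trick fits in a few lines. Here's a toy version (the tag filter is an assumption, and Netflix's real Chaos Monkey is far more careful about schedules and opt-outs):

```python
import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def chaos_strike(group="web"):
    """Kill one random instance in the group; if users notice, failover is broken."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:group", "Values": [group]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    ids = [inst["InstanceId"]
           for res in resp["Reservations"]
           for inst in res["Instances"]]
    victim = random.choice(ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```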

George Reese of enStratus discusses the recent failure at length. Though I would argue with calling Amazon's outage the Cloud's Shining Moment, all of his points are wise, and this is the direction we should all be moving.

Going The Way of Commodity Hardware

Though it is still not obvious to everyone, I’ll spell it out loud and clear.  Like it or not, the cloud is coming.  Look at these numbers.

Furthermore the recent outage also highlights how much and how many internet sites rely on cloud computing, and Amazon EC2.
Way back in 2001 I authored a book for O'Reilly called "Oracle and Open Source". In it I discussed the technologies I was seeing in the real world: Oracle on the backend, and Linux, Apache, and PHP, Perl or some other language on the frontend. These were the technologies that startups were using. They were fast, cheap and, with the right smarts, reliable too.

Around that time Oracle started smelling the coffee and ported its enterprise database to Linux. The equation for them was simple. Customers that were previously paying tons of money to their good friend and confidant Sun for hardware could now spend one-tenth as much on hardware and shift a lot of that leftover cash to – you guessed it – Oracle! The hardware wasn't as good, but who cares, because you can get a lot more of it.

Despite a long entrenched and trusted brand like Sun being better and more reliable, guess what?  Folks still switched to commodity hardware.  Now this is so obvious, no one questions it.  But the same trend is happening with cloud computing.

Performance is variable, disk I/O can be iffy, and what’s more the recent outage illustrates front and center, the servers and network can crash at any moment.  Who in their right mind would want to move to this platform?

If that's the question you're stuck on, you're still stuck on the old model. You have not truly comprehended the power of building infrastructure with code, provisioning through automation, and really embracing the management of those components as software. As the internet itself has the ability to route around political strife and network outages, so too does cloud computing bring that power to mom & pop web shops.


  • Have existing investments in hardware?  Slow and cautious adoption makes the most sense for you.
  • Have seasonal traffic variations?  An application like this is uniquely suited to the cloud.  In fact, some gaming applications which must autoscale to 10x or 100x servers under load are newly solvable with the advent of cloud computing.
  • Currently paying a lot for disaster recovery systems that primarily lie idle?  Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.