The myth of five nines – Why high availability is overrated


Join 12,000 others and follow Sean Hull on Twitter @hullsean.

In the Internet world, 24×7 has become the de facto standard. Websites must be always on, available 24 hours a day, 365 days a year. In our pursuit of perfection, uptime is being measured down to three decimal places: being up 99.999% of the time, or in short, five-nines.

Just like a mantra, when repeated enough it becomes second nature and we don’t give the idea a second thought. We don’t stop to ask whether five-nines, while generally a good thing to have, is necessary or even realistic for the business.


In my dealings with small businesses, I’ve found that the ones that have been around longer, and that have more seasoned managers, tend to take a more flexible and pragmatic view of the five-nines standard. Some even regard outages during off hours as, *gasp*, no problem at all! On the other hand, next-big-idea startups almost universally hold that 24×7 is do or die. To them, a slight interruption in service will send the wrong signal to customers.

The sense I get is that businesses that have been around longer have more faith in their customers and are confident about what their customers want and how to deliver it. Meanwhile, startups that are still building a customer base feel the need to make an impression and are thus more sensitive to perceived limitations in their service.

Of course the type of business you run might well inform your policy here. Short outages at payment and e-commerce sites can translate directly into lost revenue, while a mobile game company might have a little more room to breathe.


Sustaining five nines is too expensive for some

The truth is that sustaining high availability at the standard of five-nines costs a lot of money. These costs are incurred from buying more servers, whether as physical infrastructure or in the cloud. In addition, you’ll likely involve more software components and more configuration complexity. And here’s a hard truth: with all that complexity also comes more risk. More moving parts means more components that can fail. Those additional components can fail from bugs, misconfiguration, or interoperability issues.

What’s more, pushing for that marginal 0.009 percentage-point increase in availability, from 99.99% to 99.999%, means you’ll require more people and create more processes.
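To put a rough number on the "more moving parts" point, here is a minimal Python sketch. It rests on a simplifying assumption that is mine, not something measured at any client: every component in the request path is required, and components fail independently of each other. Under that assumption the availability of the whole chain is the product of the individual availabilities. The function name and the component figures are hypothetical.

```python
from math import prod

def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(component_availabilities)

# Hypothetical stack: ten components, each individually at five-nines.
ten_components = [0.99999] * 10

combined = serial_availability(ten_components)
print(f"Each component: 99.999%, whole chain: {combined:.5%}")
```

With ten such components in series, the chain as a whole lands at about 99.99%, so a full "nine" is lost to component count alone. Redundancy can win some of that back, but the failover machinery is itself one more part that can break.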


Complex architecture downtime

In a client engagement back in 2011, I worked with a firm in the online education space. Their architecture was quite complex. Although they had web servers and database servers, the standard internet stack, they did not have standardized operations. They ran the Apache web server on some boxes and Nginx on others. What’s more, they ran different versions of each, as well as different distributions of Linux, from Ubuntu to Red Hat Enterprise Linux. On the database side they had instances scattered across various boxes, and since these weren’t centralized, not all of them were being backed up. During one simple maintenance operation, a couple of configurations were rearranged, bringing the site down and blocking e-commerce transactions for over an hour. It wasn’t a failure of technology but a failure of people and processes, made worse by the hazard of an overly complex infrastructure.

In another engagement at a financial media firm, I worked closely with the CTO outlining how we could architect an absolutely zero downtime infrastructure.  When he warned that “We have no room for *ANY* downtime,” alarm bells were ringing in my head already.


When I hear talk of five-nines, I hear marketing rhetoric, not real-world risk reduction. Take for example the power grid outage that hit the Northeast in 2003. That took out power across large swaths of the country for over 24 hours. In real terms that means anyone hosted in the Northeast failed five-nines miserably: the standard allows only about 5.26 minutes of downtime per year, so a 24-hour outage uses up roughly 274 years’ worth of that allowance!
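The arithmetic behind that comparison is easy to check. The sketch below is only an illustration: the function names are made up for this post, and it assumes a 365-day year with no leap days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year allowed at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

def years_of_budget_consumed(outage_minutes, availability_pct):
    """How many years of downtime allowance a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability_pct)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {downtime_budget_minutes(pct):.2f} minutes of downtime per year")

blackout_minutes = 24 * 60  # a 24-hour outage
years = years_of_budget_consumed(blackout_minutes, 99.999)
print(f"A 24-hour outage uses about {years:.0f} years of five-nines allowance")
```

At five-nines the yearly allowance works out to about 5.26 minutes, which is why a single day-long blackout wipes out centuries of budget.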

For true high availability look at better management of processes

So what can we do in the real world to improve availability? Some of the biggest gains will come from reducing so-called operator error: the mistakes of people and processes.

Before you think of aiming for five-nines, first ask some of these questions:

o Do you test servers?
o Do you monitor logfiles?
o Do you have network-wide monitoring in place?
o Do you verify backups?
o Do you monitor disk partitions?
o Do you watch load average?
o Do you monitor your server system logs for disk errors and warnings?
o Do you watch disk subsystem logs for errors? (disks are the hardware component most likely to fail)
o Do you have server analytics?  Do you collect server system metrics?
o Do you perform fire drills?
o Have you considered managed hosting?

If you’re thinking about and answering these questions you’re well on your way to improving availability and uptime.
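Two of those questions, monitoring disk partitions and watching load average, can be answered with very little code. What follows is a minimal sketch using only the Python standard library, not a replacement for a real monitoring system. The thresholds and function names are assumptions chosen for illustration, and os.getloadavg() is only available on Unix-like systems.

```python
import os
import shutil

DISK_LIMIT_PCT = 90.0      # hypothetical threshold: warn at 90% full
LOAD_LIMIT_PER_CPU = 1.0   # hypothetical threshold: warn at one runnable task per CPU

def disk_used_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def load_per_cpu():
    """One-minute load average divided by CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)

def evaluate(disk_pct, load):
    """Return a list of warning strings for readings at or above the thresholds."""
    warnings = []
    if disk_pct >= DISK_LIMIT_PCT:
        warnings.append(f"disk {disk_pct:.1f}% full")
    if load >= LOAD_LIMIT_PER_CPU:
        warnings.append(f"load per CPU {load:.2f}")
    return warnings

if __name__ == "__main__":
    for warning in evaluate(disk_used_percent("/"), load_per_cpu()):
        print("WARNING:", warning)
```

Run from cron every few minutes, something this small already catches the full-disk and runaway-load cases. Keeping the threshold logic in evaluate(), separate from the readings, makes it easy to test without touching a real server.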




  • Marko Manninen

    Related to time table shown on picture, some might be interested to look at different numbers used for similar clock:

    • Sean Hull

      Thx Marko. Great graphics!

  • Annette with Windward

    Thank you for pointing out the holes in five nines.

    As a small company Windward cannot afford the bigger solutions to keep our website up. We tasked one of our engineers with finding a solution. He did, using three providers.

    Full disclosure, we are not affiliated with any of them. We’re just so thrilled that it is working that we put in a pay-it-forward type of webinar:

    • Sean Hull

      Thx Annette.

  • angelos

    A year has 365*24*60 = 525600 minutes. 99.9% uptime means that the downtime may be 525.6 minutes, or approx 9 hours, in a year; at 99.99% it is a bit less than an hour (52.56 minutes); and at 99.999%, you guessed it, 5.256 minutes per year. Quite frankly, for today’s networked applications, where the errors come mostly from the network, maintaining five nines is just plain wishful thinking and wasteful.

    • Sean Hull

      Yes my sentiments exactly.

  • Stuart Buckell

    Sorry, I disagree with Angelos and Sean – five 9’s is not wasteful or wishful thinking at all. It is a combination of intelligent software design (SOA), and simple architectural design. In fact, the goal of achieving five 9’s forces the system and configuration to be more simple, not more complex. With the wide availability of MySQL cluster, cheap managed DNS, and open source load balancing tools, ensuring no single point of failure is no longer only in the hands of the enterprise with big budgets.

    • Sean Hull

      I hear you Stuart. It’s those pesky blackouts that throw a wrench into the works!

  • Alec Lazarescu

    Good post to kick off discussion. I would say not so much that high availability is overrated, but rather that it is misunderstood. There are real cases where there’s value in the extra work. You did cover in the details more about the trade-offs and complexity risk to be considered. I had some similar conclusions and posted on my blog a fun anecdote about a failed attempt at high availability I ran into:

    • Sean Hull

      Thx for comments Alec. Just took a look at your post. Good stuff !