The myth of five nines – Why high availability is overrated

nine_clock

Join 12,000 others and follow Sean Hull on Twitter @hullsean.

In the Internet world 24×7 has become the de facto standard. Websites must be always on, available 24 hours a day, 365 days a year. In our pursuit of perfection, performance is being measured down to three decimal places, that is being up 99.999% of the time; in short, five-nines

Just like a mantra, when repeated enough it becomes second nature and we don’t give the idea a second thought. We don’t stop to consider that while it may be generally a good thing to have, is five-nines necessary and is it realistic for the business?

Also: How to hire a developer that doesn’t suck

In my dealings with small businesses, I’ve found that the ones that have been around longer, and with more seasoned managers tend to take a more flexible and pragmatic view of the five-nines standard. Some even feel that periods of outages during off hours as – *gasp* – no problem at all! On the other hand it is a universal truth held by the next-big-idea startups that 24×7 is do or die. To them, a slight interruption in service will send the wrong signal to customers.

The sense I get is that businesses that have been around longer have more faith in their customers and are confident about what their customers want and how to deliver it.  Meanwhile startups who are building a customer base feel the need to make an impression and are thus more sensitive to perceived limitations in their service.

Of course the type of business you run might well inform your policy here. Short outages in payments and e-commerce sites could translate into lost revenue while perhaps a mobile game company might have a little more room to breathe.

Related: Why generalists are better at scaling the web

Sustaining five nines is too expensive for some

The truth is sustaining high availability at the standard of five-nines costs a lot of money. These costs are incurred from buying more servers, whether as physical infrastructure or in the cloud. In addition you’ll likely involve more software components and configuration complexity. And here’s a hard truth, with all that complexity also comes more risk.  More moving parts means more components that can fail. Those additional components can fail from bugs, misconfiguration, or interoperability issues.

What’s more, pushing for that marginal 0.009% increase in high availability means you’ll require more people and create more processes.

Read this: Why reddit didn’t have to fail

Complex architecture downtime

In a client engagement back in 2011, I worked with a firm in the online education space.  Their architecture was quite complex.  Although they had web servers and database servers—the standard internet stack—they did not have standardized operations.  So they had the Apache web server on some boxes, and Nginx on others.  What’s more they had different versions of each as well as different distributions of Linux, from Ubuntu to RedHat Enterprise Edition.  On the database side they had instances on various boxes, and since they weren’t all centralized they were not all being backed up.  During one simple maintenance operation, a couple of configurations were rearranged, bringing the site down and blocking e-commerce transactions for over an hour.  It wasn’t a failure of technology but a failure of people and processes made worse by the hazard of an overly complex infrastructure.

In another engagement at a financial media firm, I worked closely with the CTO outlining how we could architect an absolutely zero downtime infrastructure.  When he warned that “We have no room for *ANY* downtime,” alarm bells were ringing in my head already.

Also: Why RDS doesn’t support Maria DB or Percona

When I hear talk of five-nines, I hear marketing rhetoric, not real-world risk reduction.   Take for example the power grid outage that hit the Northeast in 2003.  That took out power from large swaths of the country for over 24 hours.  In real terms that means anyone hosted in the Northeast failed five-nines miserably because downtime for 24 hours would be almost 300 years of downtime at the five-nines standard!

For true high availability look at better management of processes

So what can we do in the real-world to improve availability?  Some of the biggest impacts will come from reducing so-called operator error, and mistakes of people and processes.

Before you think of aiming for five-nines,  first ask some of these questions:

o Do you test servers?
o Do you monitor logfiles?
o Do you have network wide monitoring in place?
o Do you verify backups?
o Do you monitor disk partitions?
o Do you watch load average?
o Do you monitor your server system logs for disk errors and warnings?
o Do you watch disk subsystem logs for errors? (the most likely component in hardware to fail is a disk)
o Do you have server analytics?  Do you collect server system metrics?
o Do you perform fire drills?
o Have you considered managed hosting?

If you’re thinking about and answering these questions you’re well on your way to improving availability and uptime.

Read this: Top MySQL interview questions for DBAs, hiring managers & recruiters

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample