The myth of five nines – Why high availability is overrated

Join 12,000 others and follow Sean Hull on Twitter @hullsean.

In the Internet world 24×7 has become the de facto standard. Websites must be always on, available 24 hours a day, 365 days a year. In our pursuit of perfection, availability is being measured down to three decimal places: being up 99.999% of the time, or five-nines for short.

Like any mantra, when it is repeated enough it becomes second nature and we don’t give the idea a second thought. We don’t stop to ask whether five-nines, while generally a good thing to have, is actually necessary and realistic for the business.

In my dealings with small businesses, I’ve found that the ones that have been around longer, and have more seasoned managers, tend to take a more flexible and pragmatic view of the five-nines standard. Some even regard outages during off hours as, *gasp*, no problem at all! On the other hand, next-big-idea startups hold it as a universal truth that 24×7 is do or die. To them, even a slight interruption in service will send the wrong signal to customers.

The sense I get is that businesses that have been around longer have more faith in their customers and are confident about what those customers want and how to deliver it. Meanwhile, startups that are still building a customer base feel the need to make an impression and are thus more sensitive to perceived limitations in their service.

Of course the type of business you run might well inform your policy here. Short outages in payments and e-commerce sites could translate into lost revenue while perhaps a mobile game company might have a little more room to breathe.

Sustaining five nines is too expensive for some

The truth is that sustaining high availability at the standard of five-nines costs a lot of money. These costs are incurred from buying more servers, whether as physical infrastructure or in the cloud. In addition you’ll likely involve more software components and more configuration complexity. And here’s a hard truth: with all that complexity also comes more risk. More moving parts means more components that can fail, whether from bugs, misconfiguration, or interoperability issues.

What’s more, pushing for that marginal 0.009% increase in availability (from 99.99% to 99.999%) means you’ll require more people and create more processes.
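
To put a rough number on the "more moving parts" point, here is a minimal sketch of how availability compounds when every component in a chain must be up for the site to work. It assumes failures are independent, which real systems rarely satisfy exactly, and the five-component stack and its 99.9% figures are hypothetical values chosen for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of each other.
    """
    return prod(component_availabilities)


# Hypothetical stack: five components, each individually at 99.9%.
stack = [0.999] * 5
overall = serial_availability(stack)
print(f"Overall availability: {overall:.4%}")
```

Under these assumptions, five components that each manage three nines deliver only about 99.5% together, so every extra dependency in the request path has to be better than the target just to stand still.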

Complex architecture downtime

In a client engagement back in 2011, I worked with a firm in the online education space. Their architecture was quite complex. Although they had web servers and database servers (the standard internet stack), they did not have standardized operations. They ran the Apache web server on some boxes and Nginx on others. What’s more, they had different versions of each, as well as different distributions of Linux, from Ubuntu to Red Hat Enterprise Linux. On the database side they had instances scattered across various boxes, and since these weren’t centralized, not all of them were being backed up. During one simple maintenance operation, a couple of configurations were rearranged, bringing the site down and blocking e-commerce transactions for over an hour. It wasn’t a failure of technology but a failure of people and processes, made worse by the hazard of an overly complex infrastructure.

In another engagement at a financial media firm, I worked closely with the CTO, outlining how we could architect an absolutely zero downtime infrastructure. When he warned that “We have no room for *ANY* downtime,” alarm bells were already ringing in my head.

When I hear talk of five-nines, I hear marketing rhetoric, not real-world risk reduction. Take for example the power grid outage that hit the Northeast in 2003. It took out power across large swaths of the country for over 24 hours. Five-nines allows only about five minutes of downtime per year, so anyone hosted in the Northeast failed the standard miserably: a single 24-hour outage uses up roughly 274 years’ worth of downtime at five-nines!
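
The arithmetic behind that comparison is easy to check. The sketch below computes the yearly downtime budget for a given availability target and how many years of that budget a single outage consumes. The function names are my own, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability):
    """Minutes of downtime allowed per year at the given availability (0 to 1)."""
    return (1 - availability) * MINUTES_PER_YEAR


def years_of_budget_consumed(outage_minutes, availability):
    """How many years of downtime budget a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability)


five_nines = 0.99999
budget = downtime_budget_minutes(five_nines)
years = years_of_budget_consumed(24 * 60, five_nines)
print(f"Five-nines budget: {budget:.2f} minutes per year")
print(f"A 24-hour outage consumes about {years:.0f} years of that budget")
```

At five-nines the budget comes to a little over five minutes a year, so a 24-hour outage works out to about 274 years of allowance.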

For true high availability look at better management of processes

So what can we do in the real world to improve availability? Some of the biggest gains will come from reducing so-called operator error: the mistakes of people and processes.

Before you think of aiming for five-nines, first ask some of these questions:

o Do you test servers?
o Do you monitor logfiles?
o Do you have network wide monitoring in place?
o Do you verify backups?
o Do you monitor disk partitions?
o Do you watch load average?
o Do you monitor your server system logs for disk errors and warnings?
o Do you watch disk subsystem logs for errors? (the most likely component in hardware to fail is a disk)
o Do you have server analytics?  Do you collect server system metrics?
o Do you perform fire drills?
o Have you considered managed hosting?
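
Two of the checks in the list above, disk partitions and load average, can be sketched in a few lines. This is a minimal illustration rather than a replacement for a real monitoring system: the 90% disk and 1.0 load-per-CPU thresholds are arbitrary assumptions, the function names are hypothetical, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil


def disk_used_fraction(path="/"):
    """Fraction (0 to 1) of the partition containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def load_per_cpu():
    """One-minute load average divided by the number of CPUs (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def find_problems(disk_fraction, load, max_disk=0.90, max_load=1.0):
    """Return a warning string for each value that exceeds its threshold."""
    problems = []
    if disk_fraction > max_disk:
        problems.append(f"disk {disk_fraction:.0%} full")
    if load > max_load:
        problems.append(f"load per CPU {load:.2f}")
    return problems


if __name__ == "__main__":
    for problem in find_problems(disk_used_fraction("/"), load_per_cpu()):
        print("WARNING:", problem)
```

Run from cron every few minutes and wired to email or a pager, even a script this small catches a full partition before it takes the site down.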

If you’re thinking about and answering these questions you’re well on your way to improving availability and uptime.

Newsletter 72 – Don't Over Engineer

It’s not a caution you hear very often, but it is as worthy now as ever. Don’t over-engineer solutions to your problems, features in your product, moving parts in your infrastructure, or solutions for your customers.

Five Levels of Settings

Years ago during the dot-com boom we were involved with a project for a financial services firm. We were building out a subscription-based web service for them. As part of the requirements gathering, we discussed various components and features that the site should have. Their vision included various levels of settings and customizations that the user could control to filter and tune presentation. It all appeared very Rube Goldbergian to us.

Our suggestion was to drastically reduce the initial settings, simplify the process, shorten development, and then find out what real customers wanted. But the client is always right, so we went ahead and built out the complex settings scheme. Once the product rolled out, however, customers had very different usage patterns than either our development team or the client had expected. Their demands in turn drove the product in a different direction, one that matched real-world requirements only the customers ultimately understood.

Features But No Customers

A colleague of mine is in the process of building out software for a web service.  Currently they’ve built version one of the product as they envisioned.  They have no customers.  They’re in the process of reviewing the product and deciding on a second round of new features to add.  Let me repeat the earlier part – they have no customers.

I asked my colleague, “Why not launch with it as it is, and then see what your customers want?” Their response: “We don’t want to launch until we’re ready.”

Well, chances are you’ll never be ready, because you don’t know where you’re going. You can’t really know where you’re going until your customers tell you.

Zero Percent Downtime

I’ve spoken with many managers and CEOs about infrastructure and architecture over the years. I remember one instance where I was asking about expectations and downtime. The manager’s response: “We want zero downtime.” Well, that’s not necessarily practical in the real world. The answer came back: “Well, let’s put everything in place we can to get as close to that as we can.”

The real world is messier. Data centers have power outages. In fact the East Coast had over a day of power loss. Averaged over the past thirty years, that is roughly one hour per year. Adding more components, more software to detect anomalies, and more redundancy behind redundancy has its own commensurate costs. At a certain point you have to err on the side of simplicity, as ultimately the complexity of the system itself contributes to outages and downtime.

Evolution of a Company

iPhone applications are everywhere these days. A couple of years ago we were gathering feedback and opinions from experts and investors about a concept we had for a venue and event management platform. A colleague put us in touch with the CEO of a company building an iPhone platform. After discussing our concept, he explained that they had started out with a very similar one. But over the past two years their company had evolved quite a bit from that starting point. They simply responded to their customers’ requests for features, and grew organically from there.

Conclusion

It’s not easy to engineer a perfect widget from the start. You usually don’t know what customers want, or how they will use your product or service. Nor do you always know what the real world will deliver. So it is very easy to over-engineer and miss the mark. Better to build less, build small, and release early. Then let your customers and real-world demands decide your next move.

Book Review – The Four Agreements – A Practical Guide To Personal Wisdom by Don Miguel Ruiz

This little book is full of some very big ideas.

  1. Be Impeccable With Your Word.
  2. Don’t Take Anything Personally.
  3. Don’t Make Assumptions.
  4. Always Do Your Best.

Whether you find personal insights, or help in your business relationships, this book will surely give you some fresh perspectives.