The myth of five nines – Why high availability is overrated


Join 12,000 others and follow Sean Hull on Twitter @hullsean.

In the Internet world 24×7 has become the de facto standard. Websites must be always on, available 24 hours a day, 365 days a year. In our pursuit of perfection, availability is being measured down to three decimal places, that is, being up 99.999% of the time; in short, five-nines.

Just like a mantra, when repeated enough it becomes second nature and we don’t give the idea a second thought. We don’t stop to ask whether five-nines, while generally a good thing to have, is necessary or even realistic for the business.


In my dealings with small businesses, I’ve found that the ones that have been around longer, with more seasoned managers, tend to take a more flexible and pragmatic view of the five-nines standard. Some even see outages during off hours as – *gasp* – no problem at all! On the other hand, next-big-idea startups hold it as a universal truth that 24×7 is do or die. To them, even a slight interruption in service will send the wrong signal to customers.

The sense I get is that businesses that have been around longer have more faith in their customers and are confident about what their customers want and how to deliver it. Meanwhile, startups that are still building a customer base feel the need to make an impression and are thus more sensitive to perceived limitations in their service.

Of course the type of business you run might well inform your policy here. Short outages in payments and e-commerce sites could translate into lost revenue while perhaps a mobile game company might have a little more room to breathe.


Sustaining five nines is too expensive for some

The truth is that sustaining high availability at the five-nines standard costs a lot of money. These costs come from buying more servers, whether as physical infrastructure or in the cloud. In addition, you’ll likely involve more software components and more configuration complexity. And here’s a hard truth: with all that complexity comes more risk. More moving parts means more components that can fail, whether from bugs, misconfiguration, or interoperability issues.
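To put a rough number on the "more moving parts" point, here is a minimal sketch. It assumes every request needs all components in a chain to be up and that failures are independent, which real systems only approximate. The five-component stack and the 99.99% figures are made up for illustration.

```python
from math import prod


def chain_availability(component_availabilities):
    """Availability of a request path that needs every component to be up.

    Assumes component failures are independent of one another.
    """
    return prod(component_availabilities)


if __name__ == "__main__":
    # Hypothetical stack: load balancer, web server, app server, cache, database,
    # each one individually available 99.99% of the time.
    stack = [0.9999] * 5
    print(f"whole stack: {chain_availability(stack):.4%}")
```

Under these assumptions, five components that each hit four nines give a combined figure of about 99.95%, worse than any single one of them. Redundancy can pull the number back up, but it adds components of its own.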

What’s more, pushing for that marginal 0.009% increase in availability (from 99.99% to 99.999%) means you’ll require more people and create more processes.


Complex architecture downtime

In a client engagement back in 2011, I worked with a firm in the online education space. Their architecture was quite complex. Although they had web servers and database servers, the standard internet stack, they did not have standardized operations. They ran the Apache web server on some boxes and Nginx on others. What’s more, they had different versions of each, as well as different distributions of Linux, from Ubuntu to Red Hat Enterprise Linux. On the database side they had instances on various boxes, and since these weren’t centralized, not all of them were being backed up. During one simple maintenance operation, a couple of configurations were rearranged, bringing the site down and blocking e-commerce transactions for over an hour. It wasn’t a failure of technology but a failure of people and processes, made worse by the hazard of an overly complex infrastructure.

In another engagement at a financial media firm, I worked closely with the CTO outlining how we could architect an absolutely zero downtime infrastructure.  When he warned that “We have no room for *ANY* downtime,” alarm bells were ringing in my head already.


When I hear talk of five-nines, I hear marketing rhetoric, not real-world risk reduction. Take for example the power grid outage that hit the Northeast in 2003. That took out power from large swaths of the country for over 24 hours. In real terms that means anyone hosted in the Northeast failed five-nines miserably: the five-nines standard allows only about five minutes of downtime per year, so 24 hours of downtime uses up more than 270 years’ worth of that allowance!
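The arithmetic behind that comparison is easy to check. The sketch below assumes an average year of 365.25 days and treats "N nines" as an availability of 1 minus 10 to the power of minus N. The function names are mine, chosen for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap years included


def allowed_downtime_minutes(nines):
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def years_of_budget_used(outage_minutes, nines):
    """How many years of downtime budget a single outage consumes."""
    return outage_minutes / allowed_downtime_minutes(nines)


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):8.1f} minutes per year")
    used = years_of_budget_used(24 * 60, 5)
    print(f"A 24-hour outage uses about {used:.0f} years of five-nines budget")
```

Five-nines works out to a little over five minutes of downtime per year, so a single day-long outage burns through well over two centuries of budget.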

For true high availability look at better management of processes

So what can we do in the real world to improve availability? Some of the biggest gains will come from reducing so-called operator error: the mistakes of people and processes.

Before you think of aiming for five-nines,  first ask some of these questions:

o Do you test servers?
o Do you monitor logfiles?
o Do you have network-wide monitoring in place?
o Do you verify backups?
o Do you monitor disk partitions?
o Do you watch load average?
o Do you monitor your server system logs for disk errors and warnings?
o Do you watch disk subsystem logs for errors? (the most likely component in hardware to fail is a disk)
o Do you have server analytics?  Do you collect server system metrics?
o Do you perform fire drills?
o Have you considered managed hosting?

If you’re thinking about and answering these questions you’re well on your way to improving availability and uptime.
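A couple of the questions above, watching disk partitions and load average, can be answered with a few lines of script. This is only a minimal sketch using the Python standard library; the thresholds (90% full, a load of 1.5 per CPU) are illustrative assumptions, not recommendations, and a real setup would feed a monitoring system rather than print to a terminal.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_host(paths=("/",), disk_limit=90.0, load_limit_per_cpu=1.5):
    """Return a list of human-readable warnings; an empty list means all clear."""
    warnings = []
    for path in paths:
        pct = disk_usage_percent(path)
        if pct >= disk_limit:
            warnings.append(f"{path} is {pct:.0f}% full")
    try:
        # os.getloadavg() exists on Unix-like systems only.
        load_1min = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1min = None
    if load_1min is not None:
        cpus = os.cpu_count() or 1
        if load_1min / cpus >= load_limit_per_cpu:
            warnings.append(f"1-minute load average is {load_1min:.2f} on {cpus} CPUs")
    return warnings


if __name__ == "__main__":
    for warning in check_host():
        print("WARNING:", warning)
```

Run from cron every few minutes, something like this catches the slow-motion failures, such as a partition quietly filling up, that tend to cause outages nobody planned for.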


How to hire a developer that doesn’t suck

Strip by Randall Munroe; xkcd.com

First things first. This is not meant to be a beef against developers. But let’s not ignore the elephant in the living room that is the divide between brilliant code writers and the risk-averse operations team.


It is almost by default that developers are disruptive with their creative coding while the guys in operations, those who deploy the code, constantly cross their fingers in the hope that application changes won’t tilt the machine. And when you’re woken up at 4am to deal with an outage or your sluggish site is costing millions in losses, the blame game and finger-pointing start.


If you manage a startup you may be faced with this problem all the time. You know your business, you know what you’re trying to build but how do you find people who can help you build and execute your ideas with minimal risk?


Ideally, you want people who can bridge the mentality divide between the programmers eager to see feature changes, the business units pushing for them, and the operations team resistant to changes for the sake of stability.

DevOps – Why can’t we all live together?

The DevOps movement is an attempt to bring all these folks together. For instance, giving developers insight into the implications of their work on performance and availability helps them better balance the onslaught of feature demands from users with the business’ need for up-time.


Operations teams can work to expose operational data to the development teams. Metrics collection and analytics aren’t just for the business units anymore. Employing tools like Cacti, OpenNMS or Ganglia allows you to communicate with developers and other business units alike about up-time, the impact of deployments on site availability, and ultimately the bottom line.

Above all, business goals and customer needs should underscore everything the engineering team is doing. Bringing all three to the table makes for a more cohesive approach that will carry everyone forward.

How to spot a DevOps person – Finding the sweet spot

The DevOps person is someone with the right combination of skill, knowledge and experience that places him or her in the sweet spot where quality assurance, programming skills and operations overlap.
There are also a few distinguishing characteristics that will help identify such an ideal candidate.

We also wrote a more general piece – What is Devops and why is it important.

Look for good writers and communicators

Imagine the beads of sweat forming when a developer tells you: “We’ve made the changes. Nothing is broken yet.”
This is like stepping on glass because it implies something will actually break. The point is that savvy developers should be aware that most people do not think along the same lines as they do.
Assuming your candidate has all the required technical skills, a programmer with writing skills tends to be better at articulating ideas and methods coherently. He or she would also be less resistant to documentation and able to step back somewhat from the itty-bitty details. Communication, after all, is at the core of the DevOps culture, where different sides attempt to understand each other.


Pick good listeners

Even rarer than good writers are good listeners. Being able to hear what someone else is saying, and restate it in their own terms, is a key quality. In our example, the good listener would probably have translated “nothing is broken yet” into “the app is running smoothly. We didn’t encounter any interruptions but we’ll keep watch on things.”

Lean towards pragmatists and avoid the fanatics

We all want people who are passionate about something but when that passion morphs into fanaticism it can be unpleasant. Fanaticism suggests a lower propensity to compromise. Such characters are very difficult to negotiate with. Analogously in tech, we see people latch on to a certain standard with unquestioning loyalty that’s bafflingly irrational. Someone who has had their hand in many different technologies is more likely to be technology agnostic, or rather, pragmatic. They’ll also have a broader perspective, and are able to anticipate how those technologies will play together.  Furthermore a good sense of where things will run smoothly and where there will be friction is vital.


Pay attention to extra-curricular activities

Look at technology interests, areas of study, or even outside interests. Does the person have varying interests and can converse about different topics?  Do they tell stories, and make analogies from other disciplines to make a point?  Do they communicate in jargon-free language you can understand?

Sniff out those hungry for success

As with any role, finding someone who is passionate and driven is important. Are they on time for appointments? Did they email you the information you requested? Are they prepared and communicative? Are they eager to get started?

Hiring usually focuses on skills and very well-crafted résumés, so why do you still find some duds now and then? By emphasizing personality, work ethic, and the ability to work with others, you can sift through the deluge of candidates and separate the wheat from the chaff, finding the qualities that will serve your business better in the long run.

 

Service Monitoring – What is it and why is it important?

Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything.  That’s why automated monitoring is so important.

So what should you monitor?  You can divide up your monitoring into a couple of strategic areas.  Just as with metrics collection, there is business & application level monitoring and then there is lower level system monitoring which is also important.

Business & Application Monitoring

  • If a user is getting an error page or cannot connect
  • If an e-commerce  transaction is failing
  • General service outages
  • If a business goal is met – or not
  • Page timeouts or slowness
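Several of these checks (error pages, failed connections, timeouts and slowness) can be covered by a simple HTTP probe. Here is a minimal sketch using only the Python standard library. The two-second "slow" threshold, the five-second timeout and the status labels are assumptions for illustration, and the URL in the usage line is a placeholder.

```python
import time
import urllib.error
import urllib.request


def classify(status, elapsed_seconds, slow_after=2.0):
    """Turn an HTTP status (None if no response) and a timing into a label."""
    if status is None:
        return "DOWN"   # could not connect, or timed out
    if status >= 400:
        return "ERROR"  # the user would see an error page
    if elapsed_seconds > slow_after:
        return "SLOW"
    return "OK"


def probe(url, timeout=5.0):
    """Fetch `url` once and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx or 5xx
    except OSError:
        status = None      # covers URLError, refused connections and timeouts
    return classify(status, time.monotonic() - start)


if __name__ == "__main__":
    print(probe("https://www.example.com/"))
```

A probe like this only tells you what one request saw from one location. It is a starting point for the checks listed above, not a replacement for a proper monitoring tool.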

Systems Level Monitoring

  • Backups completed successfully
  • Error logs from database, webserver & other major services like email
  • Database replication is running
  • Webserver timeouts
  • Database timeouts
  • Replication failures – via error logs & checksum checks
  • Memory, CPU, Disk I/O, Server load average
  • Network latency
  • Network security
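As one example from this list, a "did last night's backup complete" check can be sketched in a few lines. This version assumes the backup lands as a single file at a known path and treats "exists, is not empty, and is less than 24 hours old" as success. Those thresholds and the path in the usage line are hypothetical.

```python
import os
import time


def backup_problems(path, max_age_hours=24.0, min_bytes=1, now=None):
    """Return a list of problems with the backup file; empty means it looks fine."""
    now = time.time() if now is None else now
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return ["backup file is missing"]
    problems = []
    if info.st_size < min_bytes:
        problems.append("backup file is smaller than expected")
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup is {age_hours:.0f} hours old")
    return problems


if __name__ == "__main__":
    # Hypothetical location; substitute wherever your backups are written.
    for problem in backup_problems("/var/backups/db/latest.sql.gz"):
        print("WARNING:", problem)
```

Note that this only confirms a backup was written recently. It says nothing about whether the backup can be restored, which still takes a periodic test restore to find out.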

Tools that can perform this type of monitoring include Nagios.


Devops – What is it and why is it important?

Devops is one of those fancy contractions that tech folks just love.  One part development or developer, and another part operations.  It imagines a blissful marriage where the team that develops software and builds features that fit the business, works closely and in concert with an operations and datacenter team that thinks more like developers themselves.

In the long tradition of technology companies, two separate cultures comprise these two roles.  Developers, focused on development languages, libraries, and functionality that match the business requirements keep their gaze firmly in that direction.  The servers, network and resources those components of software are consuming are left for the ops teams to think about.

So too, ops teams are squarely focused on uptime, resource consumption, performance, availability, and always-on.  They will be the ones woken up at 4am if something goes down, and are thus sensitive to version changes, unplanned or unmanaged deployments, and resource heavy or resource wasteful code and technologies.

Lastly there are the QA teams tasked with quality assurance, testing, and making sure the ongoing stream of features doesn’t break anything previously working or introduce new show stoppers.

Devops is a new and, I think, growing area where the three teams work more closely together. But devops also speaks to the emerging area of cloud deployments, where servers can be provisioned with command-line API calls and completely scripted. In this new world, infrastructure components all become components in software, and thus infrastructure itself, long the domain of manual processes and labor-intensive tasks, becomes repeatable and amenable to the techniques of good software development. Suddenly version control, configuration management, and agile development methodologies can be applied to operations, bringing a whole new level of professionalism to deployments.
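To make "infrastructure as software" a little more concrete, here is a toy sketch of the idea behind most configuration management tools: describe the state you want, compare it with what is actually running, and derive the steps to close the gap. The role names and counts are invented, and no real cloud API is called. This shows only the planning step.

```python
def plan(desired, actual):
    """Return the actions that would bring `actual` in line with `desired`.

    Both arguments map a role name (e.g. "web") to a number of servers.
    """
    actions = []
    for role in sorted(set(desired) | set(actual)):
        want = desired.get(role, 0)
        have = actual.get(role, 0)
        if want > have:
            actions.append(("provision", role, want - have))
        elif want < have:
            actions.append(("decommission", role, have - want))
    return actions


if __name__ == "__main__":
    # Hypothetical inventory: what we want (kept in version control)
    # versus what is currently running.
    desired = {"web": 4, "db": 2}
    actual = {"web": 2, "db": 2, "cache": 1}
    for action, role, count in plan(desired, actual):
        print(f"{action} {count} x {role}")
```

Because the desired state is just data, it can live in version control and be reviewed, diffed and rolled back like any other code change.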


iHeavy Insights 77 – What Consultants Do

 

What Do Consultants Do?

Consultants bring a whole host of tools and experiences to bear on solving your business problems.  They can fill a need quickly, look in the right places, reframe the problem, communicate and get teams working together, and bring to light problems on the horizon. And they tell stories of challenges they faced at other businesses, and how they solved them.

Frame or Reframe The Problem

Oftentimes businesses see the symptoms of a larger problem, but not the cause.  Perhaps their website is sluggish at key times, causing them to lose customers.  Or perhaps it is locking up inexplicably.  Framing the problem may involve identifying the bottleneck and pointing to a particular misconfigured option in the database or webserver.  Or it may mean looking at the technical problem you’ve chosen to solve and asking if it meets or exceeds what the business needs.

Tell Business Stories

Clients often have a collection of technologies and components in place to meet their business needs.  But day-to-day running of a business is ultimately about bringing a product or service to your customer.  Telling stories of challenges and solutions of past customers, helps illustrate, educate, and communicate problems you’re facing today.

Fill A Need Quickly

If you have an urgent problem, and your current staff is over extended, bringing in a consultant to solve a specific problem can be a net gain for everyone.  They get up to speed quickly, bring fresh perspectives, and review your current processes and operations.  What’s more they can be used in a surgical way, to augment your team for a short stint.

Get Teams Communicating

I’ve worked at quite a number of firms over the years and been tasked with solving a specific technical problem, only to find the problem was a people problem to begin with. In some cases the firm already has the knowledge and expertise to solve a problem, but some members are blocking. This can be because some folks feel threatened by a new solution which will take away responsibilities they formerly held. Or it can be because they feel some solution will create new problems which they will then be responsible for cleaning up. In either case, bridging the gap between business needs and the operations teams that serve those needs can mean communicating to each team in ways that make sense to them. A technical, detail-oriented focus makes most sense when working with the engineering teams, and a business and bottom-line focus when communicating with the management team.

Highlight Or Bring To Light Problems On Horizon

Is our infrastructure a ticking timebomb?  Perhaps our backups haven’t been tested and are missing some crucial component?  Or we’ve missed some security consideration, left some password unset, left the proverbial gate open to the castle.  When you deal with your operations on a day-to-day basis, little details can be easy to miss.  A fresh perspective can bring needed insight.

BOOK REVIEW – Jaron Lanier – You Are Not a Gadget

Lanier is a programmer, musician, widely regarded as a father of VR going back to the 1980s, and a wide-ranging thinker on topics in computing and the internet.

His new book is a great, if at times meandering, read on technology, programming, schizophrenia, inflexible design decisions, Marxism, finance transformed by cloud, obscurity & security, logical positivism, strange loops and more.

He opposes the thinking-du-jour among computer scientists, leaning in a more humanist direction summed up here:  “I believe humans are the result of billions of years of implicit, evolutionary study in the school of hard knocks.”    The book is worth a look.

iHeavy Insights 69 – Fewer Moving Parts

In a lot of different kinds of systems there are moving parts.  Electronics, automobiles, bridges and even living systems.  As it turns out in many if not most of these systems, the simpler designs tend to have various advantages over the more complex designs.  These benefits ring true in the business world as well.

Rock Climbing

Take the extreme sport of rock climbing as an example.  I’ve been rock climbing off and on for about five years, though mostly indoors at rock climbing gyms.  One thing that you learn a lot about in rock climbing is safety.  There is a discussion of the harness, and how to double-back the waist cinch, and using multiple carabiners to lock into the rope, and then how to tie the rope in such a way that it tightens as it bears weight.  Both the person climbing and the person belaying – gathering the rope below – have to take care of these things.  So generally they both check their own rope, harness, and carabiners, and then check the other person’s.

With indoor climbing this is all rather simple, and with just six checks for each climber to make, generally quite safe.  Plus there are monitors in the room watching people climb, and further checking for mistakes or oversights.  So over the years I’ve heard of practically *no* injuries in the gym.  This is so-called top-roping, and there are few moving parts.

With outdoor climbing you can do top-roping, however more advanced climbers prefer lead climbing.  It is much more challenging, and there are many more moving parts.  The lead climber has to place “protection” into the rock every few meters.  These are special camming devices that grip into the rock.  Obviously none of these components is fool-proof, hence you want to add as many as possible.  But there are limits to endurance, and statistical averages at play, and more importantly many more moving parts.  So unfortunately lead climbing outdoors, although it can be done safely, tends to be much more prone to accidents.  More moving parts increase the statistical chance of a system breakdown.

iPhone

Something similar is at play when it comes to interface design.  With user interface or UI design, there is often a discussion of how many steps it takes to perform a function.  The more steps, the deeper the function is hidden.  Fewer steps means simplicity of design.

The iPhone is a great example of this.  By simplifying the user interface, the machine works better.  At the Mobile World Congress last year Google announced that they get 50 times more searches from the iPhone than *any* other mobile device.  Fifty times!  Think about that statistic.  This is more than flashy glitz and a pretty package.  This is a device that has fewer moving parts, not only in terms of buttons, but in the virtual interface components that a user navigates on the touch screen.

Internet & Engineering

Many of the same truisms that apply in the examples of rock climbing or smartphones also apply to internet systems, and the operations side of the business.  Can we use a web-services solution such as mailchimp.com to handle our email newsletter?  That means less to manage in-house, so our IT staff can focus on more important tasks.  Or how about outsourcing all email handling through a service like Google’s Gmail for Business, or using salesforce.com for CRM?

Simplifying your operations can also mean going with a managed hosting solution, or better yet embracing the cloud with Amazon Web Services or Rackspace Cloud.  For that matter, what database platform are you running on, or what computing platform?  Does it embrace the philosophy of complexity and more features?  Or does it strive for simplicity, and fewer moving parts?  And how many of those endless features are you actually using for your application?

Conclusion

As it turns out, engineers as much as business folks are wowed by endless features and the appeal of glitz and shine of a fancy new car.  But often in business what you need is reliability, simplicity, and fewer moving parts to get the job done, and get it done well.

5 Tips for Scalability

Your website is slow but you’re not sure why.  You do know that it’s impacting your business.  Are you losing customers to the competition? Here are five quick tips to achieve scalability.

1. Gather Intelligence

With any detective work you need information.  That’s where intelligence comes in.  If you don’t have the right data already, install monitoring and trending systems such as Cacti and Collectd.  That way you can look at where your systems have been and where they’re going.
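Once the data is being collected, "where they're going" can be as simple as fitting a straight line to recent samples. The sketch below is a rough illustration, not how Cacti or collectd work internally. It does a plain least-squares fit on (day, usage) pairs and estimates how long until a capacity limit is hit. The sample numbers are invented.

```python
def days_until_full(samples, capacity):
    """Estimate days until usage reaches `capacity`, from (day, used) samples.

    Fits a least-squares line to the samples and extrapolates from the most
    recent one. Returns None if there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        return None
    slope = sum((day - mean_day) * (used - mean_used) for day, used in samples) / spread
    if slope <= 0:
        return None
    _, latest_used = samples[-1]
    return (capacity - latest_used) / slope


if __name__ == "__main__":
    # Hypothetical disk usage in GB, sampled once a day, on a 200 GB volume.
    usage = [(0, 100), (1, 110), (2, 120), (3, 130)]
    print(days_until_full(usage, capacity=200))
```

Real growth is rarely this linear, so treat the answer as an early warning rather than a forecast.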

2. Identify Bottlenecks

Put all that information to use in your investigation.  Use stress testing tools to hit areas of the application, and identify which ones are most troublesome.  Some pages get hit A LOT, such as the login page, so slowness there is more serious than one small report that gets hit by only  a few users.  Work on the biggest culprits first to get the best bang for your buck.
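One simple way to decide which culprits are "biggest" is to weigh how slow a page is by how often it gets hit. The sketch below ranks pages by total time spent serving them. The page names and numbers are made up, and in practice they would come from your access logs or stress-test results.

```python
def rank_bottlenecks(pages):
    """Sort (name, hits, avg_seconds) tuples by total time consumed, worst first."""
    return sorted(pages, key=lambda page: page[1] * page[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical measurements: (page, hits per day, average response in seconds)
    measurements = [
        ("/reports/monthly", 40, 12.0),
        ("/login", 50000, 0.8),
        ("/search", 9000, 1.5),
    ]
    for name, hits, avg in rank_bottlenecks(measurements):
        print(f"{name:20s} {hits * avg:10.0f} seconds/day")
```

With these numbers the login page comes out on top even though the monthly report is far slower per request, which is exactly the point about fixing the busy pages first.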

3. Smooth Out the Wrinkles

Reconfigure your webservers to make more connections to your database, or spin-up more servers.  On the database tier make sure you have fast RAIDed disk, and lots of memory.  Tune queries coming from your application, and look at possible upgrades to servers.

4. Be Agile But Plan for the Future

Can your webserver tier scale horizontally?  It is pretty easy to add more servers under a load balancer.  How about your database?  Chances are with a little work and some HA magic your database can scale out with more servers too, moving the bulk of select operations to read-only copies of your primary server, while letting it focus on transactions and data updates.  Be ready and tested so you know exactly how to add servers without impacting the customers or application.  Don’t know how?  Look at the big guys like Facebook, and investigate how they’re doing it.
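The read/write split described here can be sketched as a tiny router that sends plain SELECT statements to read-only copies in round-robin order and everything else to the primary. This is a deliberately naive illustration: the server names are placeholders, and a real setup also has to deal with transactions, SELECT ... FOR UPDATE and replication lag, none of which this handles.

```python
import itertools


class QueryRouter:
    """Send SELECTs to read replicas (round-robin), everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        words = sql.split(None, 1)
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        # Writes, DDL and anything unrecognized go to the primary to be safe.
        return self.primary


if __name__ == "__main__":
    # Hypothetical host names.
    router = QueryRouter("db-primary", ["db-replica-1", "db-replica-2"])
    for statement in ("SELECT * FROM users", "UPDATE users SET name = 'x'", "select 1"):
        print(f"{router.route(statement):14s} <- {statement}")
```

Adding a read replica then becomes a one-line change to the list of replicas, which is the kind of tested, repeatable step this tip is asking for.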

5. A Going Concern

Most importantly, just like your business, your technology infrastructure is an ongoing work in progress.  Stay proactive with monitoring, analysis, trending, and vigilance.  Watch application changes, and filter for slow queries.  Have new hardware or additional hardware dynamically at-the-ready for when you need it.