Tag Archives: monitoring

When You Have to Take the Fall

Also find Sean Hull’s ramblings on twitter @hullsean.

One of the biggest jobs in operations is monitoring. There are so many servers: databases, webservers, search servers, backup servers. Each has lots of moving parts, lots that can go wrong. Typically, if you have monitoring and react to it, you’ll head off bigger problems later.

A problem is brewing

We, the operations team and I, started receiving alerts for one server. Space was filling up. Anyone can relate to this problem. Fill up your Dropbox, or the drive on your laptop, and all sorts of problems will quickly bubble to the surface.

As we investigated over the coming days, we found that a complicated chain of processes and backups was using space on this server. Space that didn’t belong to them.

Dinner boils over

What happened next was inevitable. The weekly batch jobs kicked off and failed for lack of space. Those processes were not being monitored. Business units then discovered missing data in their reports and a firestorm of emails ensued.

Why weren’t these services being monitored, they wanted to know.

Time to shoot the messenger

The company had recently seen a changing of the guard, with a couple of key positions left vacant, and it was clear that the root problem was communication.

I followed up on the group emails, explaining in a polite tone that we do in fact have monitoring in place, but that a clear chain of command seemed to be missing, and this process fell through the cracks.

I quickly received a response from the CTO requesting that I not send “these types of emails” to the team and to direct issues directly to him.

A consultant’s job

As the sands continued to shift, a lead architect did emerge, one who took ownership of the products overall. He acted as a sort of lifeguard with a higher perch from which to watch: we were able to escalate important issues to him & he would then prioritize the team’s work accordingly.

Sometimes things have to break a little first.

What’s more, a consultant’s job isn’t necessarily to lead the pack, nor to force management to act. A consultant’s job is to provide the best advice possible & to raise issues to the decision makers. And yes, sometimes it means being a bit of a fall guy.

Those are the breaks of the game.

Service Monitoring – What is it and why is it important?

Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything.  That’s why automated monitoring is so important.

So what should you monitor? You can divide up your monitoring into a couple of strategic areas. Just as with metrics collection, there is business & application level monitoring, and then there is lower level system monitoring, which is also important.

Business & Application Monitoring

  • If a user is getting an error page or cannot connect
  • If an e-commerce transaction is failing
  • General service outages
  • If a business goal is met – or not
  • Page timeouts or slowness
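
As a rough illustration of the first and last items in that list, here is a minimal Python sketch of an application-level check. It is not taken from any particular monitoring tool: the function names, the two-second slowness threshold and the OK/WARNING/CRITICAL labels are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request

SLOW_SECONDS = 2.0  # assumed threshold for "slow"; tune for your own site


def classify(status, elapsed, slow_after=SLOW_SECONDS):
    """Turn an HTTP status (None if unreachable) and a response time into a state."""
    if status is None or status >= 500:
        return "CRITICAL"  # cannot connect, or the server returned an error page
    if status >= 400 or elapsed > slow_after:
        return "WARNING"   # client-side error or a slow page
    return "OK"


def check_url(url, timeout=10.0):
    """Fetch url once and return (state, status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        status = None      # DNS failure, refused connection, timeout and so on
    elapsed = time.monotonic() - start
    return classify(status, elapsed), status, elapsed
```

Calling check_url with the address of a page you care about returns the state along with the raw status and timing, which a scheduler or alerting tool could then act on.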

Systems Level Monitoring

  • Backups completed successfully
  • Error logs from database, webserver & other major services like email
  • Database replication is running
  • Webserver timeouts
  • Database timeouts
  • Replication failures – via error logs & checksum checks
  • Memory, CPU, Disk I/O, Server load average
  • Network latency
  • Network security
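
To make a couple of these checks concrete, here is a small Python sketch covering disk usage and load average. The exit codes follow the convention used by Nagios plugins (0 for OK, 1 for WARNING, 2 for CRITICAL); the thresholds and function names are assumptions for illustration, and the load average call is only available on Unix-like systems.

```python
import os
import shutil

# Exit codes in the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def grade(value, warn, crit):
    """Compare a measurement against warning and critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """One-minute load average divided by CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, message); the 80/90 percent thresholds are assumed."""
    percent = disk_used_percent(path)
    return grade(percent, warn, crit), f"disk {path} at {percent:.1f}% used"
```

A check like check_disk would have flagged the filling server in the story above well before the weekly batch jobs ran out of room, provided someone owned the alert.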

Tools that can perform this type of monitoring include Nagios,

Quora discussion – Web Operations Monitoring

iHeavy Insights 68 – Transparency

The analogy du jour for cleaning up the financial mess is that sunshine makes the best disinfectant. The idea is to push for more corporate transparency as a cleaning agent for our current financial troubles. Whether this cleaning job will have a longstanding impact remains to be seen; however, it’s clear that transparency is good for markets and economic stability.

In computing, that same sunshine can be put to work as a disinfectant as well. Transparency is as important for your cloud-hosted application as it is for traditional servers. So how does it work?

Your typical internet application consists of a whole fleet of servers working together to do work for you. Unlike automobiles, bridges, buildings or even most electronics, however, their construction is constantly changing. In effect these are buildings that are always being built, and bridges always being expanded. Due to their changing nature, their behavior changes as well. That’s where transparency comes in.

There are a number of great historical data tools designed to capture a myriad of metrics on your servers and then analyze and graph that information for you offline. We like offline because it means the monitoring itself won’t impact the performance of your application and servers. Some of the tools of choice today include Munin, Cacti, and collectd. They each have their own strengths and weaknesses in terms of installation, configurability and so forth. What they all have in common, though, is the transparency they provide.
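
To give a feel for what these tools do under the hood, here is a toy Python sketch that takes one timestamped sample and appends it to a CSV file for later analysis. It is not how Munin, Cacti or collectd are implemented; the field names and file layout are assumptions, and the load average call only works on Unix-like systems.

```python
import csv
import os
import shutil
import time

# Assumed column layout for the history file.
FIELDS = ["timestamp", "load_1min", "disk_used_percent"]


def take_sample(path="/"):
    """Collect one timestamped measurement of load and disk usage."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp": int(time.time()),
        "load_1min": os.getloadavg()[0],
        "disk_used_percent": round(100.0 * usage.used / usage.total, 1),
    }


def append_sample(csv_path, row):
    """Append one row to the history file, writing a header if the file is new."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Run from cron every few minutes, something like append_sample("metrics.csv", take_sample()) builds up the history that graphs are later drawn from, without touching the application itself.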

Once installed, they will begin happily collecting information and monitoring your servers, all day and all night long, even while you are enjoying your Sunday brunch.

Are you looking at an outage that you encountered yesterday at 11pm? Did your customers have trouble ordering your products or using your service? Fire up your Cacti graphs, drill down to that time window, and review the various metrics to see what they reveal.
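
The same drill-down can be sketched in a few lines of Python if you have the raw samples on hand. The sample values below are made up purely for illustration, and the helper names are hypothetical rather than part of any of the tools mentioned.

```python
from datetime import datetime, timedelta


def in_window(samples, center, minutes=30):
    """Keep the (timestamp, value) samples within +/- minutes of center."""
    start = center - timedelta(minutes=minutes)
    end = center + timedelta(minutes=minutes)
    return [sample for sample in samples if start <= sample[0] <= end]


def peak(samples):
    """Return the sample with the highest value."""
    return max(samples, key=lambda sample: sample[1])


# Made-up load average readings around a hypothetical 11pm outage.
samples = [
    (datetime(2024, 1, 1, 21, 0), 0.4),
    (datetime(2024, 1, 1, 22, 45), 1.2),
    (datetime(2024, 1, 1, 23, 0), 9.8),
    (datetime(2024, 1, 1, 23, 20), 3.1),
    (datetime(2024, 1, 2, 1, 0), 0.5),
]

around_outage = in_window(samples, datetime(2024, 1, 1, 23, 0))
worst = peak(around_outage)
```

Here around_outage holds the three readings between 10:30pm and 11:30pm, and worst is the 11pm spike, which is the point you would then compare against the other metrics for the same window.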

Having the right information at your fingertips is the first step in being able to resolve troubles.  Only with the right information can you fix these problems, and serve your customers what they expect.  So follow the analogy of using sunshine as a disinfectant and shine some light into your complex cloud environments. Let transparency lead you to the root of the problem and clean it up before it touches your customers.

Book Review:  The Ascent of Money – Niall Ferguson

When I think back to the dot-com days, I recall euphoria in people’s eyes. It was that excitement in the face of making boatloads of money off the stock market that I remember clearly. It is the excitement of the gambler, the thought of taking the shortcut, of getting something for nothing. I remember seeing that same look in people’s eyes when they talked about housing just a few short years ago. Talk of flipping houses and making money without adding anything.

It’s after the bubble bursts that everyone starts to think clearly again. The tide has receded, and we are left wondering how so many bathers could have been swimming without bathing suits when it’s now plain for all to see.

Niall Ferguson’s book chronicles money’s use through history, both the good and the bad. By putting the current financial mess into historical perspective, he offers us new insights into our current predicament, helping us chart the way forward. For anyone wanting to understand the financial forces around us, this is definitely a book worth reading.