Facebook, Is Anybody Listening?

If you weren’t actually using Facebook on Monday, you probably heard a coworker or friend complain it was down. Can you believe it?!?!

Also find Sean Hull’s ramblings on Twitter @hullsean.

What Happened?

Facebook explained that they hit a DNS glitch. DNS is the internet’s phone book, but it’s all automated. It turns website names into numeric IP addresses. Like phone numbers, they route you to the right place. A mismatch here will send you to the wrong place, and hence no Facebook for you!
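To make the phone book idea concrete, here is a minimal Python sketch of a lookup. It uses the standard library's socket.getaddrinfo to ask your system resolver for the addresses behind a name. The helper name resolve is made up for illustration, and this is only a sketch of what a lookup looks like from the client side, not a picture of how Facebook's own DNS is set up.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses for a hostname.

    'resolve' is a hypothetical helper for this post; it simply wraps
    the standard library's socket.getaddrinfo.
    """
    # Restrict to TCP so each address is reported once per family
    # instead of once per socket type.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("www.facebook.com") returns whatever addresses your resolver hands back at that moment. If the lookup fails, socket.gaierror is raised, and from the user's side that is exactly the "no Facebook for you" case.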

[quote]
Always on, 24×7 uptime has become de rigueur, almost a holy mantra that no one questions. But as we rely more heavily on web services for business, availability grows in importance. We need realistic expectations about uptime to plan accordingly.
[/quote]

Achieving HA in the Amazon cloud is even harder. Look at the outage that took out Reddit & Airbnb.

Who should care?

Whether Facebook is online or not may seem like fun & games until you start tying business processes to the site. And we’re not just talking fan pages here. Facebook logins are used on sites like Spotify, Disqus, Xobni, Vimeo, CNN The Forum & Digg, to name a few.

As more businesses rely on your platform, outages quickly multiply with collateral damage.

Read this: The Myth of Five Nines – Why High Availability is Overrated.

Expectations of Perfection

The power grid can’t stay up with only five minutes of downtime per year, so why should we expect online businesses to live up to this standard? I work with a lot of startups, and universally 24×7 is expected. Other clients I work with, such as hedge funds, legal firms or news providers, don’t always have this expectation. Even among banks, it is only the very largest ones, which are also global, that promise 24×7 services.
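Where does "five minutes per year" come from? It is just the arithmetic behind five nines (99.999% availability). A quick Python sketch, assuming a 365.25-day year; the function name downtime_minutes_per_year is made up for illustration:

```python
# Minutes in an average year, assuming 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime in minutes per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for two, three, four and five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {downtime_minutes_per_year(pct):.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year, while three nines (99.9%) allows close to nine hours. Each extra nine cuts the budget by a factor of ten, which is why each one costs so much more than the last.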

I would argue it is cultural. Look at this whitepaper Bellcore Standards – Myth versus Reality. The real world is messier than calculations and probabilities. It’s time we brought the bar down a notch, and gave operations folks a pat on the back for the heroic effort they put in, and the huge uptime they’re already providing!

What did we learn from Sandy? A lot about disaster recovery, that’s what.

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample