
Real Disaster Recovery Lessons from Sandy

Also find Sean Hull’s ramblings on twitter @hullsean.

Having just spent the last 24 hours in lower Manhattan while Hurricane Sandy rolled through, I've picked up some firsthand lessons on disaster recovery. Watching city and state officials, Con Edison, first responders and hospitals deal with the disaster brings some salient insights.

1. What are your essentials?

Planning for disaster isn't easy. Thinking about essentials is a good first question. For a real-life disaster scenario it might mean food, water, heat and power. What about backup power? Are your foods non-perishable? Do you have a hands-free flashlight or lamp? Have you thought about communication & coordination with your loved ones? Do you have an alternate cellular provider if your main one goes out?

With business continuity, coordinating between business units, operations teams, and datacenter admins is crucial. Running through essential services, understanding how they interoperate, and knowing who needs to be involved in which decisions is key.

Here’s a real-world story where we lost a database, what caused it and how we recovered.

2. What can you turn off?

While power is being restored, or some redundant services are offline, how can you still provide limited or degraded service? In the case of Sandy, can we move people to unaffected areas? Can we reroute power to population centers? Can we provide cellular service even while regular power is out?

[quote]Hurricane Sandy has brought devastation to the East Coast. But strong coordinated efforts between NYC, State & Federal agencies have reduced the impact dramatically. We can learn a lot about disaster recovery in web operations from their model.
[/quote]

For web applications and datacenters, this can mean applications built with feature flags, which we've mentioned before on this blog.

Also very important: architect your application to have a browse-only mode. This allows you to serve customers off of multiple webservers in various zones or regions, using lots of read replicas or read-only MySQL slave databases. It's easy to build lots of read-only copies of your data while no changes or transactions are taking place.
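On the database side, browse-only mode mostly comes down to pinning the replicas read-only and letting the application route browse traffic to them. Here is a minimal sketch, where the hostname and credentials are placeholders rather than a real setup:

    # refuse writes from ordinary (non-SUPER) users on the replica
    mysql -h replica1.example.com -u admin -p -e "SET GLOBAL read_only = 1;"

    # make the setting persistent across restarts, in the replica's my.cnf:
    # [mysqld]
    # read_only = 1

With writes shut off at the database layer, you can multiply those read-only copies behind the load balancer without worrying about conflicting changes.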

More redundancy equals more uptime.

Like this topic? Grab our newsletter

3. Did we test the plan?

A disaster is never predictable, but watching the city's emergency services was illustrative of a very good response. They outlined mandatory evacuation zones where flooding was expected to be worst.

In a datacenter, fire drills can make a big difference. Running through them gives you a sense of the time it takes to restore service, what type of hurdles you’ll face, and a checklist to summarize things. In real life, expect things to take longer than you planned.

Probably the hardest part of testing is devising scenarios. What happens if this server dies? What happens if this service fails? Be conservative with your estimates and allow extra time, as things tend to unravel in an actual disaster.
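One way to make a drill concrete is to script the failure and time the recovery against your checklist. A rough sketch along those lines, where the hostname and service command are assumptions about your environment:

    # note the start of the drill
    date

    # simulate the failure: take the read-only slave database down
    ssh replica1.example.com 'sudo service mysql stop'

    # ... work through the runbook to restore read traffic ...

    # note when service is back, and compare against the estimate
    date

Even a simple exercise like this tends to surface missing credentials, stale documentation and unclear ownership before a real outage does.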

Here are 5 ways to avoid EC2 outages.

4. Redundancy

In a disaster, redundancy is everything. Since you don't know what the future will hold, it's better to be prepared. Have more water than you think you'll need. Have additional power sources, bathrooms, or a plan B for shelter in case you're flooded out.

With Amazon’s recent outage, quite a number of internet firms failed. In our view AirBNB, FourSquare and Reddit Didn’t Have to Fail. Spreading your virtual components and services across zones and regions helps, but going further across multiple cloud providers (not just Amazon Web Services, but Joyent, Rackspace or other third-party providers) gives you insurance against a failure in any single provider.

Redundancy also means providing multiple paths through the system. From load balancers to webservers, database servers, object caches and search servers, do you have any single points of failure? A single network path? A single place where some piece of data resides?

5. Remember the big picture

While chaos is swirling, and everyone is screaming, it’s important that everyone keep sight of the big picture. Having a central authority projecting a sense of calm and togetherness doesn’t hurt. It’s also important that multiple departments, agencies, or parts of the organization continue to coordinate towards a common goal. This coordinated effort could be seen clearly during Sandy, while Federal, State and City authorities worked together.

In the datacenter, it’s easy to obsess over details and lose sight of the big picture. Technical solutions and decisions need to be aligned with ultimate business needs. This also goes for business units. If a decision is unilaterally made that publishing cannot be offline for even five minutes, such a tight constraint might cause errors and lead to larger outages.

Coordinate together, and everyone keep sight of the big picture – keeping the business running.

Speaking of the big picture, here’s Why generalists are better at scaling the web.

Read this far? Grab our newsletter Scalable Startups.

Road War Story – Hacking Inflight Solutions


The 2am phone call

Last summer I got my call from the president at 2am.  Actually it was my former boss at Hollywood Reporter.  I had worked there until three months previous, and they had since hired an outsourced DBA solution.  Big outsource, big chops.  And big fail.


12 hours to liftoff

I was scrambling to pack my luggage to go on summer vacation.  I was bound for SF and my flight was leaving in the morning.  I was trying to wrap up loose ends while my former boss entreated me – “Can you help us?  Our replication setup has just melted down.  We need you to clean up the mess.”

The so-called pain point

After a few more early am Skype calls and chats, the team retired for the night and I finished packing my bags.   I snuck in an hour of sleep then headed straight for the airport.  Once through airport security, I bust out my laptop and start logging into the servers.

Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine the differences.  Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both databases.  (For the curious, here is how.)  I find countless records that differ between the two databases.
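For context, pt-table-checksum works by checksumming each table chunk by chunk on the master and letting the results flow through replication, so any differences surface on the slave. A sketch of that kind of run, with hostnames, credentials and the checksum database as placeholders rather than the exact commands used that day:

    # run against the master; checksum results replicate to the slave
    pt-table-checksum --replicate=percona.checksums \
      h=master.example.com,u=checksum_user,p=XXXX

    # then, on the slave, list the tables whose checksums differ
    mysql -h slave.example.com -u checksum_user -p -e "
      SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
      FROM percona.checksums
      WHERE master_cnt <> this_cnt
         OR master_crc <> this_crc
         OR ISNULL(master_crc) <> ISNULL(this_crc)
      GROUP BY db, tbl;"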

Now my flight is boarding, so I pack up the laptop and find my seat.  As soon as the seat belt lights flash off, I’m flipping open my macbook and getting the inflight wifi working.  Through the flight I’m on Skype with the team, with command line terminals open to the servers.  Discuss, debug, troubleshoot – rinse, repeat.

From there I write up a report and explain the problem to the team & CTO.  Syncing that many different records is too risky.  We’d have to review all the statements one by one.  I’d rather rebuild replication from scratch.

The CTO gives the go-ahead, and with the help of Percona’s xtrabackup to do online hot backups, we are able to fix replication without downtime. Amen to that!
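The rebuild itself is roughly: take a hot backup of the master with xtrabackup (via the innobackupex wrapper), restore it on the slave, then point replication at the binlog coordinates the backup recorded. A sketch under those assumptions, with hosts, paths and coordinates as placeholders:

    # on the master: online hot backup, no downtime for InnoDB tables
    innobackupex --user=backup --password=XXXX /backups/

    # prepare the backup so the data files are consistent
    innobackupex --apply-log /backups/<timestamped-directory>/

    # copy the prepared backup into the slave's datadir, start mysqld,
    # then re-point replication using the coordinates recorded in
    # the backup's xtrabackup_binlog_info file
    mysql -h slave.example.com -u admin -p -e "
      CHANGE MASTER TO
        MASTER_HOST='master.example.com',
        MASTER_LOG_FILE='mysql-bin.000123',
        MASTER_LOG_POS=4567;
      START SLAVE;"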

Now with our primary MySQL database and secondary read-only one back online, things calm down a lot.  Traffic returns to a smooth, predictable 2 million pageviews per day.  That’s smooth and predictable on a site that gets 50 million a month!   The database loads are calm and steady, as are all of our nerves.   In the coming days we continue to monitor the situation, and write up a lengthy root cause analysis.

Freelancers & Consultants take note

To my recent Consulting 101 article I would add the following bullets:

  • Responsiveness is crucial
  • Be there when a client needs you, and your value goes up.  Be reliable, and loyal to those you’ve worked with.

  • Be an integral part of your team
  • Everyone knows each other, virtually or in real life, and is comfortable with the parts they play.  A team that can work together is crucial, whether it’s all full-time folks, some consultants, some outsourced, or wherever they may be.  Each has a role to play, and communication and teamwork bring it all together.

  • Have laptop, will travel
  • I never turn down a job.  There will be plenty of time for vacations and rest when the dust settles.

  • Don’t break things
  • If there is any doubt in your mind, test, and test again.  Always err on the side of caution.  Check thrice and cut once!  If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with options to be sure.  And most importantly, if you don’t login to the systems you’re working on regularly, you better make damn sure you’re on the right box, flipping the right switch, and moving the right dials.  With modern internet infrastructure, there are a hundred ways to push the wrong red button!

CTOs and Directors of Operations take note

  • Small & Nimble wins the day
  • I’ve used this value proposition before when speaking to prospects.  You can hire a big firm, and be a small fish to them.  Small fish means you’re gonna get less attention.  OR you can hire a small firm or contractor.  Then you’ll be a big fish to him or her.  Guess what?  If you’re their big fish, they’re gonna pay extra attention to every move they make, and ensure things don’t break.  They can’t afford mistakes, not to their reputation or their bottom line.  Not like the big boys can.

  • Choose passionate, yet conservative & risk-averse operations folks
  • With developers, you’re building technology and features, and forging ahead into new solutions.  The role is more to create waves and break barriers.  How can we enable new business processes, and so forth?

    In hiring operations personnel you want stability.  Look for individuals who are more risk averse.  This conservative streak is a countering force.  Ops teams are tasked with the job of bringing a steady state to your business services.  They don’t want to wake up at 2am.