I recently went on a flight with a pilot friend & two others. With only three passengers, you get a front-row seat to all the action. Besides getting to chat directly with the pilot, you also see the many checks & balances in place. They call it the warrior aviation checklist. It inspired this post on disaster recovery in the datacenter.
Join 8000 others and follow Sean Hull on Twitter @hullsean.
1. Have a real plan B
Looking over the document, you can see there are procedures for engine failure in flight! That’s right, those small planes can have an engine failure and still glide. So for starters, the technology is designed for failure. But the technology design is not the end of the story.
When you have procedures and processes in place, and training to deal with failure, you’re expecting and planning for it.
Expect & plan for failure. Use technologies designed to fail gracefully, and build your software to do the same. Then put detection & alerting in place so you can kick off your recovery procedures quickly.
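Here's a rough sketch of what "fail gracefully, then alert" can look like in code. It's only an illustration: `fetch_with_fallback`, `primary`, `fallback` and `alert` are hypothetical names, and a real setup would page someone rather than append to a list.

```python
def fetch_with_fallback(primary, fallback, alert):
    """Try the primary source; if it fails, raise an alert and use plan B."""
    try:
        return primary()
    except Exception as exc:
        # Alert first, so a human knows plan B is now in effect.
        alert(f"primary failed: {exc!r}; switching to fallback")
        return fallback()


def broken_database():
    # Hypothetical primary that is currently down.
    raise ConnectionError("db down")


def cached_copy():
    # Hypothetical plan B: serve slightly stale data.
    return "cached result"


sent_alerts = []
result = fetch_with_fallback(broken_database, cached_copy, sent_alerts.append)
```

The point is that the fallback path and the alert live in the same place, so you never quietly run on plan B without knowing it.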
2. Use checklists
Checklists are an important part of good process. We use them for code deploys. We can use them for disaster recovery too. But how do you create them?
Fire drills! That’s right, run through the entire process of disaster recovery with your team & document each step. As you do so, you’ll be creating the master checklist for real disaster recovery. Keep it in a safe place, or better, in multiple places so you’ll have it when you really need it.
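One way to keep a fire-drill checklist honest is to write it down as data and run it. The sketch below assumes each step can be verified by a small function. The `Step` class, `run_checklist`, and the three example steps are made up for illustration, and the placeholder checks would be replaced by real ones.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    check: Callable[[], bool]  # returns True when the step is verified


def run_checklist(steps):
    """Run every step in order and return the descriptions of failed steps."""
    failed = []
    for number, step in enumerate(steps, start=1):
        try:
            ok = bool(step.check())
        except Exception:
            # A check that blows up counts as a failed step, not a crashed drill.
            ok = False
        print(f"{number}. [{'ok' if ok else 'FAILED'}] {step.description}")
        if not ok:
            failed.append(step.description)
    return failed


# Hypothetical drill: the lambdas stand in for real verification code.
drill = [
    Step("Latest backup restores to a scratch server", lambda: True),
    Step("Standby database accepts writes after promotion", lambda: True),
    Step("DNS points at the standby site", lambda: False),
]
failures = run_checklist(drill)
```

Every step runs even after an earlier one fails, so a single drill shows you everything that is broken, not just the first problem.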
3. Trust your instruments
Modern infrastructures include tens or hundreds of servers. With cloud instances so simple to spin up, we add new services and servers daily. No one can watch all of that by hand. What to do?
You obviously need to use the machines to monitor the machines. Have a good monitoring system like Nagios, and metrics collection such as New Relic to help you stay ahead of failure.
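To make that concrete, here is a minimal sketch of a disk-space check written in the style of a Nagios plugin. Nagios plugins report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of output. The 80% and 90% thresholds and the function names are my own assumptions, not recommendations.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, one-line message) for the filesystem at path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    percent = 100.0 * usage.used / usage.total
    code = classify(percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {percent:.1f}% used on {path}"


def main():
    """Print the status line and return the exit code for the monitor."""
    code, message = check_disk()
    print(message)
    return code
```

Saved as a script, you would finish with `sys.exit(main())` so the monitoring system can read the exit code. The check warns before things are critical, which is what lets you stay ahead of the failure.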
Set up the right monitoring automation, and then trust the instruments!