5 Ways to Avoid EC2 Outages

1. Back up outside of the cloud

Some of the high-profile companies affected by Amazon's April 2011 outage could have recovered had they kept a backup of their entire site outside of the cloud. With any hosting provider, whether a managed traditional data center or a cloud provider, alternate backups are always a good idea. A MySQL logical backup, an incremental backup, or both can be copied regularly offsite or to an alternate cloud provider. That's real insurance!
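As a rough illustration of the offsite copy, here is a minimal Python sketch. It assumes `mysqldump` is on the PATH with credentials in `~/.my.cnf`, and that the offsite location is reachable as an ordinary directory (for example a mounted remote volume). The function names and paths are hypothetical, not part of any particular tool.

```python
import gzip
import hashlib
import shutil
import subprocess
from pathlib import Path


def dump_command():
    """Build a mysqldump invocation for a logical backup of all databases."""
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking them for the duration of the dump.
    return ["mysqldump", "--single-transaction", "--all-databases"]


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(dump_path, offsite_dir):
    """Copy a dump into offsite_dir and confirm the copy matches the original."""
    offsite_dir = Path(offsite_dir)
    offsite_dir.mkdir(parents=True, exist_ok=True)
    target = offsite_dir / Path(dump_path).name
    shutil.copy2(dump_path, target)
    if sha256_of(dump_path) != sha256_of(target):
        raise OSError(f"checksum mismatch after copying to {target}")
    return target


def run_backup(dump_path, offsite_dir):
    """Dump MySQL to a gzip file, then copy it offsite (assumed local mount)."""
    with gzip.open(dump_path, "wb") as out:
        proc = subprocess.Popen(dump_command(), stdout=subprocess.PIPE)
        shutil.copyfileobj(proc.stdout, out)
        if proc.wait() != 0:
            raise RuntimeError("mysqldump exited with an error")
    return copy_offsite(dump_path, offsite_dir)
```

Something like `run_backup("/var/backups/site.sql.gz", "/mnt/offsite")` could then run from cron. The checksum comparison is there because a backup that was silently truncated in transit is no insurance at all.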

2. Use alternate regions and availability zones

Amazon's outage in April underscored a lot of things. One is that the service is still evolving and experiencing growing pains; another is that a great many internet firms rely on it. Amazon's own documentation and best practices outline the need to plan for server failure. Amazon instances are not as reliable as servers in a traditional hosting center, just as commodity hardware was not as reliable as Sun hardware even while it was replacing it.

By using multiple availability zones and regions in your infrastructure design, you compensate for lower SLAs and less reliable servers. This is not optional; it is a requirement for building a resilient infrastructure.
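To make the idea concrete, here is a small Python sketch of round-robin placement across zones, with a check of what survives when one zone fails. The server and zone names are only examples, and the two functions are hypothetical helpers rather than any AWS API.

```python
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign each instance to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    return {name: zone for name, zone in zip(instance_names, cycle(zones))}


def survivors(placement, failed_zone):
    """Return the instances that are still up when one zone goes dark."""
    return sorted(name for name, zone in placement.items() if zone != failed_zone)


placement = spread_across_zones(
    ["web1", "web2", "web3", "web4"],
    ["us-east-1a", "us-east-1b"],
)
print(survivors(placement, "us-east-1a"))  # prints ['web2', 'web4']
```

With all four servers in a single zone, the same failure would leave nothing running. Spreading them means losing a zone costs capacity rather than the whole site.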

3. Use alternate cloud providers

As we mentioned previously, at the very least an alternate cloud provider can be used for your backups. But to build a more fault-tolerant infrastructure, build servers and provide services out of multiple cloud providers. This type of setup can provide the highest availability and protection against outages, as it becomes very unlikely that any regional or business failure could take down multiple providers in different geographic regions.

4. Design & test for failure

Fire drills are a great way to test for outages in a planned way. Run through all the steps to rebuild all components in your site. Document those processes carefully as you go through them. A central repository of documentation stands in for your lead engineer, so that the knowledge to rebuild your site does not rely on any single person in your business.

Further testing can involve a more shotgun approach: taking down components at random in a running system. This tests for true resiliency but requires a robustness of architecture, process and automation that is at the very highest level. That's what Netflix does with their Chaos Monkey.
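The random-failure idea can be sketched in a few lines of Python. This is not Netflix's implementation, only an illustration of the principle. The `terminate` callback is a hypothetical stand-in for whatever actually stops a server in your environment, and the `protected` set is an assumed safety valve for machines you are not ready to lose.

```python
import random


def pick_victim(instances, protected=(), rng=None):
    """Choose one instance at random, skipping any protected ones."""
    rng = rng or random.Random()
    candidates = [name for name in instances if name not in protected]
    if not candidates:
        return None
    return rng.choice(candidates)


def run_drill(instances, terminate, protected=(), rng=None):
    """Terminate one randomly chosen instance and report which one it was."""
    victim = pick_victim(instances, protected, rng)
    if victim is not None:
        terminate(victim)
    return victim


# Dry run: record what would be terminated instead of stopping anything.
terminated = []
victim = run_drill(
    ["web1", "web2", "db1"],
    terminate=terminated.append,
    protected={"db1"},
)
print("would terminate:", victim)
```

Starting with a dry run and a generous protected list lets you build confidence before pointing the drill at real servers.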

5. Include configs, code, files & database

Your backups are best verified by fire drill. Rebuilding the entire infrastructure from bare metal involves many moving parts. At the lowest level you have hardware, then you have AMIs with the OS, and from there you install the particular packages that your application requires. Those packages need configurations, hopefully managed by a configuration management system such as CFEngine, Puppet or Chef.

On top of all of that you need your application itself, and any files that it builds or relies on in the filesystem. Lastly, your backend datastore, likely a relational database such as MySQL, needs its data backup restored to recover all the stateful data your application relies on.
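One simple way to keep yourself honest is to check each backup set against the four components named in this section. The Python sketch below assumes a manifest in the form of a dictionary mapping each component to the path of its archive. That format, the function name and the example paths are all hypothetical.

```python
REQUIRED_COMPONENTS = ("configs", "code", "files", "database")


def missing_components(manifest):
    """Return the required components that have no archive in the manifest."""
    return [name for name in REQUIRED_COMPONENTS if not manifest.get(name)]


manifest = {
    "configs": "backups/puppet-manifests.tar.gz",
    "code": "backups/app-release.tar.gz",
    "database": "backups/site.sql.gz",
}
print(missing_components(manifest))  # prints ['files']
```

A check like this only proves the archives exist. Whether they actually restore is something only a fire drill will tell you.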