With traditional managed hosting solutions, we have best practices, business continuity plans, and disaster recovery; we document our processes and all the moving parts in our infrastructure. At least we pay lip service to these goals, though from time to time we admit to getting sidetracked by bigger fish to fry, higher priorities, and the emergency of the day. We add “firedrill” to our to-do list, promising we’ll test restoring our backups. But all too often it is an actual emergency that forces us to find out whether we really have all the pieces backed up and can reassemble them properly.
Cloud Computing is different. These goals are no longer lofty ideals; they must be put into practice. Here’s why.
- Virtual servers are not as reliable as physical servers
- Amazon EC2 has a lower SLA than many managed hosting providers
- Devops introduces a new paradigm: infrastructure scripts can be version controlled
- The EC2 environment demands scripting and repeatability
- New flexibility and peace of mind
EC2 virtual servers can and will die. Your spinup scripts and infrastructure should treat this possibility not as some far-off anomalous event, but as a day-to-day concern. With proper scripts and testing of various scenarios, it becomes manageable. Use snapshots to back up EBS root volumes, and build spinup scripts around AMIs that have all the components your application requires. Then test, test and test again.
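As one illustration of scripting your backups, the rotation logic around EBS snapshots can live in a few lines of code. This is only a sketch: the retention count and date format are assumptions, and the actual snapshot create/delete calls against the EC2 API are deliberately left out.

```python
# Sketch of a snapshot-rotation policy for EBS backups: keep the most
# recent `keep` snapshots and return the rest for deletion. The real
# EC2 API calls to create and delete snapshots are not shown; this is
# only the decision logic a spinup/backup script would wrap around them.
def snapshots_to_delete(snapshot_dates, keep=7):
    """snapshot_dates: ISO date strings, one per existing snapshot."""
    return sorted(snapshot_dates, reverse=True)[keep:]

old = snapshots_to_delete(["2009-07-%02d" % d for d in range(1, 11)], keep=7)
print(old)  # the three oldest: ['2009-07-03', '2009-07-02', '2009-07-01']
```

Because the policy is plain code, it can be tested on its own, long before an emergency forces the question.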
Amazon EC2’s SLA – Only 99.95%
The computing industry throws the 99.999% or five-nines uptime SLA standard around a lot. That amounts to less than six minutes of downtime per year. Amazon’s 99.95% allows for 263 minutes of downtime per year, and greater downtime merely gets you a credit on your account. With that in mind, repeatable processes and scripts to bring your infrastructure back up in different availability zones or even different datacenters are a necessity. Along with your infrastructure scripts, offsite backups also become a wise choice. You should further take advantage of availability zones and regions to make your infrastructure more robust. By using private IP addresses and the private network, you can host a MySQL database slave in a separate zone, for instance. You can also use GDLB, or Geographically Distributed Load Balancing, to send customers on the west coast to that zone, and those on the east coast to one closer to them. In the event that one region or availability zone goes out, your application is still responding, though perhaps with slightly degraded performance.
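The arithmetic behind those figures is easy to check for yourself:

```python
# Convert an SLA percentage into the downtime it permits per year.
def allowed_downtime_minutes(sla_percent):
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - sla_percent / 100.0)

print(round(allowed_downtime_minutes(99.999), 1))  # five nines: ~5.3 minutes
print(round(allowed_downtime_minutes(99.95)))      # Amazon's SLA: ~263 minutes
```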
Devops – Infrastructure as Code
With traditional hosting, you either physically manage all of the components in your infrastructure, or have someone do it for you. Either way a phone call is required to get things done. With EC2, every piece of your infrastructure can be managed from code, so your infrastructure itself can be managed as software. Whether your software development lifecycle follows the waterfall method or agile, you have the new flexibility to place all of these scripts and configuration files in version control. This raises the manageability of your environment tremendously, and it provides a kind of ongoing documentation of all the moving parts. In short, it forces you to deliver on all of those best practices you’ve been preaching over the years.
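In its simplest form, this can mean describing your environment as plain data that lives in version control, with a launch script (not shown) reading it to bring servers up. All the names, AMI IDs, and fields below are illustrative assumptions, not a real provisioning tool:

```python
# Minimal sketch of "infrastructure as code": the environment described
# as data that can be checked into version control and diffed over time.
# Every identifier here is made up for illustration.
INFRASTRUCTURE = {
    "web-1":     {"ami": "ami-11111111", "type": "m1.small", "zone": "us-east-1a"},
    "web-2":     {"ami": "ami-11111111", "type": "m1.small", "zone": "us-east-1b"},
    "db-master": {"ami": "ami-22222222", "type": "m1.large", "zone": "us-east-1a"},
    "db-slave":  {"ami": "ami-22222222", "type": "m1.large", "zone": "us-east-1b"},
}

def zones_used(infra):
    """Sanity check: which availability zones does this layout span?"""
    return sorted({server["zone"] for server in infra.values()})

print(zones_used(INFRASTRUCTURE))  # ['us-east-1a', 'us-east-1b']
```

Even a small check like `zones_used` turns a best practice (spread across zones) into something a script can verify on every change.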
EC2 Environment Considerations
When servers get restarted they get new IP addresses, both private and public. This can affect configuration files for everything from webservers to mail servers, and database replication as well. Your new server may mount an external EBS volume containing your database; if so, your start scripts should check for that volume and not start MySQL until it is found. To further complicate things, you may choose to use software RAID over a handful of EBS volumes to get better performance.
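A start script can implement that check by polling until the data volume is actually mounted before launching MySQL. The mount path, timeout, and polling interval below are assumptions for illustration:

```python
import os
import time

def wait_for_mount(path, timeout=300, interval=5):
    """Poll until `path` is a mounted filesystem, so a start script can
    delay launching MySQL until its EBS data volume is attached.
    Path and timings are illustrative assumptions."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.ismount(path):
            return True
        time.sleep(interval)
    return False

# In a start script, something along these lines:
# if wait_for_mount("/var/lib/mysql"):
#     start MySQL via your init script
# else:
#     alert an operator rather than starting with an empty data directory
```

The important design point is the else branch: failing loudly beats silently initializing a fresh, empty database on the root volume.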
The more special cases you have, the more quickly you realize how important it is to manage these things in software. And the more often a process needs to be repeated, the more the scripts will save you time.
New Flexibility in the Cloud
Ultimately, if you account for less reliable virtual servers and mitigate that risk with zones, regions, and automated scripts, you can enjoy all the new benefits of the cloud.
- easy test & dev environment setup
- robust load & scalability testing
- vertically scaling servers in place – in minutes!
- pause a server – incurring only storage costs for days or months as you like
- lower costs for applications with seasonal traffic patterns
- no huge up-front costs
Best practices for backups and disaster recovery aren’t tremendously different in the cloud than in a managed hosting environment. But they are more crucial, since cloud servers are less reliable than physical servers. Security may also play a heightened role in the cloud. Here are some points to keep in mind.
1. Perform multiple types of backups
2. Keep non-proprietary backups offsite
3. Test your backups – perform firedrills
4. Encrypt backups in S3
5. Perform replication integrity checks
George Reese’s book doesn’t have the catchiest title, but the book is superb. One thing to keep in mind: it is not a nuts-and-bolts or how-to type of book. Although there is a quick intro to the EC2 APIs, you’re better off looking at the AWS docs, or Jeff Barr’s book on the subject. Reese’s book is really about answering the difficult questions involved in cloud deployments.
Cloud Computing holds a lot of promise, but there are also a lot of speed bumps along the way.
In this six-part series we’re going to cover a lot of ground. We don’t intend this series to be an overly technical nuts-and-bolts how-to. Rather, we will discuss high-level issues and answer questions that come up for CTOs, business managers, and startup CEOs.
Some of the tantalizing issues we’ll address include:
- How do I make sure my application is built for the cloud with scalability baked into the architecture?
- I know disk performance is crucial for my database tier. How do I get the best disk performance with Amazon Web Services & EC2?
- How do I keep my AWS passwords, keys & certificates secure?
- Should I be doing offsite backups as well, or are snapshots enough?
- Cloud providers such as Amazon seem to have poor SLAs (service level agreements). How do I mitigate this using availability zones & regions?
- Cloud hosting environments like Amazon’s provide no perimeter security. How do I use security groups to ensure my setup is robust and bulletproof?
- Cloud deployments change the entire procurement process, handing a lot of control over to the web operations team. How do I ensure that finance and ops are working together, and a ceiling budget is set and implemented?
- Reliability of Amazon EC2 servers is much lower than traditional hosted servers. Failure is inevitable. How do we use this fact to our advantage, forcing discipline in the deployment and disaster recovery processes? How do I make sure my processes are scripted & firedrill tested?
- Snapshot backups and other data stored in S3 are somewhat less secure than I’d like. Should I use encryption to protect this data? When and where should I use encrypted filesystems to protect my more sensitive data?
- How can I best use availability zones and regions to geographically disperse my data and increase availability?
As we publish each of the individual articles in this series we’ll link them to the titles below. So check back soon!
It may sound like a pessimistic view of computing systems, but the fact is that all of the components making up the modern Internet stack have a certain failure rate. Looking at that realistically, and planning for a breakdown so you can manage it better, is essential.
Failures in traditional datacenters
In your own datacenter, or that of your managed hosting provider, sit racks and racks of servers. Typically a proactive system administrator will keep a lot of spare parts around: hard drives, switches, additional servers and so on. Although you don’t need them now, you don’t want to be in the position of having to order new equipment when something fails. That would increase your recovery time dramatically.
Besides keeping extra components lying around, you also typically want to avoid the so-called single point of failure: dual power systems, switches, database servers, webservers and so on. RAID is now more or less standard in all modern servers, because the loss of a commodity SATA drive is so common. Yet this redundancy makes such a loss a non-event: we expect it, and so we design for it.
And while we are prudent enough to perform backups regularly and document the layout of our systems, the environment in a traditional datacenter is rarely completely scripted. Testing backups and restoring the database may be common, but a full fire drill to rebuild everything is rarer.
Failure in the Cloud
In the last decade we saw Linux on commodity hardware take over as the internet platform of choice because of the huge cost differential compared to traditional hardware from vendors such as Sun or HP. The hardware was more likely to fail, but being a tenth of the price meant you could build in redundancy to cover yourself and still save money.
The latest wave of cloud providers brings the same kind of cost savings. But cloud-hosted servers, for instance in Amazon EC2, are much less reliable than the typical rack-mounted servers you might have in your datacenter.
We all agree that planning for disaster recovery is a good idea, but sometimes it gets pushed aside by other priorities. In the cloud it moves front and center as an absolute necessity. This forces a new, more robust approach to rebuilding your environment, with scripts documenting and formalizing your processes.
This is all a good thing, as hardware failure becomes an expected occurrence. Failures are a given; it’s how quickly you recover that makes the difference.
Cloud Application Architectures by George Reese
I originally picked up this book expecting a very hands-on guide to cloud deployments, especially on EC2. That is not what this book is, though. It’s actually a very good CTO-targeted book, covering difficult questions like cost comparisons between cloud and traditional datacenter hosting, security implications, disaster recovery, performance, and service levels. The book is very readable, and not overly technical.