Saw this awesome infographic by Cloud Spectator on Twitter today. A great snapshot of the expected growth in Cloud Computing. If the Cloud Computing market were stacked up against world economies it would be the 51st largest in the world with massive inequality – Amazon has half the share.
If you’re headhunting a cloud computing expert, specifically someone who knows Amazon Web Services (AWS) and EC2, you’ll want to have a battery of questions to ask them to assess their knowledge. As with any technical interview focus on concepts and big picture. As the 37Signals folks like to say “hire for attitude, train for skill”. Absolutely!
If you want more general info about Amazon Web Services, read our Intro to EC2 Deployments.
1. Explain Elastic Block Store (EBS)? What type of performance can you expect? How do you back it up? How do you improve performance?
EBS is a virtualized SAN or storage area network. That means it is RAID storage to start with so it’s redundant and fault tolerant. If disks die in that RAID you don’t lose data. Great! It is also virtualized, so you can provision and allocate storage, and attach it to your server with various API calls. No calling the storage expert and asking him or her to run specialized commands from the hardware vendor.
Performance on EBS can exhibit variability. That is, it can go above the SLA performance level, then drop below it. The SLA provides you with an average disk I/O rate you can expect. This can frustrate some folks, especially performance experts who expect reliable and consistent disk throughput on a server. Traditional physically hosted servers behave that way. Virtual AWS instances do not.
Back up EBS volumes by using the snapshot facility via an API call or via a GUI like Elasticfox.
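Here is a minimal sketch of scripting that snapshot step. It assumes a client that behaves like boto3's EC2 client, whose `create_snapshot(VolumeId=..., Description=...)` call returns a dict containing `SnapshotId`. The helper name `snapshot_volumes` and the `label` parameter are hypothetical, made up for illustration.

```python
from datetime import datetime, timezone


def snapshot_volumes(ec2_client, volume_ids, label="nightly"):
    """Request one EBS snapshot per volume and return the new snapshot IDs.

    ec2_client is assumed to behave like boto3.client("ec2"). It is passed in
    so the function can be exercised without AWS credentials.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"{label} backup of {volume_id} at {stamp}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

With real credentials you would pass in `boto3.client("ec2")` and your own volume IDs. Snapshots are taken asynchronously, so a production script would also check that each one completes.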
Improve performance by using Linux software RAID and striping across four volumes.
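To see why striping helps, here is a small sketch of how a striped (RAID 0) array maps a logical offset onto its member volumes. The 64 KiB chunk size and the function name `locate` are assumptions for illustration, not values taken from any particular setup.

```python
def locate(byte_offset, volumes=4, chunk_size=64 * 1024):
    """Map a logical byte offset on a RAID 0 array to (volume index, offset on that volume)."""
    chunk = byte_offset // chunk_size   # which chunk of the logical device
    volume = chunk % volumes            # chunks rotate across the volumes
    row = chunk // volumes              # full stripes that precede this chunk
    return volume, row * chunk_size + byte_offset % chunk_size


# A 1 MiB sequential read touches every volume, so the I/O is spread four ways.
touched = {locate(offset)[0] for offset in range(0, 1024 * 1024, 64 * 1024)}
```

Because consecutive chunks land on different EBS volumes, a large read or write is serviced by all four at once. Keep in mind that striping adds no redundancy of its own.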
2. What is S3? What is it used for? Should encryption be used?
S3 stands for Simple Storage Service. You can think of it like FTP storage, where you can move files to and from there, but not mount it like a filesystem. AWS automatically puts your snapshots there, as well as your AMIs. Encryption should be considered for sensitive data, as S3 is a proprietary technology developed by Amazon themselves, and as yet unproven from a security standpoint.
3. What is an AMI? How do I build one?
AMI stands for Amazon Machine Image. It is effectively a snapshot of the root filesystem. Commodity hardware servers have a BIOS that points to the master boot record in the first block on a disk. A disk image though can sit anywhere physically on a disk, so Linux can boot from an arbitrary location on the EBS storage network.
Need an AWS expert? Email me for a quote hullsean @ gmail.com
Build a new AMI by first spinning up an instance from a trusted AMI, then adding packages and components as required. Be wary of putting sensitive data onto an AMI. For instance your access credentials should be added to an instance after spinup. With a database, mount an outside volume that holds your MySQL data after spinup as well.
4. Can I vertically scale an Amazon instance? How?
Yes. This is an incredible feature of AWS and cloud virtualization. Spinup a new larger instance than the one you are currently running. Stop that new instance, then detach its root EBS volume and discard it. Then stop your live instance and detach its root volume. Note the unique device ID and attach that root volume to your new server. And then start it again. Voila, you have scaled vertically in-place!!
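The order of operations matters here, so a rough sketch of the sequence may help. The operation names below match boto3 EC2 client methods (`stop_instances`, `detach_volume`, `delete_volume`, `attach_volume`, `start_instances`). The function names, the default device name `/dev/sda1` and all IDs are hypothetical placeholders.

```python
def vertical_scale_plan(old_instance, new_instance, old_root_volume,
                        new_root_volume, device="/dev/sda1"):
    """Return the ordered EC2 API calls for the root-volume swap described above.

    Each step is (operation name, keyword arguments). The device name is an
    assumption; use the one noted from your live instance.
    """
    return [
        ("stop_instances", {"InstanceIds": [new_instance]}),
        ("detach_volume", {"VolumeId": new_root_volume, "InstanceId": new_instance}),
        ("delete_volume", {"VolumeId": new_root_volume}),
        ("stop_instances", {"InstanceIds": [old_instance]}),
        ("detach_volume", {"VolumeId": old_root_volume, "InstanceId": old_instance}),
        ("attach_volume", {"VolumeId": old_root_volume, "InstanceId": new_instance,
                           "Device": device}),
        ("start_instances", {"InstanceIds": [new_instance]}),
    ]


def run_plan(ec2_client, plan):
    """Execute each step in order.

    A real run must also wait for each state change (stopped, detached,
    attached) before moving on. That waiting is omitted from this sketch.
    """
    for operation, kwargs in plan:
        getattr(ec2_client, operation)(**kwargs)
```

Keeping the plan as plain data means you can print and review it before anything touches a live server.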
5. What is auto-scaling? How does it work?
Autoscaling is a feature of AWS which allows you to configure and automatically provision and spinup new instances without the need for your intervention. You do this by setting thresholds and metrics to monitor. When those thresholds are crossed a new instance of your choosing will be spun up, configured, and rolled into the load balancer pool. Voila you’ve scaled horizontally without any operator intervention!
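The threshold logic is simple enough to sketch. This is not how AWS implements it, just an illustration of the decision made at each evaluation. The CPU thresholds, the group size limits and the scale-in branch are all assumptions chosen for the example.

```python
def desired_capacity(current, cpu_percent, scale_out_at=70.0, scale_in_at=25.0,
                     minimum=2, maximum=10):
    """Return how many instances the group should run after one evaluation."""
    if cpu_percent > scale_out_at:
        target = current + 1      # threshold crossed: add an instance
    elif cpu_percent < scale_in_at:
        target = current - 1      # load has dropped: retire an instance
    else:
        target = current          # within bounds: leave the group alone
    # Never go outside the configured group size limits.
    return max(minimum, min(maximum, target))
```

The minimum and maximum act as guard rails, so a runaway metric can neither empty the pool nor run up an unbounded bill.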
With MySQL databases autoscaling can get a little dicey, so we wrote a guide to autoscaling MySQL on amazon EC2.
6. What automation tools can I use to spinup servers?
The most obvious way is to roll-your-own scripts, and use the AWS API tools. Such scripts could be written in bash, python or another language of your choice. Next option is to use a configuration management and provisioning tool like Puppet or, better, the newer Opscode Chef. Ansible is also an excellent option because it doesn’t require an agent, and can run your shell scripts as-is. You might also look towards CloudFormation or Terraform. The resulting code captures your entire infrastructure, can be checked into your git repository & version controlled. You can even unit test this way!
7. What is configuration management? Why would I want to use it with cloud provisioning of resources?
Configuration management has been around for a long time in web operations and systems administration. Yet the cultural popularity of it has been limited. Most systems administrators configure machines the way software was developed before version control – that is, by manually making changes on servers. Each server can then be, and usually is, slightly different. Troubleshooting though is straightforward as you login to the box and operate on it directly. Configuration management brings a large automation tool into the picture, managing servers like strings of a puppet. This forces standardization, best practices, and reproducibility as all configs are versioned and managed. It also introduces a new way of working which is the biggest hurdle to its adoption.
Enter the cloud, and configuration management becomes even more critical. That’s because virtual servers such as Amazon’s EC2 instances are much less reliable than physical ones. You absolutely need a mechanism to rebuild them as-is at any moment. This pushes best practices like automation, reproducibility and disaster recovery into center stage.
While on the subject of configuration management take a quick peek at hiring a devops guide.
8. Explain how you would simulate perimeter security using the Amazon Web Services model?
Traditional perimeter security that we’re already familiar with using firewalls and so forth is not supported in the Amazon EC2 world. AWS supports security groups. One can create a security group for a jump box with ssh access – only port 22 open. From there a webserver group and database group are created. The webserver group allows 80 and 443 from the world, but port 22 *only* from the jump box group. Further the database group allows port 3306 from the webserver group and port 22 from the jump box group. Add any machines to the webserver group and they can all hit the database. No one from the world can, and no one can directly ssh to any of your boxes.
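The jump box, webserver and database arrangement can be modeled in a few lines, which makes it easy to check who can reach what. This is a toy model of the rules described above, not the AWS API. The names `RULES` and `is_allowed` are made up for the example.

```python
WORLD = "0.0.0.0/0"

# group name -> list of (port, allowed source); a source is WORLD or another group
RULES = {
    "jumpbox": [(22, WORLD)],
    "webserver": [(80, WORLD), (443, WORLD), (22, "jumpbox")],
    "database": [(3306, "webserver"), (22, "jumpbox")],
}


def is_allowed(source, target_group, port, rules=RULES):
    """True if traffic from `source` (a group name or WORLD) may reach `port` on `target_group`."""
    for rule_port, rule_source in rules.get(target_group, []):
        if rule_port == port and rule_source in (WORLD, source):
            return True
    return False
```

For example, `is_allowed("webserver", "database", 3306)` is true while `is_allowed(WORLD, "database", 3306)` is false, which is exactly the perimeter we were after.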
The more full-featured way to go is VPC. That’s Amazon’s acronym for virtual private cloud. You can create virtual networks both private & public, with subnets etc all within VPCs. You then spinup servers & resources inside those virtual networks. VPCs can be controlled with security groups or the more powerful but messy access control lists.
Want to further lock this configuration down? Only allow ssh access from specific IP addresses on your network, or allow just your subnet.
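Restricting ssh by source address comes down to a CIDR membership test, which Python's standard `ipaddress` module can illustrate. The networks below are documentation-range placeholders, not real addresses, and `ssh_permitted` is a hypothetical helper.

```python
import ipaddress

# Hypothetical office networks; substitute your own addresses.
ALLOWED_SSH_SOURCES = [
    ipaddress.ip_network("203.0.113.0/24"),     # an example subnet
    ipaddress.ip_network("198.51.100.17/32"),   # a single example address
]


def ssh_permitted(client_ip, allowed=ALLOWED_SSH_SOURCES):
    """True if client_ip falls inside one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allowed)
```

In a security group you would express the same thing by listing those CIDR blocks as the only sources allowed on port 22.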
Amazon Web Services is a division of Amazon the bookseller, but this part of the business is devoted solely to infrastructure and internet servers. These are the building blocks of data centers, the workhorses of the internet. AWS’s offering of Cloud Computing solutions allows a business to set up, or “spinup” in the jargon of cloud computing, new compute resources at will. Need a small single CPU 32-bit Ubuntu server with two 20G disks attached? One command and 30 seconds away, and you can have that!
As we discussed previously, Infrastructure Provisioning has evolved dramatically over the past fifteen years from something that took time and cost a lot, to the fast automatic process that it is today with cloud computing. This has also brought with it a dramatic culture shift in the way that systems administration is being done, from a fairly manual process of physical machines and software configuration, one that took weeks to set up new services, to a scriptable and automatable process that can take seconds.
This new realm of cloud computing infrastructure and provisioning is called Infrastructure as a Service or IaaS, and Amazon Web Services is one of the largest providers of such compute resources. They’re not the only ones of course. Others include:
- Rackspace Cloud
Cloud Computing is still in its infancy, but is growing quickly. Amazon themselves had a major data center outage in April that we discussed in detail. It sent some hot internet startups into a tailspin!
Backups are obviously an important part of any managed infrastructure deployment. Computing systems are inherently fallible, through operator error or hardware failure. Existing systems must be backed up, from configurations, software and media files, to the backend data store.
In a managed hosting environment or cloud hosting environment, it is convenient to use various filesystem snapshot technologies to perform backups of entire disk volumes in one go. These are powerful, fast, reliable, and easy to execute. In Amazon EC2 for example these EBS snapshots are stored on S3. But what happens if your data center goes down – through network outage or power failure? Or further what happens if S3 goes offline? Similar failures can affect traditional managed hosting facilities as well.
This is where offsite backups come in handy. You would then be able to rebuild your application stack and infrastructure despite all of your production servers being offline. That’s peace of mind! Offsite backups can come in many different flavors:
- mysqldump of the entire database, performed daily and copied to alternate hosting facility
- semi-synchronous replication slave to alternate datacenter or region
- DRBD setup – replicated block device upon which your database runs
- replicated copy of version control repository – housing software, documentation & configurations
Offsite backups can also be coupled with a frequent sync of the binlog files (transaction logs). These in combination with your full database dump will allow you to perform point-in-time recovery to the exact point the outage began, further reducing potential data loss.
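Point-in-time recovery means restoring the newest full dump taken before the outage, then replaying the binlogs written since. Here is a rough sketch of picking those files. The data layout and the name `recovery_plan` are assumptions for illustration, and the actual restore and replay commands are left out.

```python
from datetime import datetime


def recovery_plan(dumps, binlogs, outage_start):
    """Pick the newest full dump taken before the outage and the binlogs to replay on top of it.

    dumps:   list of (taken_at, filename)
    binlogs: list of (first_event_at, last_event_at, filename)
    """
    usable = [d for d in dumps if d[0] <= outage_start]
    if not usable:
        raise ValueError("no full dump predates the outage")
    dump_time, dump_file = max(usable)
    # Keep only binlogs holding events after the dump and before the outage.
    replay = [name for first, last, name in sorted(binlogs)
              if last > dump_time and first < outage_start]
    return dump_file, replay
```

The last binlog in the list will usually run past the outage time, so the replay has to stop at that timestamp rather than at the end of the file.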
Now that we’ve had a chance to take a deep breath after last week’s AWS outage, I’ll offer some comments of my own. Hopefully just enough time has passed to begin to have a broader view, and put events in perspective.
Despite what some reports may have announced, Amazon wasn’t down, but rather a small part of Amazon Web Services went down. A failure, yes. Beyond their service level agreement of 99.95% yes also. Survivable, yes to this last question too.
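To put that 99.95% figure in perspective, here is the arithmetic as a quick sketch. It assumes the SLA is measured over a 365-day year, which is an assumption on my part for the example.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given availability percentage permits over the period."""
    return period_hours * (1 - sla_percent / 100)


print(round(allowed_downtime_hours(99.95), 2))  # 4.38
```

So under that assumption anything beyond roughly four and a half hours of downtime in a year is outside the agreement.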
Learning From Failure
The business management conversation du jour is all about learning from failure, rather than trying to avoid it. Harvard Business Review’s April issue headlined with “The Failure Issue – How to Understand It, Learn From It, and Recover From It”. The Economist’s April 16th issue had some similarly interesting pieces, one by Schumpeter, “Fail often, fail well”, and another in the April 23rd issue, “Lessons from Deepwater Horizon and Fukushima”.
With all this talk of failure there is surely one takeaway. Complex systems will fail and it is in the anticipation of that failure that we gain the most. Let’s stop howling and look at how to handle these situations intelligently.
How Do You Rebuild A Website?
In the cloud you will likely need two things. (a) scripts to rebuild all the components in your architecture, spinup servers, fetch source code, fetch software and configuration files, configure load balancers and mount your database and more importantly (b) a database backup from which you can rebuild your current dataset.
Want to stick with EC2? Build out your infrastructure in an alternate availability zone or region and you’re back up and running in hours. Or better yet have an alternate cloud provider on hand to handle these rare outages. The choice is yours.
Mitigate risk? Yes indeed failure is more common in the cloud, but recovery is also easier. Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!
Want to see an extreme example of how this can play in your favor? Read Jeff Atwood’s discussion of so-called Chaos Monkey, a component whose sole job is to kill off servers in the Netflix environment at random. Now that type of gunslinging will surely keep everyone on their toes! Here’s a Wired article that discusses Chaos Monkey.
George Reese of enStratus discusses the recent failure at length. Though I would argue with calling Amazon’s outage the Cloud’s Shining Moment, all of his points are wise and this is the direction we should all be moving.
Going The Way of Commodity Hardware
Though it is still not obvious to everyone, I’ll spell it out loud and clear. Like it or not, the cloud is coming. Look at these numbers.
Furthermore the recent outage also highlights how much and how many internet sites rely on cloud computing, and Amazon EC2.
Way back in 2001 I authored a book on O’Reilly called “Oracle and Open Source”. In it I discussed the technologies I was seeing in the real world. Oracle on the backend and Linux, Apache, and PHP, Perl or some other language on the frontend. These were the technologies that startups were using. They were fast, cheap and with the right smarts reliable too.
Around that time Oracle started smelling the coffee and ported its enterprise database to Linux. The equation for them was simple. Customers that were previously paying tons of money to their good friend and confidant Sun for hardware, could now spend 1/10th as much on hardware and shift a lot of that left over cash to – you guessed it, Oracle! The hardware wasn’t as good, but who cares because you can get a lot more of it.
Despite a long entrenched and trusted brand like Sun being better and more reliable, guess what? Folks still switched to commodity hardware. Now this is so obvious, no one questions it. But the same trend is happening with cloud computing.
Performance is variable, disk I/O can be iffy, and what’s more the recent outage illustrates front and center, the servers and network can crash at any moment. Who in their right mind would want to move to this platform?
If that’s the question you’re stuck on, you’re still stuck on the old model. You have not truly comprehended the power to build infrastructure with code, to provision through automation, and to really embrace managing those components as software. As the internet itself has the ability to route around political strife, and network outages, so too does cloud computing bring that power to mom & pop web shops.
- Have existing investments in hardware? Slow and cautious adoption makes most sense for you.
- Have seasonal traffic variations? An application like this is uniquely suited to the cloud. In fact some of the gaming applications which can autoscale to 10x or 100x servers under load, are newly solvable with the advent of cloud computing.
- Are you currently paying a lot for disaster recovery systems that primarily lie idle? Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.
Best practices for backups and disaster recovery aren’t tremendously different in the cloud than in a managed hosting environment. But they are more crucial since cloud servers are less reliable than physical servers. Also the security aspect may play a heightened role in the cloud. Here are some points to keep in mind.
Read the original article – Intro to EC2 Cloud Deployments.
1. Perform multiple types of backups
2. Keep non-proprietary backups offsite
3. Test your backups – perform firedrills
4. Encrypt backups in S3
5. Perform Replication Integrity Checks
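As an illustration of the last point, a replication integrity check boils down to comparing checksums of the same table on the primary and on the replica. The sketch below works on in-memory rows to show the idea. Both function names are hypothetical, and against a live MySQL setup you would more likely reach for a purpose-built tool such as Percona’s pt-table-checksum.

```python
import hashlib


def table_checksum(rows):
    """Checksum a table's rows; rows are sorted first so retrieval order does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def drifted_tables(primary, replica):
    """Return names of tables whose contents differ between primary and replica.

    Both arguments map table name -> list of row tuples.
    """
    names = set(primary) | set(replica)
    return sorted(
        name for name in names
        if table_checksum(primary.get(name, [])) != table_checksum(replica.get(name, []))
    )
```

Any table name that comes back is one where the replica has silently drifted, and a backup taken from that replica should not be trusted until it is resynced.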
Security is on everyone’s mind when talking about the cloud. What are some important considerations?
For the web operations team:
- AWS has no perimeter security, should this be an overriding concern?
- How do I manage authentication keys?
- How do I harden my machine images?
Amazon’s security groups can provide strong security if used properly. Create security groups with specific minimum privileges, and do not expose your sensitive data, i.e. your database, to the internet directly, but only to other security groups. On the positive side, AWS security groups mean there is no single point to mount an attack against as with a traditional enterprise’s network security. What’s more there is no opportunity to accidentally erase network rules since they are defined in groups in AWS.
Authentication keys can be managed in a couple of different ways. One way is to build them into the AMI. From there any server spinup based on that AMI will be accessible by the owner of those credentials. Alternatively a more flexible approach would be to pass in the credentials when you spinup the server, allowing you to dynamically control who has access to that server.
Hardening your AMIs in EC2 is much like hardening any Unix or Linux server. Disable unneeded user accounts, ssh password authentication, and unnecessary services. Consider a tool like AppArmor to fence applications in and keep them out of areas they don’t belong. This can be an ongoing process that is repeated if the unfortunate happens and you are compromised.
You should also consider:
- AWS password recovery mechanism is not as secure as a traditional managed hosting provider. Use a very strong password to lock down your AWS account and monitor its usage.
- Consider encrypted filesystems for your database mount point. Pass in decryption key at server spinup time.
- Consider storing particularly sensitive data outside of the cloud and exposing it through an SSL API call.
- Consider encrypting your backups. S3 security is not proven.
For CTOs and Operations Managers:
- Where is my data physically located?
- Should I rely entirely on one provider?
- What if my cloud provider does not sufficiently protect the network?
Although you do not know where your data is physically located in S3 and EC2, you have the choice of whether or not to encrypt your data and/or the entire filesystem. You also control access to the server. So from a technical standpoint it may not matter whether you control where the server is physically. Of course laws, standards and compliance rules may dictate otherwise.
You also don’t want to put all your eggs in one basket. There are all sorts of things that can happen to a provider, from going out of business, to lawsuits that directly or indirectly affect you, to even political pressure as in the WikiLeaks case. A cloud provider may well choose the easier road and pull the plug rather than deal with any complicated legal entanglements. For all these reasons you should be keeping regular backups of your data either on in-house servers, or alternatively at a second provider.
As a further insurance option, consider host intrusion detection software. This will give you additional peace of mind against the potential of your cloud provider not sufficiently protecting their own network.
Additionally consider that:
- A simple password recovery mechanism in AWS is all that sits between a hacker and your infrastructure. Choose a very secure password, and monitor its usage.
- EC2 servers are not nearly as reliable as traditional physical servers. Test your deployment scripts, and your disaster recovery scenarios again and again.
- Responding to a compromise will be much easier in the cloud. Spinup the replacement server, and keep the EBS volume around for later analysis.
As with any new paradigm there is an element of the unknown and unproven which we are understandably concerned about. Cloud hosted servers and computing can be just as secure if not more secure than traditional managed servers, or servers you can physically touch in-house.
George Reese’s book doesn’t have the catchiest title, but the book is superb. One thing to keep in mind, it is not a nuts and bolts or howto type of book. Although there is a quick intro to EC2 APIs etc, you’re better off looking at the AWS docs, or Jeff Barr’s book on the subject. Reese’s book is really all about answering difficult questions involving cloud deployments.
Cloud Computing holds a lot of promise, but there are also a lot of speed bumps in the road along the way.
In this six part series we’re going to cover a lot of ground. We don’t intend this series to be an overly technical nuts and bolts howto. Rather we will discuss high level issues and answer questions that come up for CTOs, business managers, and startup CEOs.
Some of the tantalizing issues we’ll address include:
- How do I make sure my application is built for the cloud with scalability baked into the architecture?
- I know disk performance is crucial for my database tier. How do I get the best disk performance with Amazon Web Services & EC2?
- How do I keep my AWS passwords, keys & certificates secure?
- Should I be doing offsite backups as well, or are snapshots enough?
- Cloud providers such as Amazon seem to have poor SLAs (service level agreements). How do I mitigate this using availability zones & regions?
- Cloud hosting environments like Amazon’s provide no perimeter security. How do I use security groups to ensure my setup is robust and bulletproof?
- Cloud deployments change the entire procurement process, handing a lot of control over to the web operations team. How do I ensure that finance and ops are working together, and a ceiling budget is set and implemented?
- Reliability of Amazon EC2 servers is much lower than traditional hosted servers. Failure is inevitable. How do we use this fact to our advantage, forcing discipline in the deployment and disaster recovery processes? How do I make sure my processes are scripted & firedrill tested?
- Snapshot backups and other data stored in S3 are somewhat less secure than I’d like. Should I use encryption to protect this data? When and where should I use encrypted filesystems to protect my more sensitive data?
- How can I best use availability zones and regions to geographically disperse my data and increase availability?
As we publish each of the individual articles in this series we’ll link them to the titles below. So check back soon!