8 Questions to ask an AWS Expert


If you’re headhunting a cloud computing expert, specifically someone who knows Amazon Web Services (AWS) and EC2, you’ll want a battery of questions to assess their knowledge.  As with any technical interview, focus on concepts and the big picture.  As the 37signals folks like to say, “hire for attitude, train for skill”.  Absolutely!





New: Top questions for hiring a serverless lambda expert

Also new: Top questions to ask on a devops expert interview

And: How to hire a developer that doesn’t suck

If you want more general info about Amazon Web Services, read our Intro to EC2 Deployments.

    1. What is Elastic Block Storage (EBS)?  What kind of performance can you expect?  How do you back it up?  How do you improve performance?

    EBS is a virtualized SAN, or storage area network.  That means it is RAID storage to start with, so it’s redundant and fault tolerant.  If disks die in that RAID array, you don’t lose data.  Great!  It is also virtualized, so you can provision and allocate storage, and attach it to your server with various API calls.  No calling the storage expert and asking him or her to run specialized commands from the hardware vendor.

    Need help? Check out my pricing page or email sean@iheavy.com

    Performance on EBS can exhibit variability.  That is, it can go above the SLA performance level, then drop below it.  The SLA provides you with an average disk I/O rate you can expect.  This can frustrate some folks, especially performance experts who expect reliable and consistent disk throughput on a server.  Traditional physically hosted servers behave that way.  Virtual AWS instances do not.

    Related: Is Amazon too big to fail?

    Back up EBS volumes by using the snapshot facility, via an API call or via a GUI like elasticfox.
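    These days the AWS command line can drive the snapshot facility too.  Here’s a minimal sketch; the volume id is a placeholder, and the `run` wrapper just prints each command so nothing fires without real credentials.

```shell
#!/bin/bash
# Back up an EBS volume by taking a snapshot.
# vol-0abc123 is a placeholder; substitute your own volume id.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

run aws ec2 create-snapshot --volume-id vol-0abc123 \
    --description "nightly backup of data volume"

# List existing snapshots for that volume:
run aws ec2 describe-snapshots --filters Name=volume-id,Values=vol-0abc123
```

Snapshots are incremental at the block level, so taking them nightly is cheap after the first one.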

    Improve performance by using Linux software RAID and striping across four volumes.
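    Concretely, the striping might look like this mdadm RAID 0 sketch.  The device names are assumptions (check lsblk on your instance), and the commands are printed rather than executed, since they need root on a live box.

```shell
#!/bin/bash
# Stripe four EBS volumes into a single RAID 0 device with mdadm.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

# Device names /dev/xvdf..xvdi are assumptions; verify with lsblk first.
run mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
run mkfs.ext4 /dev/md0
run mount /dev/md0 /data
```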

    2. What is S3?  What is it used for? Should encryption be used?

    S3 stands for Simple Storage Service.  You can think of it like FTP storage: you can move files to and from there, but not mount it like a filesystem.  AWS automatically puts your snapshots there, as well as AMIs.  Encryption should be considered for sensitive data, as S3 is a proprietary technology developed by Amazon itself, and as yet unproven from a security standpoint.

    Also: How careful notes are helping me work better with clients

    3. What is an AMI?  How do I build one?

    AMI stands for Amazon Machine Image.  It is effectively a snapshot of the root filesystem.  Commodity hardware servers have a BIOS that points to the master boot record of the first block on a disk.  A disk image, though, can sit anywhere physically on a disk, so Linux can boot from an arbitrary location on the EBS storage network.


    Build a new AMI by first spinning up an instance from a trusted AMI, then adding packages and components as required.  Be wary of putting sensitive data onto an AMI.  For instance, your access credentials should be added to an instance after spinup.  With a database, mount an outside volume that holds your MySQL data after spinup as well.
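    Sketched with the AWS CLI, the bake process looks roughly like this.  Both ids are placeholders, and the wrapper prints commands rather than running them.

```shell
#!/bin/bash
# Bake a new AMI: start from a trusted image, configure, then register it.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

# 1. Spin up from a trusted base AMI (placeholder id).
run aws ec2 run-instances --image-id ami-0trustedbase --instance-type t2.micro

# 2. Install packages on the instance; keep credentials and data off it.

# 3. Register the configured instance as a new image (placeholder id).
run aws ec2 create-image --instance-id i-0abc123 --name "my-baked-ami"
```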

    4. Can I vertically scale an Amazon instance? How?

    Yes.  This is an incredible feature of AWS and cloud virtualization.  Spin up a new, larger instance than the one you are currently running.  Pause that new instance, detach its root EBS volume, and discard it.  Then stop your live instance and detach its root volume.  Note the unique device ID, attach that root volume to your new server, and start it again.  Voila, you have scaled vertically in place!
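    The steps above sketch out as follows with the AWS CLI.  Instance and volume ids are placeholders, and the wrapper prints rather than executes.

```shell
#!/bin/bash
# Vertical scaling: move the root volume onto a larger instance.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

# Placeholders: i-0old (live box), i-0new (larger box),
# vol-0newroot (discarded), vol-0root (your real root volume).
run aws ec2 stop-instances --instance-ids i-0new
run aws ec2 detach-volume --volume-id vol-0newroot      # discard this one
run aws ec2 stop-instances --instance-ids i-0old
run aws ec2 detach-volume --volume-id vol-0root
run aws ec2 attach-volume --volume-id vol-0root \
    --instance-id i-0new --device /dev/sda1
run aws ec2 start-instances --instance-ids i-0new
```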

    5. What is auto-scaling? How does it work?

    Autoscaling is a feature of AWS which allows you to configure and automatically provision and spin up new instances without the need for your intervention.  You do this by setting thresholds and metrics to monitor.  When those thresholds are crossed, a new instance of your choosing will be spun up, configured, and rolled into the load balancer pool.  Voila, you’ve scaled horizontally without any operator intervention!
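    A hedged sketch of that setup with the AWS CLI: an autoscaling group, a scale-up policy, and a CloudWatch alarm to trigger it.  Names, sizes, and thresholds are all placeholders.

```shell
#!/bin/bash
# Autoscaling: group + scale-up policy + CPU alarm to trigger it.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

run aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name web-asg --min-size 2 --max-size 10 \
    --launch-configuration-name web-lc --load-balancer-names web-elb
run aws autoscaling put-scaling-policy --policy-name scale-up \
    --auto-scaling-group-name web-asg --scaling-adjustment 1 \
    --adjustment-type ChangeInCapacity
# Wire --alarm-actions to the policy ARN returned by the command above.
run aws cloudwatch put-metric-alarm --alarm-name high-cpu \
    --metric-name CPUUtilization --namespace AWS/EC2 \
    --statistic Average --period 300 --threshold 75 \
    --comparison-operator GreaterThanThreshold --evaluation-periods 2
```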

    Also: Are we fast approaching cloud-mageddon?

    With MySQL databases autoscaling can get a little dicey, so we wrote a guide to autoscaling MySQL on Amazon EC2.

    6. What automation tools can I use to spinup servers?

    The most obvious way is to roll your own scripts using the AWS API tools.  Such scripts could be written in bash, Python, or another language of your choice.  The next option is to use a configuration management and provisioning tool like Puppet or, better yet, its successor, Opscode Chef.  Ansible is also an excellent option because it doesn’t require an agent, and can run your shell scripts as-is.  You might also look toward CloudFormation or Terraform.  The resulting code captures your entire infrastructure, can be checked into your git repository & version controlled.  You can even unit test this way!
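    A roll-your-own bash sketch, using the AWS CLI as the modern descendant of those API tools.  The AMI id, key name, and instance id are placeholders; the wrapper prints instead of executing.

```shell
#!/bin/bash
# Roll-your-own provisioning: launch, wait, report the public IP.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

run aws ec2 run-instances --image-id ami-0base --instance-type t2.micro \
    --key-name mykey --user-data file://bootstrap.sh
# In a real script, capture the instance id from the run-instances output.
run aws ec2 wait instance-running --instance-ids i-0newbox
run aws ec2 describe-instances --instance-ids i-0newbox \
    --query 'Reservations[].Instances[].PublicIpAddress'
```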

    7. What is configuration management?  Why would I want to use it with cloud provisioning of resources?

    Configuration management has been around for a long time in web operations and systems administration, yet its cultural popularity has been limited.  Most systems administrators configure machines the way software was developed before version control: by manually making changes on servers.  Each server then can be, and usually is, slightly different.  Troubleshooting, though, is straightforward, as you log in to the box and operate on it directly.  Configuration management brings a large automation tool into the picture, managing servers like the strings of a puppet.  This forces standardization, best practices, and reproducibility, as all configs are versioned and managed.  It also introduces a new way of working, which is the biggest hurdle to its adoption.

    Read: When hosting data on Amazon turns bloodsport

    Enter the cloud, and configuration management becomes even more critical.  That’s because virtual servers such as Amazon’s EC2 instances are much less reliable than physical ones.  You absolutely need a mechanism to rebuild them as-is at any moment.  This pushes best practices like automation, reproducibility, and disaster recovery onto center stage.

    While on the subject of configuration management, take a quick peek at our hiring a devops guide.

    8. Explain how you would simulate perimeter security using the Amazon Web Services model.

    Traditional perimeter security that we’re already familiar with, using firewalls and so forth, is not supported in the Amazon EC2 world.  AWS supports security groups instead.  One can create a security group for a jump box with ssh access – only port 22 open.  From there, a webserver group and a database group are created.  The webserver group allows 80 and 443 from the world, but port 22 *only* from the jump box group.  Further, the database group allows port 3306 from the webserver group and port 22 from the jump box group.  Add any machines to the webserver group and they can all hit the database.  No one from the world can, and no one can directly ssh to any of your boxes.
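    In CLI terms, that jump box / webserver / database layout might be sketched like this.  Group names and the office CIDR are placeholders, and the wrapper prints rather than executes.

```shell
#!/bin/bash
# Simulated perimeter: jump box, webserver, and database security groups.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

# ssh into the jump box only from the office network (placeholder CIDR)
run aws ec2 authorize-security-group-ingress --group-name jump \
    --protocol tcp --port 22 --cidr 203.0.113.0/24
# web traffic from anywhere; ssh only from the jump group
run aws ec2 authorize-security-group-ingress --group-name web \
    --protocol tcp --port 443 --cidr 0.0.0.0/0
run aws ec2 authorize-security-group-ingress --group-name web \
    --protocol tcp --port 22 --source-group jump
# MySQL only from the web group; ssh only from the jump group
run aws ec2 authorize-security-group-ingress --group-name db \
    --protocol tcp --port 3306 --source-group web
run aws ec2 authorize-security-group-ingress --group-name db \
    --protocol tcp --port 22 --source-group jump
```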

    The more full-featured way to go is VPC, Amazon’s acronym for virtual private cloud.  You can create virtual networks, both private & public, with subnets and so on, all within VPCs.  You then spin up servers & resources inside those virtual networks.  VPCs can be controlled with security groups or the more powerful but messy access control lists.

    Also: A history lesson for cloud detractors – January 2012

    Want to further lock this configuration down?  Only allow ssh access from specific IP addresses on your network, or allow just your subnet.


The New Commodity Hardware Craze aka Cloud Computing

Does anyone remember fifteen years ago, when the dot-com boom was just starting?  A lot of companies were running on Sun.  Sun was the best hardware you could buy for the price.  It was reliable, and a lot of engineers had experience with the operating system, SunOS, a flavor of Unix.

Yet suddenly companies were switching to cheap crappy hardware.  The stuff failed more often, had lower quality control, and cheaper and slower buses.  Despite all of that, cutting edge firms and startups were moving to commodity hardware in droves.  Why was it so? Continue reading “The New Commodity Hardware Craze aka Cloud Computing”

7 Ways to Troubleshoot MySQL

MySQL databases are great workhorses of the internet.  They back tons of modern websites, from blogs and checkout carts to huge sites like Facebook.  But these technologies don’t run themselves.  When you’re faced with a system that is slowing down, you’ll need the right tools to diagnose and troubleshoot the problem.  MySQL has a huge community following, and that means scores of great tools for your toolbox. Here are 7 ways to troubleshoot MySQL. Continue reading “7 Ways to Troubleshoot MySQL”

5 Ways to Avoid EC2 Outages

1. Backup outside of the Cloud

Some of the high profile companies affected by Amazon’s April 2011 outage could have recovered had they kept a backup of their entire site outside of the cloud.  With any hosting provider, managed traditional data center or cloud provider, alternate backups are always a good idea.  A MySQL logical backup and/or incremental backup can be copied regularly offsite or to an alternate cloud provider.  That’s real insurance! Continue reading “5 Ways to Avoid EC2 Outages”
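A minimal sketch of such an offsite copy, assuming mysqldump and scp with placeholder hostnames and paths; the wrapper prints the commands rather than running them.

```shell
#!/bin/bash
# Nightly logical backup of MySQL, shipped outside the cloud.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

# --single-transaction gives a consistent InnoDB dump without locking.
# When running for real, redirect the dump to /backup/nightly.sql.
run mysqldump --single-transaction --all-databases
run gzip /backup/nightly.sql
# offsite.example.com is a placeholder for your non-cloud backup host.
run scp /backup/nightly.sql.gz backup@offsite.example.com:/backups/
```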

Review – Test Driven Infrastructure with Chef – Stephen Nelson-Smith

In search of a good book on Chef itself, I picked up this new title on O’Reilly.  It’s one of their new format books, small in size, only 75 pages.

There was some very good material in this book.  Mr. Nelson-Smith’s writing style is good, readable, and informative.  The discussion of risks of infrastructure as code was instructive.  With the advent of APIs to build out virtual data centers, the idea of automating every aspect of systems administration, and building infrastructure itself as code is a new one.  So an honest discussion of the risks of such an approach is bold and much needed.  I also liked the introduction to Chef itself, and the discussion of installation.

Chef isn’t really the main focus of this book, unfortunately.  The book spends a lot of time introducing us to Agile Development, and specifically test driven development.  While these are lofty goals, and the first time I’ve seen treatment of the topic in relation to provisioning cloud infrastructure, I did feel too much time was spent on that.  Continue reading “Review – Test Driven Infrastructure with Chef – Stephen Nelson-Smith”

Amazon Web Services – What is it and why is it important?

Amazon Web Services is a division of Amazon, the bookseller, but this part of the business is devoted solely to infrastructure and internet servers.  These are the building blocks of data centers, the workhorses of the internet.  AWS’s offering of cloud computing solutions allows a business to set up, or “spinup” in the jargon of cloud computing, new compute resources at will.  Need a small single-CPU 32-bit Ubuntu server with two 20G disks attached?  One command and 30 seconds later, you can have that!

As we discussed previously, infrastructure provisioning has evolved dramatically over the past fifteen years, from something that took time and cost a lot to the fast, automatic process it is today with cloud computing.  This has also brought with it a dramatic culture shift in the way systems administration is done: from a fairly manual process of physical machines and software configuration, one that took weeks to set up new services, to a scriptable and automatable process that can take seconds.

This new realm of cloud computing infrastructure and provisioning is called Infrastructure as a Service or IaaS, and Amazon Web Services is one of the largest providers of such compute resources.  They’re not the only ones of course.  Others include:

  • Rackspace Cloud
  • Joyent
  • GoGrid
  • Terremark
  • 3Tera
  • IBM
  • Microsoft
  • Enomaly
  • AT&T

Cloud computing is still in its infancy, but is growing quickly.  Amazon themselves had a major data center outage in April that we discussed in detail. It sent some hot internet startups into a tailspin!

More discussion of Amazon Web Services on Quora – Sean Hull

IOPs – What is it and why is it important?

IOPs are an attempt to standardize comparison of disk speeds across different environments.  When you turn on a computer, everything must be read from disk, but thereafter things are kept in memory.  However applications typically read and write to disk frequently.  When you move to enterprise class applications, especially relational databases, a lot of disk I/O is happening so performance of disk resources is crucial.

For a basic single SATA drive that you might have in a server or laptop, you can typically get 30-40 IOPs from it.  These numbers vary if you are talking about random versus sequential reads or writes.  Picture the needle on a vinyl record: the groove passes under it quicker around the outside, and slower toward the center.  The same is true of the magnetic head inside your hard drive, which is why outer tracks deliver data faster.

In the Amazon EC2 environment, there is a lot of variability in performance from EBS.  You can stripe across four separate EBS volumes, which will be in four different locations on the underlying RAID array, and you’ll get a big boost in disk I/O.  Disk performance will also vary across the m1.small, m1.large, and m1.xlarge instance types, with the latter getting the lion’s share of network bandwidth, and so better disk I/O performance.  But in the end your best EBS performance will be in the range of 500-1000 IOPs.  That’s not huge by physical hardware standards, so an extremely disk-intensive application will probably not perform well in the Amazon cloud.
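The arithmetic behind that range is simple back-of-the-envelope stuff: assume a rough per-volume figure (the 250 here is an assumption from the era of this post; real numbers vary widely) and multiply by the stripe width.

```shell
#!/bin/bash
# Back-of-envelope: a RAID 0 stripe multiplies per-volume throughput.
PER_VOLUME_IOPS=250   # assumed rough EBS figure; varies in practice
STRIPE_WIDTH=4
TOTAL_IOPS=$((PER_VOLUME_IOPS * STRIPE_WIDTH))
echo "estimated stripe throughput: ~$TOTAL_IOPS IOPs"
```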

Still, the economic pressures and the infrastructure and business flexibility continue to push cloud computing adoption, so expect the trend to continue.

Quora discussion – What are IOPs and why are they important?

Migrating to the Cloud – Why and why not?

A lot of technical forums and discussions have highlighted the limitations of EC2 and how it loses on performance when compared to physical servers of equal cost.  They argue that you can get much more hardware and bigger iron for the same money, so it then seems foolhardy to turn to the cloud.  Why this mad rush to the cloud, then?  Of course, if all you’re looking at is performance, it might seem odd indeed.  But another way of looking at it is: if performance is not as good, it’s clearly not the driving factor in cloud adoption.

CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”

Another question, “Is it a good idea to deploy your database in the cloud?”  It depends!  Let’s take a look at some of the strengths and weaknesses, then you decide.

8 big strengths of the cloud

  1. Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
  2. Easier roll out of patches and upgrades
  3. Reduced operational headache – scripting and automation becomes central
  4. Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
  5. Low initial investment
  6. Auto-scaling – set thresholds and deploy new capacity automatically
  7. Easy compromise response – take server offline and spinup a new one
  8. Easy setup of dev, qa & test environments

Some challenges with deploying in the cloud

  1. Big cultural shift in how operations is done
  2. Lower SLAs and less reliable virtual servers – mitigate with automation
  3. No perimeter security – new model for managing & locking down servers
  4. Where is my data?  — concerns over compliance and privacy
  5. Variable disk performance – can be problematic for MySQL databases
  6. New procurement process can be a hurdle

Many of these challenges can be mitigated.  The promise of infrastructure deployed in the cloud is huge, so gradual adoption is perhaps the best option for many firms.  Mitigate the weaknesses of the cloud as follows:

  • Use encrypted filesystems and backups where necessary
  • Also keep offsite backups inhouse or at an alternate cloud provider
  • Mitigate variable EBS performance – cache at every layer of your application stack
  • Employ configuration management & automation tools such as Puppet & Chef

Quora discussion – Why or why not to migrate to the cloud?

Amazon EC2 Outage – Failures, Lessons and Cloud Deployments

Now that we’ve had a chance to take a deep breath after last week’s AWS outage, I’ll offer some comments of my own.  Hopefully just enough time has passed to begin to have a broader view, and put events in perspective.
Despite what some reports may have announced, Amazon wasn’t down; rather, a small part of Amazon Web Services went down.  A failure?  Yes.  Beyond their service level agreement of 99.95%?  Yes also.  Survivable?  Yes to this last question too.

Learning From Failure

The business management conversation du jour is all about learning from failure, rather than trying to avoid it.  Harvard Business Review’s April issue headlined with “The Failure Issue – How to Understand It, Learn From It, and Recover From It”.  The Economist’s April 16th issue had some similarly interesting pieces, one by Schumpeter, “Fail often, fail well”, and another in the April 23rd issue, “Lessons from Deepwater Horizon and Fukushima”.  With all this talk of failure there is surely one takeaway: complex systems will fail, and it is in the anticipation of that failure that we gain the most.  Let’s stop howling and look at how to handle these situations intelligently.

How Do You Rebuild A Website?

In the cloud you will likely need two things: (a) scripts to rebuild all the components in your architecture (spin up servers, fetch source code, fetch software and configuration files, configure load balancers, and mount your database), and more importantly (b) a database backup from which you can rebuild your current dataset.
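Part (a) often boils down to a top-level rebuild script like the sketch below.  Every id, name, and URL here is a placeholder, and the wrapper prints rather than executes.

```shell
#!/bin/bash
# Disaster recovery skeleton: rebuild the site from scripts + a DB backup.
run() { echo "+ $*"; }   # dry-run wrapper: drop the echo to execute for real

run aws ec2 run-instances --image-id ami-0appimage --count 2   # app tier
run git clone https://example.com/ourapp.git /var/www/app      # source code
run aws elb register-instances-with-load-balancer \
    --load-balancer-name web-elb --instances i-0app1 i-0app2
run mysql -e "source /backups/latest.sql"                      # restore data
```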

Want to stick with EC2?  Build out your infrastructure in an alternate availability zone or region and you’re back up and running in hours.  Or better yet, have an alternate cloud provider on hand to handle these rare outages.  The choice is yours.

Mitigate risk?  Yes indeed failure is more common in the cloud, but recovery is also easier.  Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!

Want to see an extreme example of how this can play in your favor?  Read Jeff Atwood’s discussion of the so-called Chaos Monkey, a component whose sole job is to randomly kill off servers in the Netflix environment.  Now that type of gunslinging will surely keep everyone on their toes!  Here’s a Wired article that discusses Chaos Monkey.

George Reese of enStratus discusses the recent failure at length.  Though I would argue with calling Amazon’s outage the Cloud’s Shining Moment, all of his points are wise, and this is the direction we should all be moving.

Going The Way of Commodity Hardware

Though it is still not obvious to everyone, I’ll spell it out loud and clear.  Like it or not, the cloud is coming.  Look at these numbers.

Furthermore, the recent outage also highlights how much, and how many internet sites, rely on cloud computing and Amazon EC2.
Way back in 2001 I authored a book for O’Reilly called “Oracle and Open Source”.  In it I discussed the technologies I was seeing in the real world: Oracle on the backend and Linux, Apache, and PHP, Perl, or some other language on the frontend.  These were the technologies startups were using.  They were fast, cheap, and, with the right smarts, reliable too.

Around that time Oracle started smelling the coffee and ported its enterprise database to Linux.  The equation for them was simple.  Customers that were previously paying tons of money to their good friend and confidant Sun for hardware could now spend one tenth as much on hardware and shift a lot of that leftover cash to, you guessed it, Oracle!  The hardware wasn’t as good, but who cares, because you can get a lot more of it.

Despite a long-entrenched and trusted brand like Sun being better and more reliable, guess what?  Folks still switched to commodity hardware.  Now this is so obvious that no one questions it.  But the same trend is happening with cloud computing.

Performance is variable, disk I/O can be iffy, and what’s more the recent outage illustrates front and center, the servers and network can crash at any moment.  Who in their right mind would want to move to this platform?

If that’s the question you’re stuck on, you’re still stuck on the old model.  You have not truly comprehended the power of building infrastructure with code, provisioning through automation, and really embracing the management of those components as software.  Just as the internet itself has the ability to route around political strife and network outages, so too does cloud computing bring that power to mom & pop web shops.

Conclusions

  • Have existing investments in hardware?  Slow and cautious adoption makes most sense for you.
  • Have seasonal traffic variations?  An application like this is uniquely suited to the cloud.  In fact some of the gaming applications, which can autoscale to 10x or 100x servers under load, are newly solvable with the advent of cloud computing.
  • Are you currently paying a lot for disaster recovery systems that primarily lie idle?  Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.