When migrating to the cloud consider security and resource variability, the cultural shift for operations and the new cost model.
Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything. That’s why automated monitoring is so important.
So what should you monitor? You can divide up your monitoring into a couple of strategic areas. Just as with metrics collection, there is business & application level monitoring and then there is lower level system monitoring which is also important.
Business & Application Monitoring
- If a user is getting an error page or cannot connect
- If an e-commerce transaction is failing
- General service outages
- If a business goal is met – or not
- Page timeouts or slowness
Systems Level Monitoring
- Backups completed and success
- Error logs from database, webserver & other major services like email
- Database replication is running
- Webserver timeouts
- Database timeouts
- Replication failures – via error logs & checksum checks
- Memory, CPU, Disk I/O, Server load average
- Network latency
- Network security
Tools that can perform this type of monitoring include Nagios,
Now that we’ve had a chance to take a deep breath after last week’s AWS outage, I’ll offer some comments of my own. Hopefully just enough time has passed to begin to have a broader view, and put events in perspective.
Despite what some reports may have announced, Amazon wasn’t down, but rather a small part of Amazon Web Services went down. A failure, yes. Beyond their service level agreement of 99.95% yes also. Survivable, yes to this last question too.
Learning From Failure
The business management conversation du jour is all about learning from failure, rather than trying to avoid it. Harvard Business Review’s April issue headlined with “The Failure Issue – How to Understand It, Learn From It, and Recover From It”. The economist’s April 16th issue had some similarly interesting pieces one by Schumpeter “Fail often, fail well”,
and another in April 23rd issue “Lessons from Deepwater Horizon and Fukushima”.
With all this talk of failure there is surely one takeaway. Complex systems will fail and it is in the anticipation of that failure that we gain the most. Let’s stop howling and look at how to handle these situations intelligently.
How Do You Rebuild A Website?
In the cloud you will likely need two things. (a) scripts to rebuild all the components in your architecture, spinup servers, fetch source code, fetch software and configuration files, configure load balancers and mount your database and more importantly (b) a database backup from which you can rebuild your current dataset.
Want to stick with EC2, build out your infrastructure in an alternate availability zone or region and you’re back up and running in hours. Or better yet have an alternate cloud provider on hand to handle these rare outages. The choice is yours.
Mitigate risk? Yes indeed failure is more common in the cloud, but recovery is also easier. Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!
Want to see an extreme example of how this can play in your favor? Read Jeff Atwood’s discussion of so-called Chaos Monkey, a component whose sole job it is to randomly kill off servers in the Netflix environment at random. Now that type of gunslinging will surely keep everyone on their toes! Here’s a Wired article that discusses Chaos Monkey.
George Reese of enStratus discusses the recent failure at length. The I would argue calling Amazon’s outage the Cloud’s Shing Moment, all of his points are wisened and this is the direction we should all be moving.
Going The Way of Commodity Hardware
Though it is still not obvious to everyone, I’ll spell it out loud and clear. Like it or not, the cloud is coming. Look at these numbers.
Furthermore the recent outage also highlights how much and how many internet sites rely on cloud computing, and Amazon EC2.
Way back in 2001 I authored a book on O’Reilly called “Oracle and Open Source”. In it I discussed the technologies I was seeing in the real world. Oracle on the backend and Linux, Apache, and PHP, Perl or some other language on the frontend. These were the technologies that startups were using. They were fast, cheap and with the right smarts reliable too.
Around that time Oracle started smelling the coffee and ported it’s enterprise database to Linux. The equation for them was simple. Customers that were previously paying tons of money to their good friend and confidant Sun for hardware, could now spend 1/10th as much on hardware and shift a lot of that left over cash to – you guessed it Oracle! The hardware wasn’t as good, but who cares because you can get a lot more of it.
Despite a long entrenched and trusted brand like Sun being better and more reliable, guess what? Folks still switched to commodity hardware. Now this is so obvious, no one questions it. But the same trend is happening with cloud computing.
Performance is variable, disk I/O can be iffy, and what’s more the recent outage illustrates front and center, the servers and network can crash at any moment. Who in their right mind would want to move to this platform?
If that’s the question you’re stuck on, you’re still stuck on the old model. You have not truely comprehended the power to build infrastructure with code, to provision through automation, and really embrace managing those components as software. As the internet itself has the ability to route around political strife, and network outages, so too does cloud computing bring that power to mom & pop web shops.
- Have existing investments in hardware? Slow and cautious adoption makes most sense for you.
- Have seasonal traffic variations? An application like this is uniquely suited to the cloud. In fact some of the gaming applications which can autoscale to 10x or 100x servers under load, are newly solveable with the advent of cloud computing.
- Are you currently paying a lot for disaster recovery systems that primarily lay idle. Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.
Cloud Computing holds a lot of promise, but there are also a lot of speed bumps in the road along the way.
In this six part series we’re going to cover a lot of ground. We don’t intend this series to be an overly technical nuts and bolts howto. Rather we will discuss high level issues and answer questions that come up for CTOs, business managers, and startup CEOs.
Some of the tantalizing issues we’ll address include:
- How do I make sure my application is built for the cloud with scalability baked into the architecture?
- I know disk performance is crucial for my database tier. How do I get the best disk performance with Amazon Web Services & EC2?
- How do I keep my AWS passwords, keys & certificates secure?
- Should I be doing offsite backups as well, or are snapshots enough?
- Cloud providers such as Amazon seem to have poor SLAs (service level agreements). How do I mitigate this using availability zones & regions?
- Cloud hosting environments like Amazons provide no perimeter security. How do I use security groups to ensure my setup is robust and bulletproof?
- Cloud deployments change the entire procurement process, handing a lot of control over to the web operations team. How do I ensure that finance and ops are working together, and a ceiling budget is set and implemented?
- Reliability of Amazon EC2 servers is much lower than traditional hosted servers. Failure is inevitable. How do we use this fact to our advantage, forcing discipline in the deployment and disaster recovery processes? How do I make sure my processes are scripted & firedrill tested?
- Snapshot backups and other data stored in S3 are somewhat less secure than I’d like. Should I use encryption to protect this data? When and where should I use encrypted filesystems to protect my more sensitive data?
- How can I best use availability zones and regions to geographically disperse my data and increase availability?
As we publish each of the individual articles in this series we’ll link them to the titles below. So check back soon!