Tag Archives: aws

A CTO Must Never Do This…

A couple years back I was contacted to look at a very strange problem.

The firm ran flash sales. An email goes out at noon, the website traffic explodes for a couple of hours, then settles back down to a trickle.

Of course you might imagine where this is going. During that peak, the MySQL database was brought to its knees. I was asked to do analysis during this peak load, and identify and fix problems. Make it go faster, please!

First day on the job I’m working with a team of outsourced DBAs. I was also working with a sort of swat team chatting on SKYPE, while monitoring the systems closely.

Then up popped one comment from a gentlemen I hadn’t worked with. He insisted there was contention for a little known MySQL resource called the AUTO_INC lock. Since I wanted to know more, I asked who the guy was and to my surprise he turned out to be the CTO.

[quote]The CTO was tuning and troubleshooting the database![/quote]

Wow, that’s a first. I thought I’d seen it all. A CTO is normally overseeing technology & the team rather than crawling around in the trenches on the front line.

This all raised some important points

1. The app was having major growing pains
2. Current architecture was not scaling
3. Amazon elasticity was not helping at the database layer
4. People & process were also failing, hence the CTOs hands on approach

It was shocking to see a problem deteriorate to this point, but when you consider the context its understandable. A company like this is struggling with hypergrowth to such a degree, that each day seems like a hurricane storm. With emergency meetings, followed by hardware & application emergencies, trouble seems constant. It can be very difficult to step back and see the larger picture.

The takeaway from this experience…

o Amazon EC2 can’t do it all – consider physical servers for disk intensive apps
o MySQL still has some real scalability limitations
o use technology for its intended purpose – MySQL isn’t great for queueing
o A CTO tuning the database means problems have deteriorated too far

Read all the way to the end? Grab our newsletter – scalable startups.

3 things CEOs should know about the Cloud

You’ve heard all the buzz and spiel about the cloud, and there’re good reasons to want to get there. On-demand compute power makes new levels of scalability possible. Low up front costs means moving capital expenditure to operating expenditure and saving a bundle in the process. We won’t give you anymore of the rah rah marketing hoopla. You’ve heard enough of that. We’ll gently play devil’s advocate for a moment, and give you a few things to think about when deploying applications with a cloud provider. Our focus is mainly on Amazon EC2.

You might also be interested in a wide reaching introduction to deploying on Amazon EC2.

  1. Funky Performance
  2. One of the biggest hurdles we see clients struggle with on Amazon EC2 is performance. This is rooted in the nature of shared resources. Computer servers, just like desktops rely on CPUs, Memory, Network and Disk. In the virtual datacenter, you can be given more than your fair share without you even knowing it. More bandwidth, more CPU, more disk? Who would complain? Well if your application behaves erratically, while you suddenly compete for disk resources you’ll quickly feel the flip side of that coin. Stocks go up, and they can just as easily come right back down.

    Variability around disk I/O seems to be the one that hits applications the hardest, especially the database tier of many web applications. If your application requires extremely high database transaction throughput, you would do well to consider physical servers and a real RAID array to host your database server. Read more about IOPs

  3. Uncertain Reliability – A Loaded Gun
  4. Everybody has heard the saying, don’t hand someone a loaded gun. In the case of Amazon servers, you really do load your applications onto fickle and neurotic servers.

    Imagine you open a car rental business. You could have two brand new fully reliable cars to rent out to customers. Your customers would be very happy, but you’d have a very small business. Alternatively you could have twenty used Pintos. You’d have some breaking down a lot, but as long as you keep ten of them rented at a time, your business is booming.

    In the Amazon world you have all the tools to keep your Ford Pintos running, but it’s important to think long and hard about reliability, redundancy, and automation. Read more about Failures, Lessons & the Chaos Monkey

  5. Iffy Support
  6. Managed hosting providers vary drastically in terms of the support you can expect. Companies like Rackspace, Servint or Datapipe have Support built into their DNA. They’ve grown up around having a support tech that your team can reach when they’re having trouble.

    Amazon takes the opposite approach. They give you all the tools to do everything yourself. But in a crunch it can be great to have that service available to help troubleshoot and diagnose a problem. Although they’re now offering support contracts, it’s not how they started out.

    If you have a crack operations team at your disposal, or you hire a third party provider like Heavyweight Internet Group Amazon Web Services gives you the flexibility and power to build phenomenal and scalable architectures. But if you’re a very small team without tons of technical know-how, you may well do better with a service-oriented provider like Rackspace et al.

A few more considerations…

  • Will your cloud provider go out of business?
  • Could a Subpoena against your provider draw you into the net?
  • Since you don’t know where your sensitive data is, should you consider encryption?
  • Should I keep additional backups outside of the cloud?
  • Should I use multiple cloud providers?
  • Should I be concerned about the lack of perimeter security?

3 Ways to Boost Cloud Scalability

Deploying in the Amazon cloud is touted as a great way to achieve high scalability while paying only for the computing power you use. How do you get the best scalability from the technology? Continue reading 3 Ways to Boost Cloud Scalability

Review – Test Driven Infrastructure with Chef – Stephen Nelson-Smith

In search of a good book on Chef itself, I picked up this new title on O’Reilly.  It’s one of their new format books, small in size, only 75 pages.

There was some very good material in this book.  Mr. Nelson-Smith’s writing style is good, readable, and informative.  The discussion of risks of infrastructure as code was instructive.  With the advent of APIs to build out virtual data centers, the idea of automating every aspect of systems administration, and building infrastructure itself as code is a new one.  So an honest discussion of the risks of such an approach is bold and much needed.  I also liked the introduction to Chef itself, and the discussion of installation.

Chef isn’t really the main focus of this book, unfortunately.  The book spends a lot of time introducing us to Agile Development, and specifically test driven development.  While these are lofty goals, and the first time I’ve seen treatment of the topic in relation to provisioning cloud infrastructure, I did feel too much time was spent on that.  Continue reading Review – Test Driven Infrastructure with Chef – Stephen Nelson-Smith

Root Cause Analysis – What is it and why is it important?

Root Cause Analysis is the means to identify the ultimate source and cause of an outage.  When an outage occurs that causes serious downtime of a website, typically organizations are in crisis mode.  Urgency of resolution sometimes pushes aside due process, change management and general caution.  Root Cause Analysis attempts to as much as possible isolate logfiles, configurations, and the current state of systems for later analysis.

With traditional physical servers, physical hardware failure, operator error, or a security breach can cause outages.  Since you’re dealing with one physical machine, resolving that issue necessarily means moving around the things that broke.  So caution and later analysis must be balanced with the immediate problem resolution.

Another silver lining in cloud hosted solutions is around root cause analysis.  If a server was breached for example, that server can immediately be shutdown, while maintaining it’s current state as a disk or EBS snapshot.  A new server can then be fired up from a AMI image, then your server rebuilt from scripts or template and you’re back up and running.  Save the snapshot then for later analysis.

This could be used for analysis of operator error related outages as well.  Hardware failures are more expected and common in cloud hosted environments, so this should and really must push adoption of best practices around infrastructure, that is having scripts at hand that rebuild everything from bare metal.

More discussion of root cause analysis by Sean Hull on Quora.

Decoupling – What is it and why is it important?

Processes are said to be coupled when they are tightly wound together, and dependent on one another.

A loose analogy might be replacing a traffic light by a traffic circle.  You keep the traffic moving, reducing the overall wait time for any car entering the intersection.

Decoupling web applications might involve replacing a makeshift queue your application currently implements in a table, with a message queuing service such as RabbitMQ or Amazon’s SQS.

Ultimately decoupling promotes scalability, as you can scale the pieces of your infrastructure that your capacity planning identifies to be bottlenecks.  What’s more you can make those pieces redundant, increasing high availability at the same time.

Sean Hull discusses on Quora: What is decoupling and why is it important?

Auto-scaling – What is it and why is it important?

With cloud-based hosting solutions, new servers can be provisioned and “spun up” with a few options on the command line.  This opens a whole new dimension for infrastructure, allowing software scripts to bring new computing power into your web infrastructure.

Internet based applications often exhibit seasonal traffic patterns where traffic stays steady or grows slowly over a period, but then experiences a sharp spike in demand requiring much higher computing resources to meet customer demand.

Enter auto-scaling, an even more powerful feature of cloud-based offerings.  Define roles for your webservers and database servers, set capacity rules that control how much traffic will trigger new servers to be rolled out, and watch your infrastructure scale automatically to meet the needs of your internet application.

What is disaster recovery and why is it important?

Disaster recovery involves the anticipation of major business outage, and the contingency planning to avoid business loss in revenue, customers or sales.

All of the technology components that make up your enterprise applications should be carefully considered against loss.  What happens if this database server disappears?  Do we have all the data backed up somewhere?  Have we tested that backup to restore it?  How long does it take to restore?  Can we reconnect the application to said database?  What if the network goes down?  How about if the whole datacenter goes out?

Planning for disaster recovery is important whether you’re hosted in-house or with a hosting provider.  Consider Amazon’s EC2 outage in April.  Various availability zones went out.  Were affected customers to have their database backed up properly – with offsite & tested copies, and further if they had other components such as webserver document roots, software configurations, etc they would be able to rebuild their entire infrastructure in an alternate availability zone or region.  Remember it was only a small component of Amazon Web Services which was out.

Sean Hull asks on Quora: Disaster Recovery – What is it and why is it important?

iHeavy Insights 79 – Plumbing the Interwebs

I meet new people all the time.  It’s a way of life in New York.  One of the first questions new people ask each other is “What do you do?”.  It begins to sound like a cliche after a while, but it can also provide endless fascinating discussions as there are so many people with different professions in New York.  Some choose a titled answer “i’m an investment banker”, “I’m an emcee”, “I’m an executive recruiter”.  I find for “Web Scalability Consultant” or “Web Operations Expert” this only leaves confused looks.

A Plumber By Another Name

The solution of course is to tell a good story.  Stories illustrate what titles and crusty vernacular cannot.  I’ve used analogies to surgeons or mechanics, of course they all operate on something people can related to in front of them.  People or vehicles we use everyday.  Of course with the internet, there is a huge hidden infrastructure that most people don’t see everyday.  They may vaguely know it’s there, but it’s still hidden out of site.

That’s why I think plumbing provides such an apt visual.  As it turns out the internet is built with countless data pipes both large and small, coming into your home or laying across the bottom of the transatlantic ocean.  These pipes plug into routers, high speed traffic lights and traffic cops.  Ultimately they feed into datacenters, huge rooms filled with racks of computers, holding your websites crown jewels.  Therein contains the images and status updates from your facebook profile, your banking transactions from your personal bank account or credit card, your netflix movie stream, or the email you sent via gmail.  Even your instant messaging stream, or the data from your favorite iphone app are all stored and retrieved from here.

Amazon Outage

The recent Amazon outage has been high profile enough that a lot of folks who don’t follow the latest trends in web operations, devops, and datacenter automation still heard about this event.  Turns out it’s had a silver lining for Amazon cause now everyone is scrutinizing how many sites actually rely on this goliath of a hosting provider.

As it turns out the root of the amazon outage was indeed a plumbing problem.  Amazon has shown rather high transparency publishing intimate details of the problem and it’s resolution.  Read more.

A misconfigured network cascaded through the system creating countless failures.  If you imagine water repairs being done in a large New York City building, they often ask tenants to turn off their water, so they won’t all come on at the same time when service is restored.  SImilarly intricate problems complicated the Amazon effort, slowing down attempts to restore everything after the incident.  I wrote at length about the outage if you’re interested, read more.

BOOK REVIEW:  Game-Based Marketing by Zicherman & Linder

There are so many new books coming out all the time, it’s tough to sift and find the good ones.  Anyone with a website as their storefront, whether they are a product company or a services company, can gain from reading this book.

From leaderboards to frequent flyer programs, badges and more this book is full of real-world examples where game-based principles are put into action.  On the internet where attention is a rarer and rarer commodity, these concepts will surely make a big difference to your business.

Amazon book link – Game Based Marketing

Amazon EC2 Outage – Failures, Lessons and Cloud Deployments

Now that we’ve had a chance to take a deep breath after last week’s AWS outage, I’ll offer some comments of my own.  Hopefully just enough time has passed to begin to have a broader view, and put events in perspective.
Despite what some reports may have announced, Amazon wasn’t down, but rather a small part of Amazon Web Services went down.  A failure, yes.  Beyond their service level agreement of 99.95% yes also.  Survivable, yes to this last question too.

Learning From Failure

The business management conversation du jour is all about learning from failure, rather than trying to avoid it.  Harvard Business Review’s April issue headlined with “The Failure Issue – How to Understand It, Learn From It, and Recover From It”.  The economist’s April 16th issue had some similarly interesting pieces one by Schumpeter “Fail often, fail well”,
and another in April 23rd issue “Lessons from Deepwater Horizon and Fukushima”.
With all this talk of failure there is surely one takeaway.  Complex systems will fail and it is in the anticipation of that failure that we gain the most.  Let’s stop howling and look at how to handle these situations intelligently.

How Do You Rebuild A Website?

In the cloud you will likely need two things.  (a) scripts to rebuild all the components in your architecture, spinup servers, fetch source code, fetch software and configuration files, configure load balancers and mount your database and more importantly (b) a database backup from which you can rebuild your current dataset.

Want to stick with EC2, build out your infrastructure in an alternate availability zone or region and you’re back up and running in hours.  Or better yet have an alternate cloud provider on hand to handle these rare outages.  The choice is yours.

Mitigate risk?  Yes indeed failure is more common in the cloud, but recovery is also easier.  Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!

Want to see an extreme example of how this can play in your favor?  Read Jeff Atwood’s discussion of so-called Chaos Monkey, a component whose sole job it is to randomly kill off servers in the Netflix environment at random.  Now that type of gunslinging will surely keep everyone on their toes!  Here’s a Wired article that discusses Chaos Monkey.

George Reese of enStratus discusses the recent failure at length.  The I would argue calling Amazon’s outage the Cloud’s Shing Moment, all of his points are wisened and this is the direction we should all be moving.

Going The Way of Commodity Hardware

Though it is still not obvious to everyone, I’ll spell it out loud and clear.  Like it or not, the cloud is coming.  Look at these numbers.

Furthermore the recent outage also highlights how much and how many internet sites rely on cloud computing, and Amazon EC2.
Way back in 2001 I authored a book on O’Reilly called “Oracle and Open Source”.  In it I discussed the technologies I was seeing in the real world.  Oracle on the backend and Linux, Apache, and PHP, Perl or some other language on the frontend.  These were the technologies that startups were using.  They were fast, cheap and with the right smarts reliable too.

Around that time Oracle started smelling the coffee and ported it’s enterprise database to Linux.  The equation for them was simple.  Customers that were previously paying tons of money to their good friend and confidant Sun for hardware, could now spend 1/10th as much on hardware and shift a lot of that left over cash to – you guessed it Oracle!  The hardware wasn’t as good, but who cares because you can get a lot more of it.

Despite a long entrenched and trusted brand like Sun being better and more reliable, guess what?  Folks still switched to commodity hardware.  Now this is so obvious, no one questions it.  But the same trend is happening with cloud computing.

Performance is variable, disk I/O can be iffy, and what’s more the recent outage illustrates front and center, the servers and network can crash at any moment.  Who in their right mind would want to move to this platform?

If that’s the question you’re stuck on, you’re still stuck on the old model.  You have not truely comprehended the power to build infrastructure with code, to provision through automation, and really embrace managing those components as software.  As the internet itself has the ability to route around political strife, and network outages, so too does cloud computing bring that power to mom & pop web shops.

Conclusions

  • Have existing investments in hardware?  Slow and cautious adoption makes most sense for you.
  • Have seasonal traffic variations?  An application like this is uniquely suited to the cloud.  In fact some of the gaming applications which can autoscale to 10x or 100x servers under load, are newly solveable with the advent of cloud computing.
  • Are you currently paying a lot for disaster recovery systems that primarily lay idle.  Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.