
iHeavy Insights 79 – Plumbing the Interwebs

I meet new people all the time.  It's a way of life in New York.  One of the first questions new acquaintances ask each other is "What do you do?".  It begins to sound like a cliché after a while, but it can also spark endlessly fascinating discussions, as there are so many people with different professions in New York.  Some give a titled answer: "I'm an investment banker", "I'm an emcee", "I'm an executive recruiter".  I find that "Web Scalability Consultant" or "Web Operations Expert" only draws confused looks.

A Plumber By Another Name

The solution of course is to tell a good story.  Stories illustrate what titles and crusty vernacular cannot.  I've used analogies to surgeons or mechanics, but they operate on things people can relate to, right in front of them: the people and vehicles we see every day.  The internet, by contrast, rests on a huge hidden infrastructure that most people never see.  They may vaguely know it's there, but it's still hidden out of sight.

That's why I think plumbing provides such an apt visual.  As it turns out, the internet is built of countless data pipes both large and small, coming into your home or lying across the bottom of the Atlantic.  These pipes plug into routers, the high-speed traffic lights and traffic cops of the network.  Ultimately they feed into datacenters, huge rooms filled with racks of computers holding your website's crown jewels.  There sit the images and status updates from your Facebook profile, the transactions from your bank account or credit card, your Netflix movie stream, and the email you sent via Gmail.  Even your instant messaging stream and the data from your favorite iPhone app are stored and retrieved from here.

Amazon Outage

The recent Amazon outage was high profile enough that a lot of folks who don't follow the latest trends in web operations, devops, and datacenter automation still heard about it.  It turns out to have had a silver lining for Amazon, because now everyone is scrutinizing how many sites actually rely on this goliath of a hosting provider.

As it turns out, the root of the Amazon outage was indeed a plumbing problem.  Amazon has shown admirable transparency, publishing intimate details of the problem and its resolution.  Read more.

A misconfigured network cascaded failures through the system.  When water repairs are done in a large New York City building, the landlord often asks tenants to shut off their taps, so they don't all come back on at the same time when service is restored.  Similarly intricate problems complicated the Amazon effort, slowing down attempts to restore everything after the incident.  I wrote at length about the outage if you're interested; read more.

BOOK REVIEW:  Game-Based Marketing by Zicherman & Linder

There are so many new books coming out all the time that it's tough to sift through them and find the good ones.  Anyone with a website as their storefront, whether a product company or a services company, can gain from reading this book.

From leaderboards to frequent flyer programs, badges and more, this book is full of real-world examples where game-based principles are put into action.  On an internet where attention is a rarer and rarer commodity, these concepts will surely make a big difference to your business.

Amazon book link – Game-Based Marketing

Amazon EC2 Outage – Failures, Lessons and Cloud Deployments

Now that we've had a chance to take a deep breath after last week's AWS outage, I'll offer some comments of my own.  Hopefully just enough time has passed to allow a broader view, and put events in perspective.
Despite what some reports may have announced, Amazon wasn't down; rather, a small part of Amazon Web Services went down.  A failure?  Yes.  Beyond their service level agreement of 99.95%?  Yes to that too.  Survivable?  Also yes.

Learning From Failure

The business management conversation du jour is all about learning from failure, rather than trying to avoid it.  Harvard Business Review's April issue headlined with "The Failure Issue – How to Understand It, Learn From It, and Recover From It".  The Economist's April 16th issue had some similarly interesting pieces, one by Schumpeter, "Fail often, fail well", and another in the April 23rd issue, "Lessons from Deepwater Horizon and Fukushima".
With all this talk of failure there is surely one takeaway: complex systems will fail, and it is in anticipating that failure that we gain the most.  Let's stop howling and look at how to handle these situations intelligently.

How Do You Rebuild A Website?

In the cloud you will likely need two things: (a) scripts to rebuild all the components in your architecture (spin up servers, fetch source code, fetch software and configuration files, configure load balancers, and mount your database) and, more importantly, (b) a database backup from which you can rebuild your current dataset.
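
Here's a minimal sketch of such a spinup script in Python, assuming the boto3 AWS library; the AMI id, key pair, and bootstrap.sh user-data file are placeholders to swap for your own:

    import boto3  # AWS SDK for Python

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch a server from a pre-baked image; the user-data script fetches
    # source code, software, and configuration files on first boot.
    response = ec2.run_instances(
        ImageId="ami-0abcdef1234567890",       # placeholder AMI id
        InstanceType="m1.large",
        MinCount=1,
        MaxCount=1,
        KeyName="deploy-key",                  # placeholder key pair
        UserData=open("bootstrap.sh").read(),
    )
    print("spun up", response["Instances"][0]["InstanceId"])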

Want to stick with EC2?  Build out your infrastructure in an alternate availability zone or region and you're back up and running in hours.  Or better yet, have an alternate cloud provider on hand to handle these rare outages.  The choice is yours.

Mitigate risk?  Yes indeed.  Failure is more common in the cloud, but recovery is also easier.  Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!

Want to see an extreme example of how this can play in your favor?  Read Jeff Atwood's discussion of the so-called Chaos Monkey, a component whose sole job is to randomly kill off servers in the Netflix environment.  Now that type of gunslinging will surely keep everyone on their toes!  Here's a Wired article that discusses Chaos Monkey.

George Reese of enStratus discusses the recent failure at length, arguing that Amazon's outage was the cloud's shining moment.  His points are wise ones, and this is the direction we should all be moving.

Going The Way of Commodity Hardware

Though it is still not obvious to everyone, I'll spell it out loud and clear: like it or not, the cloud is coming.  Look at these numbers.

Furthermore, the recent outage also highlights how much, and how many, internet sites rely on cloud computing and Amazon EC2.
Way back in 2001 I authored a book for O'Reilly called "Oracle and Open Source".  In it I discussed the technologies I was seeing in the real world: Oracle on the backend, and Linux, Apache, and PHP, Perl, or some other language on the frontend.  These were the technologies that startups were using.  They were fast, cheap, and with the right smarts, reliable too.

Around that time Oracle started smelling the coffee and ported its enterprise database to Linux.  The equation for them was simple.  Customers that were previously paying tons of money to their good friend and confidant Sun for hardware could now spend a tenth as much on hardware and shift a lot of that leftover cash to, you guessed it, Oracle!  The hardware wasn't as good, but who cares, because you can get a lot more of it.

Despite a long-entrenched and trusted brand like Sun being better and more reliable, guess what?  Folks still switched to commodity hardware.  Now this is so obvious that no one questions it.  But the same trend is happening with cloud computing.

Performance is variable, disk I/O can be iffy, and what's more, the recent outage illustrates front and center that the servers and network can crash at any moment.  Who in their right mind would want to move to this platform?

If that's the question you're stuck on, you're still stuck on the old model.  You have not truly comprehended the power of building infrastructure with code, provisioning through automation, and managing those components as software.  Just as the internet itself can route around political strife and network outages, so too does cloud computing bring that power to mom & pop web shops.

Conclusions

  • Have existing investments in hardware?  Slow and cautious adoption makes the most sense for you.
  • Have seasonal traffic variations?  Applications like these are uniquely suited to the cloud.  In fact some gaming applications, which can autoscale to 10x or 100x as many servers under load, are newly solvable with the advent of cloud computing.
  • Currently paying a lot for disaster recovery systems that mostly lie idle?  Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.

Cloud Computing Use Cases

Cloud Computing may not make sense for all application types.  But as with the adoption of commodity hardware and Linux over a decade ago, economic considerations will continue to pressure adoption.

This article is part of a multi-part series Intro to EC2 Cloud Deployments

What types of applications do fit well in the cloud?

  • Applications with Seasonal Traffic Patterns
  • Proof-of-Concept Applications
  • Quick Temporary Dev & Test Environments
  • CPU Intensive Applications
  • On-Demand or Unknown Future Demand

Seasonal Traffic Patterns

Web applications often show the following traffic pattern: traffic is steady for weeks or months, then experiences a spike.  That spike may be due to the launch of a new product or service, a new marketing or advertising campaign, or sudden user interest.  Inevitably you'll need more servers and compute power to handle that spike.  That is your peak capacity requirement.

With traditional servers you would need to buy enough servers, or big enough ones, to support that load, or else suffer outages.  What's more, you'd have to plan in advance in order to have those servers online and integrated into the web infrastructure.

With cloud computing, you already have spinup scripts for your server types, and can bring additional compute power online with only a few commands.  Even better, with AWS Autoscaling you can define rules to have new servers spin up for you automatically!
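
As a sketch of what those rules might look like with the boto3 library (the group name, launch configuration, and sizes here are hypothetical):

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Assumes a launch configuration named "web-lc" has already been created.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=2,    # steady-state fleet
        MaxSize=20,   # ceiling for traffic spikes
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )

    # Add two servers each time this policy fires, e.g. from a CloudWatch alarm.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-asg",
        PolicyName="scale-up-on-load",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=2,
    )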

Proof-of-Concept Applications

If you're in the process of testing a new business idea or internet startup, you may not have the budget to order all sorts of heavy iron to support it.  Cloud computing complements this type of requirement very nicely.  You need dev servers?  Voilà, they're up and running, quickly and cheaply.  You may not know what you'll need in six months, or whether your idea will take off, and you don't have to risk a big purchase.  Buy only what you need.

Dev and Test Environments

Another application type that complements cloud computing well is dev and test environments.  You may want to clone your production servers, or bring up a temporary test environment with all of the same components as production.  But you don't need that setup all of the time.  Just bring the servers online when you need them and stop them when you're done testing.  You won't incur instance charges while the servers are stopped, and the server images will remain resident on your EBS snapshots!
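
For instance, parking and reviving a test environment with boto3 (the instance ids are placeholders):

    import boto3

    ec2 = boto3.client("ec2")
    test_env = ["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholder ids

    # Stop the test servers when testing wraps up; EBS-backed instances keep
    # their volumes, so you pay only for storage while they're stopped.
    ec2.stop_instances(InstanceIds=test_env)

    # Bring the very same servers back the next time you need them.
    ec2.start_instances(InstanceIds=test_env)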

CPU Intensive Applications

Server farms are used for all sorts of applications, such as SETI or the Human Genome Project.  These applications require legions of servers working together to churn through large amounts of data.  Being CPU-intensive, they are uniquely suited to cloud computing.  Once you are done, you can easily decommission all of those servers.

Online gaming is another CPU-intensive application.  As users access Facebook applications such as Farmville, it's hard to know in advance what the demands will be from day to day.  Enabling a feature like AWS Autoscaling means the compute power does a lot of the capacity planning for you, responding dynamically to need.  We wrote a piece on autoscaling MySQL databases.

On-Demand or Unknown Future Requirements

Any other types of applications that have on-demand needs, and for which you don’t know what the future will look like, match cloud computing well.  You avoid the up-front costs of buying a whole rack of servers, and keep servers offline when they’re not busy.

Hey you… made it this far?  Grab our newsletter – Scalable Startups.

iHeavy Insights 78 – Degrade Gracefully

Your recent social media campaign has gone viral.  It's what you've been dreaming about, pinning your hopes on, and all of your hard work is now coming to fruition.  Tens of thousands of internet users, hordes of them in fact, are now descending on your website.  Only one problem: it went down!!

That's a situation you want to avoid.  Luckily there are some best practices for avoiding scenarios like the one I described.  In engineering it's termed "degrade gracefully": continue functioning, but with the heaviest features disabled.

Browsing Only, But Still Functioning

One way to do this is for your site to have a browsing-only mode.  On the database side you can still function with a read-only database.  With a switch like that, your site will continue to work while pointed at any of your read-only replication slaves.  What's more, you can load balance across those easily, and keep your site up and running.
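
A toy sketch of such a switch in Python; the hostnames are placeholders, and a real implementation would live in your data-access layer:

    import random

    MASTER = "db-master.example.com"
    READ_SLAVES = ["db-slave-1.example.com", "db-slave-2.example.com"]

    BROWSE_ONLY = True  # flip this when the master is down or under repair

    def pick_database(is_write):
        """Send writes to the master, spread reads across the slaves."""
        if is_write:
            if BROWSE_ONLY:
                raise RuntimeError("browsing-only mode: writes are disabled")
            return MASTER
        return random.choice(READ_SLAVES)  # naive load balancing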

Decoupling

In software development, decoupling involves breaking apart components or pieces of an application that should not depend on one another.  One way to do this is to use a queuing system, such as Amazon's SQS, to let pieces of the application queue up work to be done.  This makes those pieces asynchronous, i.e. they'll return right away.  Another way is to expose services internal to your site through web services.  These individual components can then be scaled out as needed.  This makes them more highly available, and reduces the need to scale your memcache, webservers or database servers, the hardest pieces to scale.
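
Here's the SQS pattern as a short boto3 sketch; the queue name and message body are made up for illustration:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.create_queue(QueueName="thumbnail-jobs")["QueueUrl"]

    # Producer: the web tier queues the work and returns to the user right away.
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"image_id": 42}')

    # Consumer: a worker process drains the queue on its own schedule.
    for msg in sqs.receive_message(QueueUrl=queue_url).get("Messages", []):
        # ... do the heavy lifting here, e.g. resize the image ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])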

Identify Features You Can Disable

Typically your application will have features that are superfluous, or not part of the core functionality.  Perhaps you have star ratings, or some other components that are heavy.  Work with the development and operations teams to identify the areas of the application that are heaviest, and that would warrant disabling if the site hits heavy storms.

Once you've done all that, document how to disable and re-enable those features, so other team members can flip the switches if necessary.
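
Those switches can be as simple as a feature-flag map.  A sketch, with hypothetical flag names; in practice the map would live in a config file or key-value store so ops can flip it without a deploy:

    # Heavy, non-core features that ops can disable under load.
    FEATURES = {
        "star_ratings": True,
        "related_items": True,
    }

    def feature_enabled(name):
        return FEATURES.get(name, False)

    def render_product_page(product):
        page = [product.description]
        if feature_enabled("star_ratings"):  # skipped when flipped off
            page.append(product.ratings_widget())
        return "\n".join(page)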


Cloud Computing – Disciplined Deployments

With traditional managed hosting solutions, we have best practices, business continuity plans, and disaster recovery; we document our processes and all the moving parts in our infrastructure.  At least we pay lip service to these goals, though from time to time we admit to getting sidetracked with bigger fish to fry, high priorities, and the emergency of the day.  We add "firedrill" to our todo list, promising we'll test restoring our backups.  But many times we find it is only in an actual emergency that we discover whether we have all the pieces backed up and can reassemble them properly.

** Original article — Intro to EC2 Cloud Deployments **

Cloud Computing is different.  These goals are no longer lofty ideals, but must be put into practice.  Here's why.

  1. Virtual servers are not as reliable as physical servers
  2. Amazon EC2 has a lower SLA than many managed hosting providers
  3. Devops introduces a new paradigm: infrastructure scripts can be version controlled
  4. The EC2 environment really demands scripting and repeatability
  5. New flexibility and peace of mind

Unreliable Servers

EC2 virtual servers can and will die.  Your spinup scripts and infrastructure should treat this possibility not as some far-off anomalous event, but as a day-to-day concern.  With proper scripts and testing of various scenarios, this becomes manageable.  Use snapshots to back up EBS root volumes, and build spinup scripts with AMIs that have all the components your application requires.  Then test, test, and test again.
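
Snapshotting an EBS root volume is nearly a one-liner with boto3; the volume id below is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # placeholder volume id
        Description="nightly backup of web server root volume",
    )
    print("snapshot started:", snap["SnapshotId"])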

Amazon EC2's SLA – Only 99.95%

The computing industry throws the 99.999%, or five-nines, uptime standard around a lot.  That amounts to less than six minutes of downtime per year.  Amazon's 99.95% allows for 263 minutes of downtime per year; greater downtime merely gets you a credit on your account.  With that in mind, repeatable processes and scripts to bring your infrastructure back up in different availability zones or even different datacenters are a necessity.  Along with your infrastructure scripts, offsite backups also become a wise choice.  You should further take advantage of availability zones and regions to make your infrastructure more robust.  By using private IP addresses and networks, you can host a MySQL database slave in a separate zone, for instance.  You can also do GDLB, or Geographically Distributed Load Balancing, to send customers on the west coast to one zone, and those on the east coast to another closer to them.  In the event that one region or availability zone goes out, your application is still responding, though perhaps with slightly degraded performance.

Devops – Infrastructure as Code

With traditional hosting, you either physically manage all of the components in your infrastructure or have someone do it for you; either way, a phone call is required to get things done.  With EC2, every piece of your infrastructure can be managed from code, so your infrastructure itself can be managed as software.  Whether you use waterfall or agile as your software development lifecycle, you have the new flexibility to place all of these scripts and configuration files in version control.  This raises the manageability of your environment tremendously.  It also provides a kind of ongoing documentation of all the moving parts.  In a word, it forces you to deliver on all of those best practices you've been preaching over the years.

EC2 Environment Considerations

When servers get restarted they get new IP addresses, both private and public.  This may affect configuration files for webservers and mail servers, and database replication too, for example.  Your new server may mount an external EBS volume containing your database.  If that's the case, your start scripts should check for that volume, and not start MySQL until they find it.  To further complicate things, you may choose to use software RAID over a handful of EBS volumes to get better performance.
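
A start-script sketch in Python illustrating that check; the mount point and service name are assumptions:

    import os
    import subprocess
    import time

    DATA_MOUNT = "/var/lib/mysql"  # assumed mount point of the EBS data volume

    # Don't start MySQL against an empty directory: wait up to five minutes
    # for the external volume to attach and mount.
    for _ in range(30):
        if os.path.ismount(DATA_MOUNT):
            subprocess.check_call(["service", "mysql", "start"])
            break
        time.sleep(10)
    else:
        raise SystemExit("EBS data volume never mounted; not starting MySQL")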

The more special cases you have, the more quickly you realize how important it is to manage these things in software.  The more the process needs to be repeated, the more the scripts will save you time.

New Flexibility in the Cloud

Ultimately, if you take into consideration less reliable virtual servers, and mitigate that with zones, regions, and automated scripts, you can then enjoy all the new benefits of the cloud:

  • autoscaling
  • easy test & dev environment setup
  • robust load & scalability testing
  • vertically scaling servers in place – in minutes!
  • pause a server – incurring only storage costs for days or months as you like
  • cheaper costs for applications with seasonal traffic patterns
  • no huge up-front costs

MySQL Cluster In The Cloud – Managers Guide

The term clustering is often used loosely in the context of enterprise databases.  In relation to MySQL in the cloud you can configure:

  1. Master-master active/passive
  2. Sharded MySQL Database
  3. NDB Cluster

Master-Master active/passive replication

Also sometimes known as circular replication, this is used for high availability.  You can perform operations on the inactive node (backups, ALTER TABLEs or other slow operations), then switch roles so the inactive node becomes active.  You would then perform the same operations on the former master.  Applications see "zero downtime" because they are always pointing at the active master database.  In addition, the inactive master can be used as a read-only slave to run SELECT queries and large reporting queries.  This is quite powerful, as typical web applications tend to have 80% or more of their work performed by read-only queries: browsing, viewing, and verifying data and information.
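
A sketch of the role switch using the mysql-connector-python library; the hostnames and credentials are placeholders, and a production switchover would also verify that replication has caught up first:

    import mysql.connector

    ACTIVE, INACTIVE = "db-a.example.com", "db-b.example.com"  # placeholder hosts

    def switch_roles(user, password):
        """Demote the active master and promote the inactive one."""
        for host, read_only in ((ACTIVE, 1), (INACTIVE, 0)):
            cnx = mysql.connector.connect(host=host, user=user, password=password)
            cnx.cursor().execute(f"SET GLOBAL read_only = {read_only}")
            cnx.close()
        # The application's connection settings are then repointed at INACTIVE.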

Sharded MySQL Database

This is similar to what the Oracle world calls "application partitioning".  In fact, before Oracle 10 most Parallel Server and RAC installations required you to do this.  For example, a user table might be sharded by putting names A-F on node A, G-L on node B, and so forth.

You can also achieve this somewhat transparently with user_ids.  MySQL has an autoincrement column type to handle serving up unique ids, and it has a cluster-friendly feature called auto_increment_increment.  So in an example where you had *TWO* nodes, all EVEN numbered IDs would be generated on node A and all ODD numbered IDs on node B.  The nodes would also replicate changes to each other, yet avoid collisions.
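
A sketch of the idea; the generator below mimics what MySQL does when each node sets auto_increment_increment and auto_increment_offset in its my.cnf:

    # my.cnf on node A: auto_increment_increment=2, auto_increment_offset=2
    # my.cnf on node B: auto_increment_increment=2, auto_increment_offset=1

    def id_stream(offset, increment=2):
        """Mimic one node's auto-increment sequence."""
        value = offset
        while True:
            yield value
            value += increment

    node_a, node_b = id_stream(2), id_stream(1)
    print([next(node_a) for _ in range(3)])  # [2, 4, 6]  even ids on node A
    print([next(node_b) for _ in range(3)])  # [1, 3, 5]  odd ids on node B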

Obviously all this has to be done with care, as the database is not otherwise preventing you from doing things that would break replication and your data integrity.

One further caution with sharding your database is that although it increases write throughput by horizontally scaling the master, it ultimately reduces availability.   An outage of any server in the cluster means at least a partial outage of the cluster itself.

NDB Cluster

This is actually a storage engine, and can be used in conjunction with InnoDB and MyISAM tables.  Normally you would use it sparingly, for a few special tables, providing availability and read/write access across multiple masters.  It is decidedly *NOT* like Oracle RAC, though many mistake it for that technology.
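
For instance, a single hot table can be placed on the NDB engine while the rest of the schema stays put.  A sketch with mysql-connector-python; the host, credentials, and table are hypothetical:

    import mysql.connector

    cnx = mysql.connector.connect(
        host="sql-node.example.com",  # a MySQL (SQL) node in the cluster
        user="app",
        password="secret",            # placeholder credentials
        database="mydb",
    )
    cnx.cursor().execute("""
        CREATE TABLE session_tokens (
            id INT AUTO_INCREMENT PRIMARY KEY,
            token VARCHAR(64) NOT NULL
        ) ENGINE=NDBCLUSTER
    """)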

MySQL Clustering In The Cloud

By far the most common MySQL cluster configuration we see in the Amazon EC2 environment is the master-master configuration described above.  By itself it provides higher availability of the master node, and a single read-only node across which you can horizontally scale your application queries.  What's more, you can add additional read-only slaves to this setup, allowing you to scale out tremendously.

Migrating MySQL to Oracle Guide

Also find Sean Hull’s ramblings on twitter @hullsean.

Migrating from MySQL to Oracle can be as complex as picking up your life and moving from the country to the city.  Things in the MySQL world are often just done differently than they are in the Oracle world.  Our guide will give you a bird's-eye view of the differences, to help you determine the right path for you.

** See also: Oracle to MySQL Migration Considerations **

MySQL comes from a more open-source, do-it-yourself background, one of Unix and Linux administrators, and even developers, carrying the responsibilities of a DBA.

  1. Installation & Administration Considerations
  2. Query and Optimizer Differences
  3. Security Strengths and Weaknesses
  4. Replication & High Availability
  5. Table Types & Storage Engines
  6. Applications, Connection Pooling, Stored Procedures and More
  7. Backups & Disaster Recovery
  8. Community – MySQL & Oracle Differences
  9. TCO, Licensing, and Cloud Considerations
  10. Advanced Oracle Features – Missing in MySQL

Check back soon as we update each of these sections.

Oracle to MySQL Migration Considerations

There are a lot of forms of transportation, from walking to bike riding, motorcycles and cars to buses, trains and airplanes.  Each mode of transport will get you from point A to point B, but one may be faster, another more comfortable, and another more cost effective.  It's important to keep in mind when comparing databases like Oracle and MySQL that there are indeed a lot of feature differences, a lot of cultural differences, and a lot of cost differences.  There are also a lot of impassioned people on both sides arguing at the tomfoolery of the other.  Hopefully we can dispel some of the myths and discuss the topic fairly.

** See also: Migrating MySQL to Oracle Guide **

As a long-time Oracle DBA turned MySQL expert, I've spent time with clients running both database engines, and with many migrating from one to the other.  I can speak to many of the differences between the two environments.  I'll cover the following:

  1. Query & Optimizer Limitations
  2. Security Differences
  3. Replication & HA Are Done Differently
  4. Installation & Administration Simplicity
  5. Watch Out – Triggers, Stored Procedures, Materialized Views & Snapshots
  6. Huge Community Support – Open-source Add-ons
  7. Enter The Cloud With MySQL
  8. Backup and Recovery
  9. Miscellaneous Considerations

Check back again as we edit and publish the various sections above.

iHeavy Insights 77 – What Consultants Do

 

What Do Consultants Do?

Consultants bring a whole host of tools and experiences to bear on solving your business problems.  They can fill a need quickly, look in the right places, reframe the problem, get teams communicating and working together, and bring to light problems on the horizon.  And they tell stories of challenges they faced at other businesses, and how they solved them.

Frame or Reframe The Problem

Oftentimes businesses see the symptoms of a larger problem, but not the cause.  Perhaps their website is sluggish at key times, causing them to lose customers.  Or perhaps it is locking up inexplicably.  Framing the problem may involve identifying the bottleneck and pointing to a particular misconfigured option in the database or webserver.  Or it may mean looking at the technical problem you've chosen to solve, and asking whether it meets or exceeds what the business needs.

Tell Business Stories

Clients often have a collection of technologies and components in place to meet their business needs.  But the day-to-day running of a business is ultimately about bringing a product or service to your customer.  Telling stories of the challenges and solutions of past customers helps illustrate, educate, and communicate the problems you're facing today.

Fill A Need Quickly

If you have an urgent problem and your current staff is overextended, bringing in a consultant to solve a specific problem can be a net gain for everyone.  They get up to speed quickly, bring fresh perspectives, and review your current processes and operations.  What's more, they can be used in a surgical way, to augment your team for a short stint.

Get Teams Communicating

I've worked at quite a number of firms over the years, tasked with solving a specific technical problem, only to find the problem was a people problem to begin with.  In some cases the firm already has the knowledge and expertise to solve the problem, but some members are blocking.  That can be because some folks feel threatened by a new solution that will take away responsibilities they formerly held.  Or it can be because they feel a solution will create new problems that they will then be responsible for cleaning up.  In either case, bridging the gap between business needs and the operations teams that serve them can mean communicating with each team in the ways that make sense to it: a detail-oriented technical focus when working with the engineering teams, and a business, bottom-line focus when communicating with the management team.

Highlight Or Bring To Light Problems On The Horizon

Is our infrastructure a ticking time bomb?  Have our backups been tested, or are they missing some crucial component?  Have we overlooked some security consideration, left some password unset, left the proverbial gate to the castle open?  When you deal with your operations on a day-to-day basis, little details can be easy to miss.  A fresh perspective can bring needed insight.

BOOK REVIEW – Jaron Lanier – You Are Not a Gadget

Lanier is a programmer, a musician, the father of VR way back in the '90s, and a wide-ranging thinker on topics in computing and the internet.

His new book is a great, if at times meandering, read on technology, programming, schizophrenia, inflexible design decisions, Marxism, finance transformed by the cloud, obscurity & security, logical positivism, strange loops, and more.

He opposes the thinking du jour among computer scientists, leaning in a more humanist direction, summed up here: "I believe humans are the result of billions of years of implicit, evolutionary study in the school of hard knocks."  The book is worth a look.