
Point-in-time Recovery – What is it and why is it important?

Web-facing database servers receive a barrage of activity 24 hours a day. Sessions are managed for users logging in, ratings are clicked, and comments are added. Web-based e-commerce applications are even more complex. All of this activity is organized into small, discrete sets of changes called transactions. If you're editing a word processing document, it might autosave every five minutes; Excel offers a similar feature, along with a built-in mechanism to undo and redo recent edits. These are all analogous to transactions in a database.

Transactions are important because they are all written to logfiles. That is what makes replication possible: those changes can be replayed on another database server downstream.

If you have lost your database server because of hardware failure or instance failure in EC2, you'll be faced with the challenge of restoring it. How is this accomplished? The first step is to restore from the last full backup you have, perhaps a full database dump that you perform every day late at night. Great, now you've restored to 2am. How do you get the rest of your data?

That is where point-in-time recovery comes in. Since those transactions were being written to your transaction logs, all the changes made to your database since the last full backup can be reapplied. In MySQL this transaction log is called the binlog, and there is a mysqlbinlog utility that reads the transaction log files and replays those statements. You'll tell it the start time, in this case 2am when the backup happened, and the end time, which is the point in time you want to recover to. That will likely be the moment you lost your database server hardware.
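As a concrete sketch, assuming the backup finished at 2am and the server was lost in the early afternoon (the timestamps, binlog file names, and credentials below are illustrative only), the replay might look like this:

```bash
# Point-in-time recovery sketch; timestamps, binlog file names,
# and credentials are illustrative only.
mysqlbinlog \
  --start-datetime="2012-03-07 02:00:00" \
  --stop-datetime="2012-03-07 14:14:00" \
  mysql-bin.000142 mysql-bin.000143 \
  | mysql -u root -p
```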

Point-in-time recovery is crucial to high availability, so be sure to back up your binlogs right alongside the full database backups you take every night. If you lose the server or disk that the database is hosted on, you'll want an alternate copy of those binlogs available for recovery!
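A minimal sketch of that nightly routine, assuming your binlogs live in /var/lib/mysql and you keep an offsite S3 bucket (both hypothetical here):

```bash
# Nightly backup sketch; paths and the bucket name are hypothetical.
# --flush-logs rotates to a fresh binlog so replay can start cleanly
# from the dump.
mysqldump --all-databases --single-transaction --flush-logs --master-data=2 \
  -u root -p > /backups/full-$(date +%F).sql
# Ship the binlogs off the database host as well.
aws s3 sync /var/lib/mysql/ s3://my-db-backups/binlogs/ \
  --exclude "*" --include "mysql-bin.*"
```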

Quora discussion on Point-in-time Recovery by Sean Hull

Root Cause Analysis – What is it and why is it important?

Root Cause Analysis is the means of identifying the ultimate source and cause of an outage. When an outage causes serious downtime of a website, organizations are typically in crisis mode, and the urgency of resolution sometimes pushes aside due process, change management, and general caution. Root Cause Analysis attempts, as much as possible, to preserve logfiles, configurations, and the current state of systems for later analysis.

With traditional physical servers, hardware failure, operator error, or a security breach can cause outages. Since you're dealing with one physical machine, resolving the issue necessarily means disturbing the very things that broke, so caution and later analysis must be balanced against immediate problem resolution.

Another silver lining of cloud-hosted solutions is around root cause analysis. If a server is breached, for example, it can be shut down immediately while maintaining its current state as a disk or EBS snapshot. A new server can then be fired up from an AMI, your server rebuilt from scripts or a template, and you're back up and running. Save the snapshot for later analysis.
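With the AWS command line tools that might look like the sketch below; the instance, volume, and AMI IDs are placeholders:

```bash
# Evidence-preservation sketch; all IDs are placeholders.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "web01 breached - preserved for root cause analysis"
# Fire up a clean replacement from your AMI and rebuild from scripts.
aws ec2 run-instances --image-id ami-0123456789abcdef0 \
  --count 1 --instance-type m3.large
```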

The same approach can be used for analyzing outages caused by operator error. Hardware failures are expected and common in cloud-hosted environments, which should, and really must, push adoption of infrastructure best practices: having scripts at hand that rebuild everything from bare metal.

More discussion of root cause analysis by Sean Hull on Quora.

Sharding – What is it and why is it important?

Sharding is a way of partitioning your datastore to benefit from the computing power of more than one server. For instance, many web-facing databases get sharded on user_id, the unique serial number your application assigns to each user on the website.

Sharding can bring you the advantages of horizontal scalability by dividing data across multiple backend databases, which can yield tremendous performance improvements.
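At its simplest, the routing logic is just a modulo on user_id; this sketch uses hypothetical host names:

```bash
# Shard-routing sketch; host names are hypothetical.
SHARDS=(db1.example.com db2.example.com db3.example.com db4.example.com)
user_id=123456
shard=${SHARDS[$(( user_id % ${#SHARDS[@]} ))]}
echo "queries for user $user_id go to $shard"
```

Note that a plain modulo makes adding shards painful, since most keys remap when the shard count changes; many shops use a directory or lookup table instead.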

Sharding, however, has a number of important costs:

  • reduced availability
  • higher administrative complexity
  • greater application complexity

High availability is a goal of most web applications, as they aim for always-on, 24x7x365 availability. By introducing more servers, you have more components that must work flawlessly. If the expected downtime of any one backend database is half an hour per month and you shard across five servers, your expected downtime has increased roughly fivefold, to 2.5 hours per month, since an outage on any one shard affects the application.

Administrative complexity is an important consideration as well. More databases means more servers to back up, more complex recovery, more complex testing, more complex replication, and more complex data integrity checking.

Since sharding spreads chunks of your data across different servers, your application must accept the burden of deciding where each piece of data lives and fetching it from there. In some cases the application must fall back to an alternate plan if it cannot find the data where it expects. All of this increases application complexity and is important to keep in mind.

Sean Hull asks on Quora – What is Sharding and why is it important?

Zero Downtime – What is it and why is it important?

For most large web applications, uptime is of foremost importance. Any outage can be seen by customers as a frustration, or an opportunity to move to a competitor. What's more, for a site that also includes e-commerce, it can mean real lost sales.

Zero downtime describes a site without service interruption. To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure. If you're using cloud hosting, are you redundant across alternate availability zones and regions? Are you using geographically distributed load balancing? Do you have multiple clustered databases on the backend, and multiple load-balanced webservers?

All of these measures will increase uptime, but they may not bring you close to zero downtime. For that you'll need thorough testing. The solution is to deliberately pull the trigger on sections of your infrastructure and prove that they fail over quickly without a noticeable outage. The ultimate test is the outage itself.
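A drill can be as blunt as hard-stopping one node and watching your health checks recover; the instance ID and URL below are placeholders:

```bash
# Failover drill sketch; the instance ID and health URL are placeholders.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# Confirm the load balancer routes around the dead node.
watch -n 5 'curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/health'
```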

Sean Hull on Quora: What is zero downtime and why is it important?

High Availability – What is it and why is it important?

Highly available systems build redundancy into the application and architecture layers to mitigate disasters. Since computing systems are made from commodity hardware and components that are prone to failure, having redundancy at every layer is key.

Redundancy of switches, network interfaces, and load-balanced webservers is fairly straightforward and run-of-the-mill. But clustering your database tier is another trick entirely. With MySQL, master-master active-passive can work quite well, using circular replication to send all changes to both nodes. Both nodes are able to handle production traffic, and you pick the active one simply by configuring your application to point to it. Use a technology like MMM or Pacemaker to front your database cluster with a virtual IP (VIP), so no application or webserver changes are required to switch which node takes on the master role.
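MMM or Pacemaker automate the role switch, but underneath it amounts to something like this sketch, with hypothetical node names:

```bash
# Manual role-switch sketch; node names are hypothetical, and in practice
# MMM or Pacemaker performs these steps for you.
mysql -h db-node-a -u admin -p -e "SET GLOBAL read_only = ON;"   # demote the old active node
mysql -h db-node-b -u admin -p -e "SET GLOBAL read_only = OFF;"  # promote the new one
# Then move the virtual IP so applications follow, e.g. under Pacemaker:
#   crm resource move db_vip db-node-b
```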

Redundant components are important in a single datacenter, but what if that datacenter goes out or gets hit by a natural disaster? Is your whole business out? That's where geographic redundancy and geo-load-balanced DNS come in. Having redundant copies of your whole site on both the east and west coasts with geo-DNS provides the next level of high availability.

Sean Hull discusses on Quora – What is High Availability and why is it important?

iHeavy Insights 78 – Degrade Gracefully

Your recent social media campaign has gone viral. It's what you've been dreaming about, pinning your hopes on, and all of your hard work is now coming to fruition. Tens of thousands of internet users, hordes of them in fact, are now descending on your website. Only one problem: it went down!

That's a situation you want to avoid. Luckily there are some best practices for avoiding scenarios like the one described above. In engineering it's termed "degrade gracefully": continue functioning, but with the heaviest features disabled.

Browsing Only, But Still Functioning

One way to do this is for your site to have a browsing-only mode. On the database side you can still function with a read-only database. With a switch like that, your site will continue to work while pointed at any of your read-only replication slaves. What's more, you can easily load balance across those slaves and keep your site up and running.
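Assuming the slaves are already replicating, the database side of the switch is small; the host name here is hypothetical:

```bash
# Browse-only sketch; the host name is hypothetical.
# Stop accepting writes on the master...
mysql -h db-master -u admin -p -e "SET GLOBAL read_only = ON;"
# ...then point the application's read traffic at the load-balanced
# slaves, e.g. by flipping a config flag the app checks (application-specific).
```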

Decoupling

In software development, decoupling means breaking apart components or pieces of an application that should not depend on one another. One way to do this is to use a queuing system, such as Amazon's SQS, to allow pieces of the application to queue up work to be done. This makes those pieces asynchronous, i.e. they return right away. Another way is to expose services internal to your site through web services. These individual components can then be scaled out as needed. That makes them more highly available and reduces the need to scale your memcache, webserver, or database tiers, the hardest ones to scale.
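With SQS, for instance, the producing side of the application simply enqueues a message and returns; the queue URL and payload here are illustrative:

```bash
# Queue-based decoupling sketch; the queue URL and payload are illustrative.
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs \
  --message-body '{"image_id": 42, "size": "small"}'
# A separate worker pool drains the queue at its own pace.
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs
```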

Identify Features You Can Disable

Typically your application will have features that are superfluous, or at least not part of the core functionality. Perhaps you have star ratings, or some other components that are heavy. Work with the development and operations teams to identify the areas of the application that are heaviest, and that would warrant disabling if the site runs into heavy weather.

Once you've done all that, document how to disable and re-enable those features, so other team members will be able to flip the switches if necessary.
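One low-tech sketch of such a switch is a flag file the application tests on each request; the paths and feature name are hypothetical:

```bash
# Feature-flag sketch; the flag directory and feature name are hypothetical.
touch /etc/myapp/flags/disable_ratings    # switch the star-ratings widget off
rm /etc/myapp/flags/disable_ratings       # switch it back on
# The application checks the flag before rendering the feature:
[ -e /etc/myapp/flags/disable_ratings ] && echo "ratings disabled"
```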
