Tag Archives: recovery

When fat fingers take down your business

Join 14,000 others and follow Sean Hull on twitter @hullsean.

Github goes nuclear

I was flipping through reddit last night, and hit this crazy story: strange pushes on GitHub. For those who don't know, GitHub hosts source code. It's version control for the software world. Lots of projects use it to keep track of change management.

Jenkins is a continuous integration platform. Someone working on the project accidentally did a force push up to the server. They overwrote not only their own work, but the work of hundreds of other plugins unrelated to their own project.

This is like doing a demolition to put up a new building, and taking down all the buildings on your block and the next. Not very neighborly, to say the least. At the time of this writing they're still doing cleanup, and digging through the rubble.

Read: Why DynamoDB can increase availability

How to kill a database

I worked at a startup a few years back that had an interesting business model. Users would sit and watch videos, and get paid for their time. Watch the video, note the code, enter the code, earn cash. Somehow the advertisers had found a way to make this work.

The whole infrastructure ran on Amazon EC2 servers, and was managed by Rightscale. Well, it was actually managed by a west coast outsourcing shop, whose specialty was managing deployments on Rightscale.

The site kept its information in a MySQL database. They had various scripts to spin up slaves, remaster, switch roles and so forth. Of course MySQL can be finicky and is prone to throwing surprises your way from time to time.

One time this automation failed in a big way, switching production customers over to a database that took way, way too long to rebuild. Because the automation didn't perform checksums to bulletproof the setup, it couldn't know that all the data hadn't finished moving!

Customers sure did notice, though, when the site fell over. Yes, this was a failure of automation. But it wasn't a failure of the Rightscale platform. It was a failure of the outsourcing firm managing the process, checking the pieces and components, and ensuring the computer systems did their thing to completion. Huge fail!

Read: Why devops talent is in short supply

Your website will fail

Sites big and small fail. Hopefully these stories illustrate that fact. I’ve said over and over why perfect availability is a pipe dream.

At the end of the day, the difference between the successful sites and the sloppy ones isn't failure versus perfection. It's *how* they fail, and how they get back up on their feet. What kind of planning did they do for disaster recovery, as many firms in NYC did before and after Sandy?

Also: Why startups need both devs and ops for scalability

Reducing failure

So instead of thinking about eliminating failure, let's think about *reducing* the chances of it happening, and, when it does, reducing the fallout. One thing you can do is sign up for Scalable Startups, where we share tips once a month on the topic. Meanwhile, try to put these best practices into play.

1. Test your DR plan by running real life fire drills
2. Use more than one hosting provider, data center or cloud provider
3. Give each op or end user the least privileges they need to do their job (a quick MySQL sketch follows this list)
4. Embrace a culture of caution in operations
5. Check, double check and triple check those fat fingers!
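On the MySQL side, for example, point 3 might look something like the following sketch. The account names, schema and network range are hypothetical, and the exact GRANT syntax varies by MySQL version:

$ mysql -u root -p -e "GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app_user'@'10.0.%' IDENTIFIED BY 'hypothetical_password';"
$ mysql -u root -p -e "GRANT SELECT ON app_db.* TO 'report_user'@'10.0.%' IDENTIFIED BY 'hypothetical_password';"

The application account gets only the DML it needs on its own schema, and the reporting user gets read-only access, nothing more.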

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don't work with recruiters.

5 Things You Overlooked with MySQL Dumps

1. Point In Time Recovery

If you've never done point-in-time recovery, it's time to take a second look. With a standard mysqldump you restore your database to the time when the backup happened. If you only do backups once a day, you can lose as much as 24 hours of data.

Enter point-in-time recovery, and you have the option to replay all those transactions that occurred since your backup last night. Those changes, INSERTs, UPDATEs, DELETEs & even ALTER TABLEs, are all stored in the binary logs. That's right, all those statements are preserved for you!
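Keep in mind the server has to be writing binary logs in the first place. Here's a minimal my.cnf sketch, with placeholder values:

[mysqld]
log_bin          = /var/lib/mysql/binlog
server_id        = 1
expire_logs_days = 7

The expire_logs_days setting just makes sure binlogs stick around long enough to cover your backup cycle.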

The tool to fetch those statements with is called mysqlbinlog. When you use it, you'll need to specify the binlog file & position to start at. Uh oh, you don't have those?

That's where the mysqldump option --master-data comes in. We recommend the --master-data=2 option. That includes these important nuggets of information in the dump file as a comment.

Now you’ll know where to start applying transactions. Simply dig this information out of the dump file.
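For example, a dump taken with --master-data=2 contains a commented-out CHANGE MASTER line near the top. Pulling it out is a one-liner (the filenames and coordinates here are hypothetical):

$ mysqldump --master-data=2 --single-transaction --all-databases > backup.mysql
$ grep -m 1 "CHANGE MASTER" backup.mysql
-- CHANGE MASTER TO MASTER_LOG_FILE='binlog.000005', MASTER_LOG_POS=12345;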

What I like to do is dump the statements into a text file first, then apply them.

Here’s how you fetch all the transactions since your dump:

$ mysqlbinlog --start-position=12345 /var/lib/mysql/binlog.000005 > my_point_in_time_data.mysql

Here’s how you would then apply them:

$ mysql < my_point_in_time_data.mysql

Also check out: Replication setup with no downtime - using hotbackups!

2. Locking & Consistency

MySQL can be confusing when it comes to concurrency and locking. Do we need to lock tables during a backup? Yes? No? Sometimes? The confusion arises because MyISAM tables are non-transactional, so sometimes locking all tables is required to get a consistent dump.


--lock-all-tables

This option is only required if your application works on tables across schemas. So if any one application hits multiple schemas and there are MyISAM tables in both, you'll need to use this option. It's equivalent to FLUSH TABLES WITH READ LOCK.


--lock-tables

This option locks all the tables in one schema, performs the backup, then moves on to the next schema. You only need it if you have MyISAM tables. It's on by default with the --opt flag.


--single-transaction

This is the least invasive way to do a dump of your data. However it only works if all your tables are InnoDB. Keep in mind it won't error out if you have MyISAM tables, but your backup will be inconsistent. The plus side is that from your application's perspective there are really no blocking locks happening. Very cool indeed!
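To make those three flavors concrete, here's roughly what each looks like on the command line (the schema names are made up):

$ mysqldump --lock-all-tables --all-databases > all_locked.mysql
$ mysqldump --lock-tables --databases shop blog > per_schema.mysql
$ mysqldump --single-transaction --all-databases > innodb_only.mysql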

For a more general guide see Accidental DBAs Guide to MySQL.

3. Backups of stored code

If you have stored procedures or functions, you're not capturing them in your dumps by default (triggers, by contrast, are included by default). Really! Use the --routines option to include them. Better yet, do a dump with just your stored code. This line will do so, and add a date timestamp for you. Be sure to have the [client] or [mysql] section in your .my.cnf file with user & password specified. Then you don't need them on the command line.

$ mysqldump --no-data --no-create-info --all-databases --routines > db_code_`/bin/date +%m%d%Y`.mysql
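That .my.cnf piece might look roughly like this, with placeholder credentials:

[client]
user     = backup_user
password = hypothetical_password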

Are you verifying your replication slaves? You should know they regularly drift out of sync without throwing errors. Read more on how to verify MySQL replication.
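One hedged example of how to check: the Percona Toolkit's pt-table-checksum tool, run against the master (the host, credentials and schema below are made up):

$ pt-table-checksum h=master.example.com,u=checksum_user,p=hypothetical_password --databases shop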

4. Separate schema & data dumps

Sometimes it's useful to have a schema-only backup. That is, a dump of the objects in your database (CREATE statements) without any of the data. Why, you ask? For starters you could use the unix "diff" command between a schema dump today and one you did last month. That's a cool use case, but there are others. Suppose you want to build a test server without all the production data, or with only a subset of it? Again, a schema-only dump is your friend.

o Use unix diff to compare objects in two different databases
o Audit schema changes over time
o Build test servers without all the production data
o Parallel loading of data
- use multiple mysql processes to import data
- break up the data only portion of your backup

Here's the syntax to get the schema only dump:

$ mysqldump --no-data --all-databases > db_schema_`/bin/date +%m%d%Y`.mysql

Here's how you get the data only dump:

$ mysqldump --no-create-info --all-databases --complete-insert --skip-quote-names > db_data_`/bin/date +%m%d%Y`.mysql
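As a quick illustration of the diff use case above, comparing last month's schema dump against today's is a one-liner (filenames are hypothetical):

$ diff db_schema_01152014.mysql db_schema_02152014.mysql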

Looking for a MySQL DBA position? Check out our interview questions guide.

5. A few more options


--complete-insert

By default mysqldump builds INSERT statements to *follow* the CREATE statements, so they assume all the columns match. If you break up your schema & data you could potentially have a mismatch. This option makes sure the INSERT statements included in the dump specify column names. So if they don't match, you'll get a proper error message.


--skip-quote-names

MySQL's default dump settings will quote everything under the sun. That's to protect you if you use certain reserved words, or worse, if you have objects with spaces in their names. But of course best practices advise against *both* of those in your database, so be advised. This option will make your dump much more readable and easier to edit.


--hex-blob

We had a customer who was using the cross-platform media publishing system K4. When we moved data from one server to another, we had a terrible time at first. That's because this enterprise application stores data in MySQL blobs. Unfortunately the default mysqldump options didn't account for certain special characters, which were getting munged. The way around that is to use this option. If you have blobs, you really need this.
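Putting those last three options together, a data-only dump for a blob-heavy application might look roughly like this:

$ mysqldump --no-create-info --complete-insert --skip-quote-names --hex-blob --all-databases > db_data_`/bin/date +%m%d%Y`.mysql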

Read this far? Grab our scalable startups newsletter.

Root Cause Analysis – What is it and why is it important?

Root Cause Analysis is the means to identify the ultimate source and cause of an outage. When an outage occurs that causes serious downtime of a website, organizations are typically in crisis mode. The urgency of resolution sometimes pushes aside due process, change management and general caution. Root Cause Analysis attempts, as much as possible, to isolate logfiles, configurations, and the current state of systems for later analysis.

With traditional physical servers, hardware failure, operator error, or a security breach can cause outages. Since you're dealing with one physical machine, resolving the issue necessarily means disturbing the very things that broke. So caution and later analysis must be balanced against immediate problem resolution.

Another silver lining of cloud hosted solutions is around root cause analysis. If a server was breached, for example, that server can immediately be shut down while maintaining its current state as a disk or EBS snapshot. A new server can then be fired up from an AMI image, your server rebuilt from scripts or a template, and you're back up and running. Save the snapshot for later analysis.
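With the AWS command line tools, for example, that sequence might look roughly like this. The instance, volume and AMI ids are made up:

$ aws ec2 stop-instances --instance-ids i-0abc1234
$ aws ec2 create-snapshot --volume-id vol-0def5678 --description "preserve for root cause analysis"
$ aws ec2 run-instances --image-id ami-0123abcd --instance-type m3.large --key-name ops-key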

This approach can be used for analysis of operator-error related outages as well. Hardware failures are more expected and common in cloud hosted environments, so this should, and really must, push adoption of best practices around infrastructure: having scripts at hand that rebuild everything from bare metal.

More discussion of root cause analysis by Sean Hull on Quora.

Backups – What are they and why are they important?

Backups are obviously a crucial component in any enterprise application.  Modern internet components are prone to failure, and backups keep your bases covered.  Here’s what you should consider:

  1. Is your database backed up, including object structures, data, stored procedures, grants, and logins?
  2. Is your webserver doc-root backed up?
  3. Is your application source code in version control and backed up?
  4. Are your server configurations backed up?  Relevant config files might include those for Apache, MySQL, memcache, PHP, email (postfix or qmail), Tomcat, Java Solr or any other software your application requires.
  5. Are your cron or supporting scripts and jobs backed up?
  6. Have you tested all of these components and your overall documentation with a fire drill?  This is the proof that you’ve really covered all the angles.

If you do your backups right, you should be able to restore without a problem.
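A bare-bones sketch of a nightly script covering several of these items, with hypothetical paths and no error handling:

#!/bin/bash
# nightly backup sketch: database, docroot, configs, cron
STAMP=`/bin/date +%m%d%Y`
mysqldump --all-databases --routines --single-transaction > /backups/db_$STAMP.mysql
tar czf /backups/docroot_$STAMP.tar.gz /var/www/html
tar czf /backups/configs_$STAMP.tar.gz /etc/apache2 /etc/mysql
crontab -l > /backups/crontab_$STAMP.txt

The real proof, as point 6 says, is restoring from these files in a fire drill.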

Sean Hull asks on Quora – What are backups and why are they important?