What Ops doesn’t tell you about your MySQL Database

MySQL is a scalable platform that has proven robust even in dense, complex data environments. Its indispensable replication feature is often ‘sold’ as fail-safe, so you have little to sweat about as long as your backups run regularly. What the ops guys aren’t telling you is that MySQL replication runs with small margins of error that can cause big problems when disaster strikes.


The Scenario

Imagine the scene: you use replication to back up your data. Your secondary database is your peace of mind. It’s the always-on clone of your crown jewels. You even run backups off of it so you don’t impact your live website. Your backups run without errors. Your slave database runs without errors. Then the dreaded day comes when your primary database fails. You instruct your team to switch your application over to the live backup database. The site comes online again. But all is not right. You notice subtle differences, and your team begins to question how deep the data divide could be.

The Problem with MySQL replication

Although MySQL replication is fairly easy to set up, and even to keep running without error, you may have unseen problems. MySQL’s core technology for replicating data between master and slave has historically been statement-based: the master logs the SQL statements it executes, and the slave replays them. A statement whose result depends on the server it runs on, such as one that calls a non-deterministic function, can produce different rows on each side without raising any error. Scenarios like this cause what you might call database corruption on other platforms, that is, a silent drift of the slave’s data away from what the tables and rows contain on the master. It is no fault of your own, nor, one might argue, of your operations team. It is a fundamental flaw in how MySQL performs replication.
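
To make the drift concrete, here is a small Python sketch of statement replay. It does not talk to MySQL at all: the Server class and its apply method are hypothetical stand-ins, and uuid.uuid4() plays the role of a non-deterministic SQL function such as UUID(). Both servers run the same statement without error, yet they end up holding different data.

```python
import uuid


class Server:
    """Hypothetical stand-in for a MySQL server: a table is a dict of id -> value."""

    def __init__(self):
        self.rows = {}

    def apply(self, statement):
        # Statement-based replication ships the statement, not the resulting
        # rows, so each server evaluates the statement independently.
        statement(self.rows)


def insert_token(rows):
    # Stands in for: INSERT INTO sessions (id, token) VALUES (1, UUID())
    rows[1] = str(uuid.uuid4())


master, replica = Server(), Server()
master.apply(insert_token)   # executed on the master
replica.apply(insert_token)  # replayed on the replica

same_row_count = len(master.rows) == len(replica.rows)
drifted = master.rows != replica.rows
print("row counts match:", same_row_count)
print("contents drifted:", drifted)
```

Note that a row count alone would not reveal the problem here, which is why this kind of drift can sit unnoticed for months.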

The Solution

Fortunately, there is a solution. Checksums, that wonderful computational tool for comparing things, can be put to work comparing databases. The Percona Toolkit (the successor to Maatkit) includes just such a utility for MySQL, pt-table-checksum, which you can use to check the integrity of your slave databases.
If you’ve never performed such a check, you should do so ASAP. If your database has been running for months at a stretch, chances are that differences are lying undiscovered between the two systems.
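
The idea behind a checksum comparison can be sketched in a few lines of Python. This is only an illustration of the technique, not how the Percona tool is implemented: the function names, the dict-based tables, and the chunk size are all assumptions made for the example. Rows are split into chunks by primary key, each chunk is hashed, and only chunks whose hashes disagree need a closer look.

```python
import hashlib


def chunk_checksums(rows, chunk_size=2):
    """Return {chunk_index: sha256 hex digest} for rows ordered by primary key."""
    ordered = sorted(rows.items())
    sums = {}
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        # Serialize the chunk deterministically before hashing it.
        payload = repr(chunk).encode("utf-8")
        sums[start // chunk_size] = hashlib.sha256(payload).hexdigest()
    return sums


def differing_chunks(master_rows, replica_rows, chunk_size=2):
    """List the chunk indexes whose checksums are missing or do not match."""
    m = chunk_checksums(master_rows, chunk_size)
    r = chunk_checksums(replica_rows, chunk_size)
    return sorted(k for k in m.keys() | r.keys() if m.get(k) != r.get(k))


master_rows = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
replica_rows = {1: "alice", 2: "bob", 3: "carol", 4: "dan"}  # silent drift in row 4

print(differing_chunks(master_rows, replica_rows))  # [1]
```

Comparing per-chunk hashes rather than every row keeps the amount of data exchanged small, and it narrows any mismatch down to a small range of keys.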

Depending on how much of your data changes, you can keep running this tool periodically to confirm that everything is consistent. If an integrity check fails, the toolkit includes another tool, pt-table-sync, to synchronize the differences and bring everything back into order.
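
Conceptually, synchronizing means working out the smallest set of changes that makes the slave match the master again. The toy Python sketch below shows that step under the assumption that both tables fit in memory as dicts keyed by primary key; plan_sync and apply_sync are hypothetical names, and a real tool has to do this inside MySQL with far more care.

```python
def plan_sync(master_rows, replica_rows):
    """Return (upserts, deletes) that would make replica_rows equal master_rows."""
    upserts = {
        key: value
        for key, value in master_rows.items()
        if key not in replica_rows or replica_rows[key] != value
    }
    # Rows that exist only on the replica must be removed.
    deletes = sorted(key for key in replica_rows if key not in master_rows)
    return upserts, deletes


def apply_sync(replica_rows, upserts, deletes):
    """Apply the planned changes to the replica in place."""
    replica_rows.update(upserts)
    for key in deletes:
        del replica_rows[key]


master_rows = {1: "alice", 2: "bob", 4: "dave"}
replica_rows = {1: "alice", 2: "rob", 3: "stale", 4: "dave"}

upserts, deletes = plan_sync(master_rows, replica_rows)
print("upserts:", upserts)
print("deletes:", deletes)

apply_sync(replica_rows, upserts, deletes)
print("in sync:", replica_rows == master_rows)
```

Printing the plan before applying it mirrors a sensible habit with any sync tool: review what will change before you let it touch your data.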