MySQL Disaster Recovery

Like all databases, MySQL needs a disaster recovery plan. In this article we discuss a specific experience at a client site where disk errors began to cause database problems, how the disk was replaced, and how the database was recovered.
Introduction

MySQL is a great database, and for this client, with 2,000 subscribers and an average of 200,000 hits per month, it is more than enough. In our case we are using MyISAM tables, which do not support transactions. That means that as soon as you insert, update, or delete data, the change is immediately reflected in the database files. Whether that is appropriate and sufficient for your application is beyond the scope of this article.
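If you are not sure which storage engine your tables use, a quick way to check (using the example database name that appears later in this article) is something like:

$ mysql -e "SHOW TABLE STATUS" my_db_name

The Type column (called Engine in newer MySQL versions) shows MyISAM, InnoDB, and so on for each table.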
Discovering the problem

The application first started showing trouble with errors on the website about sessions. My first thought was that the database itself could be down, so I checked for running processes:

$ ps auxw | grep mysql
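Another quick sanity check, assuming the MySQL client tools are installed, is to ask the server directly whether it is answering connections:

$ mysqladmin ping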

Seeing that the processes were running, I thought one might be hung, so I stopped and started the database just to be sure. This is a quick process, so it wouldn’t impact things much. Besides, most frontend users were probably already experiencing the session problem, since almost every page you view on the site checks your session identifier:

$ /etc/rc.d/init.d/mysql stop

$ /etc/rc.d/init.d/mysql start

The session problem continued to rear its ugly head, so I looked into table checking. First I ran myisamchk on the sessions table:

$ cd /var/lib/mysql/my_db_name/

$ myisamchk activesessions.MYI

The name of the table in this case is “activesessions”, and the name of the database, for the purposes of this example, is “my_db_name”. Checking the frontend, I saw the site begin to work again, so I thought the problem was solved.
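One caveat worth mentioning: myisamchk works directly on the table files, so it is safest to run with the server stopped or the tables flushed and locked. An alternative that goes through the running server, if you prefer, is CHECK TABLE:

$ mysql -e "CHECK TABLE activesessions" my_db_name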
Within fifteen minutes I’m getting paged by the client again, and I realize the problem is not solved. This is not normal behavior, and I’m starting to worry about corruption. I shut down the database and do a thorough repair of all tables (-r recovers the tables, -s runs silently, and --update-state marks the tables as checked):

$ myisamchk -r -s --update-state *.MYI

I start the database again and find that there are still intermittent problems. I also start checking the logfiles: in /var/log/mysqld.log I find that the database is crashing and then being restarted every few seconds. Corruption, I wonder? I verify that our backups are intact, then start looking further afield. I check /var/log/messages and find something serious:

Mar 12 01:58:51 www kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }

Mar 12 01:58:51 www kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=857785, sector=857712

Mar 12 01:58:51 www kernel: end_request: I/O error, dev 03:01 (hda), sector 857712

Mar 12 01:58:51 www kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }

Mar 12 01:58:51 www kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=857785, sector=857720

Mar 12 01:58:51 www kernel: end_request: I/O error, dev 03:01 (hda), sector 857720

I immediately call support and discuss the problem.
Replacing the Disk

Luckily the disk is still working well enough that a “ghost” of the disk can be made onto a new disk. If the disk had failed outright, it would have been a much more difficult recovery. The support folks go ahead and do the ghost procedure after shutting the machine down completely; it is something roughly akin to:

$ dd if=/dev/hda of=/dev/hdb bs=1024

The Unix command “dd” does a device-to-device copy and will copy every block without change. The support folks may have a hardware solution that does this instead.
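One note in case you try this yourself: a plain dd will stop at the first unreadable sector. If the source disk does have bad sectors, a variant along these lines keeps going and pads the unreadable blocks with zeros (at the cost of losing whatever was in them):

$ dd if=/dev/hda of=/dev/hdb bs=1024 conv=noerror,sync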
Repairing the Database

After the machine came back online, I shut down the database again (it starts automatically at boot).
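The shutdown and the change into the data directory are the same commands as before:

$ /etc/rc.d/init.d/mysql stop

$ cd /var/lib/mysql/my_db_name/

Then I run the thorough repair again: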

$ myisamchk -r -s --update-state *.MYI

Once that’s done, I start the database back up:

$ /etc/rc.d/init.d/mysql start

Problem solved.
Monitoring

Going forward, it couldn’t hurt to write some scripts for Nagios to watch /var/log/messages for errors; searching for the string “UncorrectableError” would make sense. In addition, a script to monitor the results of myisamchk would also be a good idea.
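As a rough sketch of what such a check might look like (the script name is hypothetical, and I am assuming Nagios runs it as a standard plugin that returns 0 for OK and 2 for CRITICAL):

#!/bin/sh
# check_disk_errors.sh - hypothetical Nagios-style plugin that looks for
# uncorrectable disk errors in the kernel log and returns a standard
# plugin exit code (0 = OK, 2 = CRITICAL).
if grep -q "UncorrectableError" /var/log/messages; then
    echo "CRITICAL: UncorrectableError found in /var/log/messages"
    exit 2
fi
echo "OK: no uncorrectable disk errors logged"
exit 0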
Conclusion

MySQL recovered fine from this brush with disaster, but one may not always be lucky enough to have the disk replacement be so straightforward. That’s why database dumps and regular backups are so important. Be proactive, or learn Murphy’s Law the hard way; one way or the other, we all run into these issues sooner or later.
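Even a simple nightly dump goes a long way. Something like the following (the output path is just a placeholder) can be dropped into cron, with the resulting file copied to another machine:

$ mysqldump --opt my_db_name > /backup/my_db_name-$(date +%Y%m%d).sql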