Join 15,100 others and follow Sean Hull on twitter @hullsean.
After some intermittent outages over the past few weeks, we managed to get the site rebuilt on a brand new server. Yes, we’re still hosted at a traditional datacenter, 1and1.com to be exact. They were very helpful through the recovery process.
Here are five lessons learned.
1. Keep various types of backups
While the site was down, I was scrambling to work out what I had lost and what would be the fastest way to get the site back online. First there are the content, the images and the database backup. But WordPress also has an export function, so an additional XML backup of all site content proved invaluable. Then there is the software, the WordPress content management system itself. Yes, you can download a new copy, but what about the plugins, and the header modifications to support Google Analytics?
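One way to cover the "software plus modifications" part is to archive the whole site folder on a schedule, alongside the database dump and the XML export. Here is a minimal Python sketch of that idea. The function name `archive_site`, the `label` parameter and the example paths are my own placeholders, not anything WordPress provides, and it only covers files on disk, not the database.

```python
import tarfile
import time
from pathlib import Path


def archive_site(site_dir, backup_dir, label="wordpress"):
    """Write a timestamped .tar.gz of site_dir into backup_dir; return its path.

    Assumption: backup_dir lives OUTSIDE site_dir, otherwise the archive
    would try to include itself.
    """
    site_dir = Path(site_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{label}-{stamp}.tar.gz"

    with tarfile.open(target, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the site folder,
        # so plugins, themes and hand-edited header files all come along.
        tar.add(site_dir, arcname=site_dir.name)
    return target
```

Calling something like `archive_site("/var/www/mysite", "/backups")` (both paths hypothetical) from cron would leave a dated tarball you can unpack on a fresh server.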
2. Be patient with the support staff
Eventually you’ll need to get on a call with the support techs. They may try to help, but they will also speak of worst-case scenarios and ask if you have a backup, as if you’re really going to need it! Remain calm, and ask for clarification.
In my case, the mirrored root drives looked unsalvageable. It turned out that although the md volumes (Linux’s software RAID) could not be rebuilt, the good drive contained most of the data I needed to restore. A filesystem check repaired the journals and brought back a copy of the data.
3. Keep detailed notes of all your components
When you’re scrambling, and frustrated, notes save you. There are way more moving parts than you can keep in your head, so why not keep good notes too?
For my site, there is WordPress and its version, MySQL and its version, config files for Apache & MySQL, dump files of individual schemas, installed plugins, themes, mods, the Google Analytics configuration, and passwords for the server and for WordPress itself. And don’t forget the logins to your hosting dashboard, PINs and secret identifiers. Write it all down!
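Part of those notes can be generated rather than typed. The sketch below is one possible approach, not what I actually ran: it takes a hand-maintained dictionary of component versions plus a list of config file paths, and produces a JSON inventory with a checksum for each file so you can tell later whether a restored config matches. The name `build_inventory` and the JSON layout are made up for illustration, and passwords are deliberately left out of it.

```python
import hashlib
import json
from pathlib import Path


def build_inventory(components, config_paths):
    """Return a JSON string describing components and config files.

    components:   dict of name -> version string, filled in by hand
    config_paths: iterable of file paths to fingerprint
    """
    files = []
    for path in map(Path, config_paths):
        if path.is_file():
            data = path.read_bytes()
            files.append({
                "path": str(path),
                "bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            })
        else:
            # Record missing files too; a gap in the notes is worth knowing.
            files.append({"path": str(path), "missing": True})

    inventory = {"components": components, "config_files": files}
    return json.dumps(inventory, indent=2, sort_keys=True)
```

Saving the output somewhere off the server, next to your backups, means the notes survive the same failure the site did not.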
4. Monitor and test more
As it turns out, I wasn’t hacked this time around; I was having a disk failure, and it was manifesting in a strange way. Software RAID needs to be monitored, as do the syslog, MySQL and Apache logs.
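For Linux software RAID, the kernel reports array health in `/proc/mdstat`, where a healthy two-disk mirror shows something like `[2/2] [UU]` and a degraded one shows `[2/1] [U_]`. A rough Python sketch of a check you could run from cron follows. The function name `degraded_arrays` and the sample text are illustrative, and the parsing assumes the common mdstat layout with the status line directly under the array line.

```python
import re

# Array header line, e.g. "md0 : active raid1 sda1[0] sdb1[1]"
ARRAY_RE = re.compile(r"^(md\w+)\s*:")
# Status on the following line, e.g. "104320 blocks [2/2] [UU]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      4194240 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that have fewer active disks than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None  # status consumed; wait for the next array header
    return degraded
```

On the sample above, `degraded_arrays(SAMPLE)` returns `['md1']`. On a real server you would pass in the contents of `/proc/mdstat` and send yourself an email whenever the list is not empty.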
Also, always be testing. I found that although I was using Pingdom to notify me of outages, it would not *repeat* that notification. My outage ran off and on for many days, and I had no idea it was happening: Pingdom had sent me one notification, which I missed, and none to follow.
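If you run your own check script as a second line of defense, the repeat logic is small. This is a generic sketch, not Pingdom's behavior or API: `should_notify` and its parameters are names I made up, and all times are plain seconds (for example from `time.time()`).

```python
def should_notify(now, down_since, last_sent, repeat_every=1800):
    """Decide whether to send an alert right now.

    now:          current time in seconds
    down_since:   when the current outage started, or None if the site is up
    last_sent:    when the last alert went out, or None if never
    repeat_every: seconds to wait before nagging again (assumed default: 30 min)
    """
    if down_since is None:
        return False  # site is up, nothing to report
    if last_sent is None or last_sent < down_since:
        return True   # first alert for this outage
    # Same outage still going: repeat once the interval has passed.
    return now - last_sent >= repeat_every
```

The point of the last line is exactly what I was missing: one missed message should not be the end of the story while the site is still down.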
Check this: Why high availability is so very hard to deliver
5. Don’t forget the holidays!
Yes, an outage is serious, but keep a sense of perspective. Have a splash page sitting at the ready, so you can put it up while you’re fiddling with all the components.
Although your digital billboard is down, your audience and customers are probably more patient than you realize. If you provide valued content, they’ll be back!