The 2am phone call
Last summer I got my call from the president at 2am. Actually, it was my former boss at the Hollywood Reporter. I had worked there three months earlier, and they had since hired an outsourced DBA solution. Big outsource, big chops. And big fail.
12 hours to liftoff
I was scrambling to pack my luggage to go on summer vacation. I was bound for SF and my flight was leaving in the morning. I was trying to wrap up loose ends and my former boss was entreating me: “Can you help us? Our replication setup has just melted down. We need you to clean up the mess.”
The so-called pain point
After a few more early-morning Skype calls and chats, the team retired for the night and I finished packing my bags. I snuck in an hour of sleep, then headed straight for the airport. Once through airport security, I bust out my laptop and start logging into the servers.
Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine the differences. Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both databases. (For the curious, here is how.) I find countless records that differ between the two databases.
Now my flight is boarding, so I pack up the laptop and find my seat. As soon as the seat belt lights flash off, I’m flipping open my MacBook and getting in-flight wifi working. Throughout the flight I’m on Skype with the team, with command line terminals open to the servers. Discuss, debug, troubleshoot – rinse, repeat.
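For readers who want the idea behind that scan, here is a minimal Python sketch of chunked checksum comparison, which is broadly the approach a tool like pt-table-checksum takes. It is not the tool's implementation or its command line. The sample rows, the chunk size, and the function names `chunk_checksums` and `differing_chunks` are all hypothetical, and a real table would be read from the servers rather than from in-memory dictionaries.

```python
import hashlib


def chunk_checksums(rows, chunk_size=2):
    """Group rows into primary-key ranges and checksum each range.

    `rows` maps primary key -> row value (a stand-in for a real table).
    """
    chunks = {}
    for pk in sorted(rows):
        # Chunk by key range so a missing row only disturbs its own chunk.
        chunks.setdefault(pk // chunk_size, []).append((pk, rows[pk]))
    return {
        idx: hashlib.md5(repr(chunk).encode("utf-8")).hexdigest()
        for idx, chunk in chunks.items()
    }


def differing_chunks(primary_rows, replica_rows, chunk_size=2):
    """Return the sorted chunk indexes whose checksums do not match."""
    a = chunk_checksums(primary_rows, chunk_size)
    b = chunk_checksums(replica_rows, chunk_size)
    return sorted(i for i in set(a) | set(b) if a.get(i) != b.get(i))


# Hypothetical data: the replica has drifted on the row with key 4.
primary = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
replica = {1: "alice", 2: "bob", 3: "carol", 4: "dan"}
print(differing_chunks(primary, replica))
```

Comparing one checksum per chunk, rather than every row, keeps the scan cheap. Only the chunks that disagree need a closer row-by-row look.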
From there I write up a report and explain to the team & CTO the problem. Syncing that many different records is too risky. We’d have to review all the statements one-by-one. I’d rather rebuild replication from scratch.
From there the CTO gives the go-ahead, and with the help of Percona’s xtrabackup to do online hot backups, we are able to fix replication without downtime. Amen to that!
Now with our primary MySQL database and secondary read-only one back online, things calm down a lot. Traffic returns to a smooth predictable 2 million pageviews per day. That’s smooth and predictable on a site that gets 50 million a month! The database loads are calm and steady, as are all of our nerves. In the coming days we continue to monitor the situation, and write up a lengthy root cause analysis.
Freelancers & Consultants take note
To my recent Consulting 101 article I would add the following bullets:
- Responsiveness is crucial
- Be an integral part of your team
- Have laptop, will travel
- Don’t break things
- Small & Nimble wins the day
- Choose passionate, yet conservative & risk averse operations folks
Be there when a client needs you, and your value goes up. Be reliable, and loyal to those you’ve worked with.
Everyone knows each other, virtually or in real life, and is comfortable with the part they play. A team that can work together is crucial, whether it’s all full-time folks, some consultants, some outsourced, or wherever they may be. Each has a role to play, and communication and teamwork bring it all together.
I never turn down a job. There will be plenty of time for vacations and rest when the dust settles.
If there is any doubt in your mind, test, and test again. Always err on the side of caution. Check thrice and cut once! If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with the options to be sure. And most importantly, if you don’t log in to the systems you’re working on regularly, you’d better make damn sure you’re on the right box, flipping the right switch, and turning the right dials. With modern internet infrastructure, there are a hundred ways to push the wrong red button!
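One cheap way to act on the "right box" advice is to have a script refuse to run anywhere but the host you intended. This is a minimal Python sketch of that idea, not a description of any particular tooling. The function name `confirm_host` and the hostnames are hypothetical. The only library call it relies on is the standard `socket.gethostname()`.

```python
import socket


def confirm_host(expected, actual=None):
    """Raise unless the current host matches the one we meant to touch.

    `actual` defaults to this machine's hostname; it can be passed in
    explicitly to try the check without being on a given server.
    """
    if actual is None:
        actual = socket.gethostname()
    if actual != expected:
        raise RuntimeError(
            f"Refusing to run: on {actual!r}, expected {expected!r}"
        )
    return actual


# Hypothetical hostnames: we meant the replica but landed on the primary.
try:
    confirm_host("db-replica-01", actual="db-primary-01")
except RuntimeError as err:
    print(err)
```

Calling `confirm_host("db-replica-01")` as the first line of a risky script turns a wrong-terminal mistake into a harmless error message.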
CTOs and Directors of Operations take note
I’ve used this value proposition before when speaking to prospects. You can hire a big firm, and be a small fish to them. Small fish means you’re gonna get less attention. OR you can hire a small firm or contractor. Then you’ll be a big fish to him or her. Guess what? If you’re their big fish, they’re gonna pay extra attention to every move they make, and ensure things don’t break. They can’t afford mistakes, not to their reputation or their bottom line. Not like the big boys can.
In hiring developers, you’re building technology and features, and forging ahead into new solutions. The role is more to create waves and break barriers. How can we enable new business processes and so forth?
In hiring operations personnel, you want stability. Look for individuals who are more risk averse. This conservative streak is a countering force. Ops teams are tasked with the job of bringing a steady state to your business services. They don’t want to wake up at 2am.