Category Archives: Database Operations

How to automate MySQL slow query analysis with Amazon RDS


If you’ve used relational databases for more than ten minutes, I hope you’ve heard of slow queries. They’re those pesky little gremlins slowing down your startup and preventing the scalability you so desperately need.

Luckily there’s a solution. What I’ve found is that if I send a report to developers every week, it keeps these issues front and center for folks who are very busy indeed.

The script below is for RDS, but you can surely modify it if you have a physical server or roll-your-own MySQL box on Amazon. Take a look & enjoy!

Join 26,000 others and follow Sean Hull on twitter @hullsean.

1. install percona tools

Percona, as many probably already know, is a wildly successful services firm that supports MySQL and related technologies. They also have a very popular & scalable MySQL distribution by the same name.

Even if you’re not using Percona MySQL, you definitely want to get ahold of the Percona Toolkit. It provides all sorts of useful tools, including the one this article is based on, pt-query-digest.

This tool takes your stock MySQL slow query logfile as input, and summarizes it into a very useful and readable report. Formerly mk-query-digest, it’s now called pt-query-digest. See below.

You can install the Percona tools easily by grabbing the repository package and installing it with rpm. From there you can just use yum or apt-get, depending on your distribution.
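
For example, on a yum-based system that might look something like this. The repository URL below is the one Percona publishes these days, so double-check it against their site if it has moved:

$ sudo rpm -Uvh https://repo.percona.com/yum/percona-release-latest.noarch.rpm
$ sudo yum install percona-toolkit

On Debian/Ubuntu boxes the package is often available straight from the distro repos via apt-get install percona-toolkit.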

Related: Why a killer title can make or break your content efforts

2. install aws command line tool

Amazon has consolidated all its command line tools into a single one called simply “aws”. The options can be a little arcane, and the error messages misleading besides. The good news is it’s slightly easier to install & configure.

Do you already use Python? Install it this way:


$ pip install awscli

If not, you’ll need to dig into the AWS CLI installation instructions further.

Also: Do managers underestimate operational costs?

3. edit .aws/config

After you get the tool installed, you need to set up your environment. I edited a file named /home/shull/.aws/config as follows:


[default]
region = us-east-1
aws_access_key_id = BLIBJZMKLWIL5UTNRBMQ
aws_secret_access_key = MF5J/2z7HmN92lQUrV12ZO/FBXNjDVjL52TNRWsG

You can find your access_key_id and secret_access_key on your Amazon dashboard. Click the upper right-hand corner under your name and select the menu item “Security Credentials”.
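
Once that file is in place, a quick sanity check is to list the available log files for your instance. If the credentials and region are right, you should see the slow query logs among them (the instance name here matches the one used later in this post):

$ aws rds describe-db-log-files --db-instance-identifier seandb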

Check out: Are SQL Databases Dead?

4. edit send_query_report.sh

I wrote the script below so you can fairly easily edit it.


#!/bin/bash
#

# get the rds db instanceID from command line (or crontab) entry
#
AWS_INSTANCE=$1

# here's where we'll store the latest slowquery.log
#
SLOWLOG=/tmp/rds_slow.log
#SLOWLOG=`/bin/ls -tr /home/shull/*.log | /usr/bin/tail -1`

# fetch slow query log from rds box
# here I always grab the latest one.
#
/usr/local/bin/aws rds download-db-log-file-portion --db-instance-identifier $AWS_INSTANCE --output text --log-file-name slowquery/mysql-slowquery.log > $SLOWLOG

# query report output
SLOWREPORT=/tmp/reportoutput.txt

# pt-query-digest location
MKQD=/usr/local/bin/pt-query-digest

# run the tool to get analysis report
$MKQD $SLOWLOG > $SLOWREPORT

# today's date in a variable
TODAY=`/bin/date +\%m/\%d/\%Y-\%H:\%S`
#YESTERDAY=`/bin/date -d "1 day ago" +\%m/\%d/\%Y-\%H:\%S`

# report subject
SUBJECT="Sean Query Report -- $TODAY "

# recipient
EMAIL="hullsean@gmail.com"

# send an email using /bin/mail
/usr/bin/mailx -s "$SUBJECT" "$EMAIL" < $SLOWREPORT

Note, if you don't have mailx installed, it should be available in your repository. Use apt-get or yum as necessary to get it installed.

Also: Is high availability overrated & near impossible to deliver?

5. Add to crontab

After you've tested the above script from the command line, you will want to add it to a weekly cron job. Voila, automation! Don't forget to chmod +x to make it executable. :)


00 09 * * 5 /home/shull/send_query_report.sh seandb
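
And for reference, that first manual test might look like this, using the same path & instance name as the cron entry above:

$ chmod +x /home/shull/send_query_report.sh
$ /home/shull/send_query_report.sh seandb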

Read: Are MySQL DBA's impossible to find?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don't work with recruiters

If you use MySQL in the Amazon cloud, you need to ask yourself this question

Join 25,000 others and follow Sean Hull on twitter @hullsean.

Are you serious about backups?

If you’re just using Amazon EBS snapshots, that may not be sufficient. There’s a good chance it won’t protect you against your next data loss.

That’s why I like to have a few different types of backups.

Also: 5 more things deadly to scalability

Protect against operator error

mysqldump is a tool every DBA is familiar with. Same as a hot backup or snapshot you say? Just more labor? Not true.

A dump allows you to restore one table, or one schema. That’s why dumps are also known as logical backups. What’s more, you can edit the file: remove indexes, change object names or datatypes. All of these can be essential in the screwy and unpredictable event of a real-world outage.
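
For example, dumping and later restoring a single table might look something like this. The database & table names here are just placeholders:

# dump one table from one schema
$ mysqldump -u backup_user -p mydb important_table > important_table.sql

# restore just that table, after editing the file if need be
$ mysql -u backup_user -p mydb < important_table.sql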

Expect the unexpected!

Read: Why devops talent is in short supply

Test those backups regularly

If you haven’t actually tried to restore, you really don’t know if you have everything. Did you back up stored procedures & database code? How about grants? Database events? How about cronjobs? What about the my.cnf file? And your replication configuration?

Yes, there are a lot of little pieces, and testing your backups by rebuilding everything is how you poke holes in your plan and hit issues before D-day!
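
As a rough sketch, capturing some of those little pieces might look like this. pt-show-grants ships with the Percona Toolkit; the user name and paths are placeholders:

# include stored procedures, triggers & events in the dump
$ mysqldump -u backup_user -p --all-databases --routines --triggers --events > full_dump.sql

# capture grants separately
$ pt-show-grants -u backup_user --ask-pass > grants.sql

# don't forget the config itself
$ cp /etc/my.cnf /backup/my.cnf.$(date +%Y%m%d)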

Related: MySQL interview guide for managers and candidates alike

Replication isn’t a backup

Replication is getting better and better in MySQL. It used to fail regularly. MyISAM was very unpredictable. But even in the comfortable realm of InnoDB, there can still be data drift. If you’re on MySQL 5.0 or 5.1, you should consider performing regular checksums. These test the integrity of your data by comparing what’s actually in the master & slave. See Bulletproofing MySQL replication with checksums.
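
A sketch of such a checksum run with the Percona Toolkit’s pt-table-checksum follows. The host & user names are placeholders; the tool records results in a checksums table and reports diffs per slave:

$ pt-table-checksum --replicate=percona.checksums h=master-db,u=checksum_user --ask-pass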

Read: Why high availability is so very hard to deliver

Have you considered security around your backup files?

While you’re thinking about backups, make sure the files themselves are secure. Remember, they contain your crown jewels. Hopefully any sensitive data is already encrypted, but you should still secure the files’ final resting place as well.

If you’re using S3, consider encrypting the file before shipping it up to the bucket.
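
A rough sketch of that, using gpg for symmetric encryption and the aws tool from earlier; the bucket name is a placeholder:

$ gpg --symmetric --cipher-algo AES256 -o backup.sql.gz.gpg backup.sql.gz
$ aws s3 cp backup.sql.gz.gpg s3://my-backup-bucket/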

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

Why startups need techops


I was at a talk recently on node.js. Even if I’m not working with a technology directly, it’s exciting to see what’s out there, and node.js is bringing some hyper fast performance to a certain category of web applications.

During the keynote, the speaker mentioned a service to deploy applications on. I can’t name names unfortunately, but it was a cloud solution on top of which you could deploy your application. Go this route and you can do without an operations team. Avoid the overhead of hiring ops, he claimed. And hey, then you can hire more developers!

To be fair, I’ve heard much of the same thing at DBA or Linux conferences. I can’t count the number of stories that start with “what some idiot developer did that took down our production systems…”.

Yes, it seems dev & ops are still just a tad bit adversarial.

Join 13,000 others and follow Sean Hull on twitter @hullsean.

1. My little known origins as a developer

Many colleagues and clients I’ve met in the New York City startup industry know me primarily as an operations & scalability guy. I tune databases, infrastructure and components to make things lightning fast.

I spent my earliest years at university on the computer lab operations staff. We watched and managed, made sure level zero backups were taken care of, and moved the tapes. Directly after college, I started at a software firm. I did C++ GUI development on the Mac, using the toolbox libraries with Metrowerks CodeWarrior. I built split windows and scroll bars, and displayed rows of data with nice resizable columns. All this wasn’t built into the class library, so for a lot of it we needed to roll our own solution.

We always had a long list of features coming from the business units. I also fielded many support calls, often from the Windows platform, as the code there hadn’t been managed and built as carefully. But that too was instructive, as you could feel the pain of customers’ day-to-day challenges. It also illustrated the tradeoffs between new code and features, and existing bug fixes and support.

Also: Why generalists are better at scaling the web

2. A trip through the dot-com bubble as Oracle DBA

Through a circuitous path, I moved to New York in the mid-nineties and joined a startup. I had the opportunity to wear a lot of hats there, and apply my computer lab and Linux operating systems experience to the challenge of managing Oracle. I got a lot more involved with operations quickly.

As the dot-com bubble grew, I saw a hot and growing demand for Oracle DBAs as most startups used Oracle, but the talent was in short supply. In one startup 80 million dollars was on the line as performance hobbled the website, and investors feared the worst.

Read: Why the Twitter IPO made a shocking admission about scalability

3. Different priorities & mandates

I remember working at Starmedia, a media darling at the time. I was analyzing the database & server systems and found some code & jobs running during peak daytime hours. Management claimed that could not be the case. Yet for the next days and weeks I saw the same jobs running. I held strong and spoke truth to power, as they say. That’s not always easy when you have a lot of investors, screaming CTOs and 100+ hour weeks. But eventually the source of the job was located and disabled. And the website returned to its speedy self.

These experiences though do underline in my mind the different priorities and focus that developers and operations staff have.

Techops, system administrators & DBAs are typically averse to change. They fight it tooth and nail. That isn’t because they like to be curmudgeons, though. They are typically very concerned about the business, but from a dramatically different perspective: stability and reliability, even at 2am. They are concerned about the longevity, consistency, and durability of data.

Developers, on the other hand, have a different mandate. They are responsible for new business features and solutions to business requirements. Rapid prototyping & reactive or agile methods are embraced because they mean you can deliver to the business quicker.

Crucially, both of these folks care very much for the business. Just with very different priorities.

Check this: Why AirBNB didn’t have to fail

4. Can developers do operations for you?

In a lot of small startups, the initial phase is obviously focused on building a product. That’s the build phase, and not surprisingly you hire a lot of developers. As you should. But as you grow, you may find the operational tasks that default to one or more developers are taking more and more of their time. As your customer base grows and you’ve seen your first few spikes, it’s time to start thinking about hiring for a real ops role.

In summary, yes they can, but perhaps not well.

Related: How to hire a developer that you can work with

5. Volume discount, made to order or instant coffee

You may choose to go with instant coffee by bringing someone in-house. You may find the right talent is hard to come by. I wrote about this: Why techops and DBAs are in short supply.

Alternatively you may prefer a volume discount from one of the larger remote DBA or managed support solutions such as Oracle’s, Pythian or Percona. These guys all provide great service, but keep in mind how big of a fish you are. You’ll likely work through a ticketing system, and in some cases different engineers will look at your systems at different times. You will likely need either a very hands-on technical CTO or other in-house person to take ownership, and manage things closely.

The third option is a made-to-order coffee. Yes you pay more for Toby’s, Blue Bottle, or Ninth Street Espresso but you get what you pay for as they say. A boutique shop or independent consultant will provide a lot more hand holding, help your internal staff get up to speed, and communicate intimately about the process. If you’re a more non-technical CTO, or you’re very busy running the business, this solution may make a lot of sense for you.

Also: Why cloud detractors need a history lesson

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

No tools to reconcile MySQL with two masters

Here’s the last nail in the coffin. We hope this convinces you not to write to multiple masters with MySQL replication.

Reason 10 – No tools to reconcile inconsistent masters

For all the endless reasons we’ve already outlined, MySQL replication is prone to failure. We know it’s going to happen, and now you know too.

When replication breaks, your application may still be writing to both sides, both masters. Now you have data differences on both sides: random rows in random tables.

The Percona Toolkit checksum tool is great when you have a single authoritative master. It can compare & show you diffs. But when you’ve written changes to two masters, you have no authoritative side. Neither one is the last word on your data. You’re then faced with the intractable problem of reconciling your data in some manual or do-it-yourself fashion.

Not only would such a scenario be hairy and prone to error, but you would effectively have an outage for the duration. All ugly scenarios to be sure.

Click through to the end for multi-master solutions that work with MySQL.

PREV: Reason 9 – Temp tables can break replication

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

Temp tables can break replication

If you’re not convinced yet that writing to dual masters is a bad idea, we have a couple more reasons.

9. Temp tables break replication after restart

MySQL’s replication is sensitive to temporary tables. You shouldn’t use them. If your queries create them to work against, and a node crashes, the temp tables will be missing upon restart. You can imagine what this will do to all those queries which expect to see a temp table. That’s right, they’ll fail.

Theoretically you could have a start script which creates and populates such temp tables. However this may add operational complexity if not carefully coordinated with developers & change management.

NEXT: Reason 10 – No tools to reconcile inconsistent masters

PREV: Reason 8 – Crashed nodes can corrupt your cluster

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

Crashed nodes corrupt your MySQL cluster

Writing to two masters is like walking around with a loaded shotgun. Eventually one of your instances will fail, and when it does, replication’s position & synchronization information could easily become corrupt!

Reason 8 – Crashed nodes cause big problems

MySQL instances, unfortunately, can crash. When that happens, they don’t always sync the replication position properly. That’s a big risk. It’s one of the reasons why operational DBAs know that replication slaves need to be rebuilt from time to time.

If the replication position doesn’t sync properly, MySQL may restart at the wrong position and replay or miss some transactions. What this would do to an active-active setup is unpredictable. Again, no single authoritative master means trouble!

MySQL 5.5 has introduced some parameters to address this. During my tests I’ve seen a performance hit, so use them with caution.

[code]
sync_binlog
sync_relay_log
sync_master_info
sync_relay_log_info
[/code]
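
In my.cnf those might look like the excerpt below. A value of 1 is the most conservative (sync after every event or commit) and also the slowest, which is where the performance hit comes from, so treat this as a sketch to tune rather than a recommendation:

[code]
[mysqld]
sync_binlog         = 1
sync_relay_log      = 1
sync_master_info    = 1
sync_relay_log_info = 1
[/code]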

NEXT: Reason 9 – Temp tables can break replication

PREV: Reason 7 – Can’t add third node easily

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

10 reasons active-active is hard and how to solve it

Multi-master replication provides redundant copies of your most important business assets. What’s more, it allows applications to scale out, which is perfect for cloud hosting solutions like Amazon Web Services.

But when you decide you need to scale your write capacity, you may be considering an active-active setup. This is dangerous, messy and prone to failure. We’ve outlined how.

Click through to the end for multi-master solutions that work with MySQL.

Reason 1 – auto_increment introduces new problems

  • MySQL’s auto_inc settings make it more difficult to change servers around in your overall replication topology (see the sketch after this list)
  • Using auto_inc settings can cause MySQL to introduce gaps in your primary keys, which is a waste of space
  • Such a solution would require all tables to have auto_inc primary keys
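
The typical two-master arrangement interleaves keys with my.cnf settings like the sketch below (values shown assume exactly two nodes), and this interleaving is exactly where the gaps come from:

[code]
# master 1
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 1

# master 2
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2
[/code]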

NEXT: Reason 2 – MySQL replication is brittle to start with

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

The Art of Resistance

Sometimes, you have to be the bad guy. Be resistant to change. Here’s a story about how stubbornness pays off. As we’ve written about before: A 4 letter word divides Dev & Ops.

I had one experience working as the primary MySQL DBA for an internet startup. Turns out they had Oracle for some applications too. And another DBA just to handle the Oracle stuff.

So it came time for the Oracle guy to go on vacation. Suddenly these Oracle systems landed on my shoulders. We reviewed everything in advance, then he bid his goodbyes.

Almost as soon as he was out the door I started getting requests to change things.
“Oh we have to add this field”, or “oh we’d like to make this table change”.

I resisted enough to hold off development for a week.

“We’re now getting more heat from the CTO. Apparently certain pages on the site don’t save some very important content properly. It’s costing the business a lot of money. We need to make this change.”

My response -

[quote]No I can’t sanction this. If you want to do it, understand that it could well break Oracle’s replication.[/quote]

Oracle’s multi-master replication notoriously requires a lot of babysitting. We held off for a few more days, and the Oracle DBA returned from his much-needed vacation.

When discussing it afterward he said…

[quote]Am very glad you didn’t change those fields in the database. It would indeed have broken replication and caused problems with the backups![/quote]

Moral of the story…

  • apply the brakes around tight turns
  • don’t be afraid to say, we’re not going to do this before testing
  • you may have to say – we’re not going to do this at all…

A History lesson for Cloud Detractors


We’ve all seen cloud computing discussed ad nauseam on blogs, on Twitter, Quora, Stack Exchange, your mom’s Facebook page… you get the idea. The tech bloggers and performance experts often chime in with their graphs and statistics showing clearly that dollar-for-dollar, cloud-hosted virtual servers can’t compete with physical servers in performance, so why is everyone pushing them? It’s just foolhardy, they say.

On the other end, management and their bean counters would simply roll their eyes saying this is why the tech guys aren’t running the business.

Seriously, why the disconnect? Open source has always involved a lot of bushwhacking…

Continue reading

What Ops doesn’t tell you about your MySQL Database

MySQL is a very scalable platform which has proven robust even in the densest and most complex data environments. MySQL’s indispensable replication function is ‘sold’ as fail-safe, so you have little to sweat about as long as your backups are running regularly. But what the ops guys aren’t telling you is that MySQL replication leaves small margins of error that can cause big problems in times of disaster.


The Scenario

Imagine the scene: you use replication to back up your data. Your secondary database is your peace of mind. It’s the always-on clone of your crown jewels. You even perform backups off of it so you don’t impact your live website. Your backups run without errors. Your slave database runs without errors. Then the dreaded day comes when your primary database fails. You instruct your team to switch your application over to point to your live backup database. The site comes online again. But all is not right. You notice subtle differences, and your team begins to question how deep the data divide could be.

The Problem with MySQL replication

Although MySQL replication is fairly easy to set up, and even to keep running without error, you may have unseen problems. MySQL’s core technology for replicating data between master and slave is primarily statement based. Various scenarios can cause what on other database platforms you might call database corruption: silent drifting of the slave’s data away from what the tables and rows on the master actually contain. It is no fault of your own, or perhaps one might argue even of your operations team. It is a fundamental flaw in how MySQL performs replication.

The Solution

Fortunately there is a solution. Checksums, that wonderful computational tool for comparing things, can be put to work nicely to compare databases. The Percona Toolkit (formerly Maatkit) includes just such a utility for use with MySQL. It can be used to check the integrity of your slave databases. If you’ve never performed such a check, you should do so ASAP. If your database has been running for months at a stretch, chances are there are differences lying undiscovered between the two systems.

Depending on the volume of change in your database, you can continue to use this tool periodically to confirm that all is consistent. If integrity checks fail, there is another tool in the toolkit to synchronize the differences and bring everything back to order.
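
A sketch of how those two utilities, pt-table-checksum and pt-table-sync, might be run follows. The host & user names are placeholders, and it’s worth reviewing the --print output before letting --execute touch any data:

# check slave integrity against the master
$ pt-table-checksum --replicate=percona.checksums h=master-db,u=checksum_user --ask-pass

# if diffs are reported, preview the fix, then apply it
$ pt-table-sync --replicate=percona.checksums h=master-db,u=checksum_user --ask-pass --print
$ pt-table-sync --replicate=percona.checksums h=master-db,u=checksum_user --ask-pass --execute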