Anatomy of a Performance Review

A lot of firms come to us with a specific scalability problem. “Our user base is growing rapidly and the website is falling over!” Or they’re selling more widgets: “Our shopping cart is slowing down and we’re seeing users abandon their purchases.” These are real startup growing pains, so what to do?

We like to take a measured approach with these types of challenges, so we thought it would be helpful to run through a hypothetical scenario and see how we work.

Related: Why website speed is crucial to business

Having trouble with scalability? Check out our 5 things toxic to scalability piece.

1. Contract outline

First we talk on the phone, or meet face to face and discuss what’s happening. Do you have one page that’s problematic? Is the website slow during certain hours? Or are you seeing erratic behavior and can’t point to a single source?

From there we outline a course of action, based on:

o talking with team, devs & architects
o reviewing systems first hand
o identifying bottlenecks and trouble spots

With this outline we’ll include an estimate of the number of work days it’ll take to complete. We’ll then send that back to you for review, exchange a deposit and set a start date.

2. Meet team & discuss architecture

Next we’ll meet the team and review the problems in more technical detail. If you’re in NYC we’ll probably make a stop into your offices and have a warm meet & greet. If you’re located further afield we can either meet over a Skype call, or arrange to travel to your location for the start of the engagement.

3. Measure current throughput

In order to get a sense of the current state of the systems we’ll measure some system metrics. This could be load average or queries per second or other MySQL internal metrics. We’ll also look at some business metrics such as speed of an ecommerce checkout, or a speed test on a particularly slow page.

These metrics are designed to create a baseline of where things are before any changes are made.
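As a rough illustration of a baseline measurement, here is a minimal Python sketch that turns two readings of a cumulative counter into a per-second rate. The `Sample` class, the function name and the numbers are hypothetical. In practice the counter might come from something like MySQL's `Questions` status variable, but how you collect it is up to you.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a monotonically increasing counter."""
    taken_at: float  # seconds since some reference point
    counter: int     # e.g. a cumulative query count


def rate_per_second(first: Sample, second: Sample) -> float:
    """Average events per second between two samples."""
    elapsed = second.taken_at - first.taken_at
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    return (second.counter - first.counter) / elapsed


# Hypothetical readings taken 60 seconds apart.
baseline_qps = rate_per_second(Sample(0.0, 120_000), Sample(60.0, 150_000))
print(f"baseline: {baseline_qps:.1f} queries/sec")
```

Recording the same calculation before and after any change gives two numbers that can be compared directly.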

[quote]Measuring both business and system metrics before and after changes allows a rough ROI measurement to be done. This goes a long way towards justifying the expense of a performance review, current and future.[/quote]

4. Review systems, configurations & setups

Next we’ll jump on the various systems and review configurations. This includes webservers, caching servers and the database servers as necessary. We’ll review memory settings, important configurations, all the dials and switches.

Along with this we’ll also review development and architecture. Are you using Java with Hibernate, a popular ORM? Or perhaps CakePHP? Are you writing custom SQL code? Are developers up to speed with EXPLAIN and query profiling? For that matter, is code in version control?

Just looking for a DBA? Check out our MySQL Hiring Guide.

5. Report on actionable advice & findings

Perhaps the most essential and useful part of an initial engagement is our overall findings and review report. We’ve found these are very valuable to firms as they speak to a lot of folks up and down the business hierarchy. They speak to management about high level architectural problems and structural or process related challenges. And they can speak well to developers and operations teams as they provide a third party bird’s-eye view of day-to-day activities.

Take a look at a sample report we’ve prepared for Acme StartUp, Inc.

6. Discuss which steps to move on

From here we’ll meet again. In particular we’ll review the actionable advice. Some changes will be low cost, requiring no downtime, while others might require a downtime window. Further medium term changes might require refactoring some code and deploying. Typically the larger longer term architecture changes will also be outlined.

Based on time & costs, we’ll decide together which changes are a priority. Obviously we’ll want to move on low hanging fruit first, and move forward from there.
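To make the idea of low hanging fruit concrete, here is a small Python sketch that orders candidate changes by effort, with no-downtime changes breaking ties. The `Change` class and the example entries are hypothetical, and the real decision is one we make together rather than by sorting a list.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    effort_days: float    # rough estimate of work involved
    needs_downtime: bool  # True if a downtime window is required


def prioritize(changes):
    """Cheapest changes first; among equals, prefer those without downtime."""
    return sorted(changes, key=lambda c: (c.effort_days, c.needs_downtime))


# Hypothetical candidates from a findings report.
candidates = [
    Change("refactor checkout queries", 10.0, False),
    Change("change a memory setting", 0.5, True),
    Change("add missing index", 0.5, False),
]
ordered = [c.name for c in prioritize(candidates)]
print(ordered)
```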

Want to learn more about us? Check out our testimonials and our about page.

7. Take action on agreed changes

Once we’ve decided which changes we’ll make, we’ll schedule downtime windows as needed and make the changes to systems. From there we’ll carefully observe everything to confirm stability and no adverse effects.

8. Measure throughput again

Based on the throughput measurements in #3 above, we’ll perform those same benchmarks again. We’ll check low level system metrics, along with higher level business & user based throughput. Both of these are important as they can provide different perspectives on changes made.

For example, if the system metrics improve markedly but the business or user metrics do not, we know our change had some effect on overall performance, but we likely did not identify the bottleneck that is directly causing the business slowdown.
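The comparison described here can be sketched in a few lines of Python. The function names, the 10% cutoff and the sample numbers are all assumptions for illustration, and both metrics are treated as "lower is better" values such as load average or seconds.

```python
def improvement_pct(before: float, after: float) -> float:
    """Percent improvement for a 'lower is better' metric such as seconds."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100.0


def interpret(system_pct: float, business_pct: float, meaningful: float = 10.0) -> str:
    """Rough reading of a before/after comparison; the 10% cutoff is arbitrary."""
    if system_pct >= meaningful and business_pct >= meaningful:
        return "both improved"
    if system_pct >= meaningful:
        return "system improved, business did not: keep looking for the bottleneck"
    if business_pct >= meaningful:
        return "business improved without a clear system-level explanation"
    return "no meaningful change"


# Hypothetical numbers: load average 8.0 -> 4.0, checkout time 6.0s -> 5.8s.
system_gain = improvement_pct(8.0, 4.0)
business_gain = improvement_pct(6.0, 5.8)
verdict = interpret(system_gain, business_gain)
print(verdict)
```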

9. Summarize findings & performance gain

In the most likely case they both improve markedly, and we can measure the improvements from our entire process of performance review.

This can be helpful in measuring overall return on investment for the engagement. ROI is obviously an important exercise as we want to know that the money is well spent.
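For a back-of-the-envelope figure, ROI is simply the net benefit divided by the cost. The sketch below assumes you can put a rough monetary value on the improvement, which is the hard part. The function name and the dollar amounts are hypothetical.

```python
def rough_roi(benefit: float, cost: float) -> float:
    """Return on investment as a ratio: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


# Hypothetical: the engagement cost 12,000 and recovered 30,000 in sales.
roi = rough_roi(benefit=30_000.0, cost=12_000.0)
print(f"ROI: {roi:.0%}")
```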

10. Document solutions & recommendations

The last step is to document what we did and what we learned. This allows us to carry forward that knowledge and keep applying it to the development and operations process. This allows the business to continue adding value from the engagement even after it’s completed.

Read this far? Grab our newsletter.

10 ways I avoid trouble in database operations

1. Avoid destructive commands

From time to time I’m working with new recruits and bringing them up to speed in operations. The first thing I emphasize is care with destructive commands.

What do I mean here? Well there are all sorts of them. SQL commands such as DROP table & DROP database. But also TRUNCATE and DELETE are all destructive. They’re easy to execute but harder to undo. Think of all the steps it would take to restore from your backup.

If you are logged in as root there are many, many ways to shoot yourself in the foot. I hope you know this, right? rm has options such as -r (recursive) and -f (force) that can be very difficult to step back from. Better to not use the command at all and just move the file or directory you’re working on by renaming it. You can always delete later.
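Here is one way the move-instead-of-delete habit might look in a Python script. It is only a sketch: the `set_aside` helper, the holding directory and the file name are hypothetical, and the demonstration runs inside a throwaway temp directory.

```python
import shutil
import tempfile
import time
from pathlib import Path


def set_aside(path: Path, holding_dir: Path) -> Path:
    """Move path into holding_dir under a timestamped name instead of deleting it."""
    holding_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = holding_dir / f"{path.name}.{stamp}"
    shutil.move(str(path), str(target))
    return target


# Demonstration in a throwaway temp directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    doomed = work / "old_export.sql"
    doomed.write_text("-- hypothetical file\n")
    moved = set_aside(doomed, work / "to_delete_later")
    still_there = moved.exists()
    original_gone = not doomed.exists()
    print(f"moved to {moved.name}")
```

The file is out of the way but still recoverable until you decide to really delete it.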

2. Set your command prompts

When working on the command line, your prompt is crucial. You check it over and over to make sure you’re working on the right box. At the OS, your prompt can tell you if you’re root or not, what directory you’re sitting in, and what’s the hostname of the box. With a few different terminals open, it’s very easy to execute a heavy loading command or destructive command on the wrong box. Check thrice, cut once!

You can set your mysql prompt too. This can provide similar insurance. It can tell you the default database schema you’re set to, and the user you’re logged in as. Hostname or localhost too. It is one more piece in the risk aversion puzzle.
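As a small sketch, a wrapper script could launch the mysql client with an explicit prompt. The `mysql_command` helper, the user and the host below are hypothetical. The `--prompt` option and the `\u` (user), `\h` (host) and `\d` (default database) sequences belong to the standard mysql command line client.

```python
def mysql_command(user: str, host: str) -> list:
    """Build an argv list that starts mysql with a user@host [schema] prompt."""
    prompt = r"\u@\h [\d]> "  # the mysql client expands these sequences itself
    return ["mysql", f"--user={user}", f"--host={host}", f"--prompt={prompt}"]


# Hypothetical user and host; pass the list to subprocess.run to launch it.
argv = mysql_command("appuser", "db-staging-01")
print(argv)
```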

3. Perform backups & test them

I know I know, we’re all doing backups already. Well I sure hope so. But if you’re getting on a system for the first time, it should be your very initial impulse to check and find out what types of backups are being done. If they’re not, you should set them up. I don’t care how big the database is. If it’s an obstacle, you need to sell or educate management on what might happen if. Paint some ugly scenarios. It’s not always easy to see urgency in these things without a good war story or two.

We wrote a guide to using xtrabackup for hot backups. These can be done online even while your production database is serving customers, without table locking or other downtime.
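A first sanity check on backups can be as simple as confirming the latest file exists, is not empty and is recent. The Python sketch below does just that. The `backup_problems` helper, the file name and the 24 hour limit are hypothetical, and a check like this is no substitute for actually restoring the backup somewhere and looking at the data.

```python
import os
import tempfile
import time
from pathlib import Path


def backup_problems(backup: Path, max_age_hours: float) -> list:
    """Return a list of human-readable problems; empty means the file looks sane."""
    if not backup.exists():
        return ["backup file is missing"]
    problems = []
    stat = backup.stat()
    if stat.st_size == 0:
        problems.append("backup file is empty")
    age_hours = (time.time() - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup is {age_hours:.1f} hours old")
    return problems


# Demonstration with a throwaway file standing in for a real backup.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "nightly.sql.gz"
    backup.write_bytes(b"hypothetical backup contents")
    fresh_report = backup_problems(backup, max_age_hours=24)

    # Pretend the file was written three days ago.
    three_days_ago = time.time() - 3 * 24 * 3600
    os.utime(backup, (three_days_ago, three_days_ago))
    stale_report = backup_problems(backup, max_age_hours=24)

print(fresh_report, stale_report)
```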

4. Stay off production machines

This may sound funny to some of you, but I live by it. If it ain’t broke, don’t go and try to fix it! You don’t need to be on all these boxes all the time. That goes for other folks too. Don’t give devs access to every production box. Too many hands in the pie so to speak. Also limit root users. But again if those systems are running well, you don’t have to login to them and poke around every five minutes. This just brings more chances for operator error.

5. Avoid change as much as possible

This one might sound controversial but it’s saved me more than once.

I worked at one firm a few years back managing the MySQL servers. The Oracle DBA was going on vacation for a few weeks so I was picking up the reins for a bit. I met with the DBA for some brain dump sessions, and he outlined the main things that can and do go wrong. He also asked that I avoid any table alterations.

Sure enough ten days into his vacation, a problem arose in the application. One page on the site was failing silently. There was a missing field which needed to be added. I resisted. A fight ensued. Suddenly a lot of money was at stake if this change wasn’t pushed through. I continued to resist. I explained that if such a change were not done correctly, it very likely would break replication, pushing a domino of other things to break and causing an unpredictable mess.

I also knew I only had to hold on for a few more days. The resident DBA would be returning and he could juggle the change. You see, Oracle was set up to use multi-master replication, so those changes needed to go through a rather complex process to be applied. Done incorrectly, the damage would have taken days to clean up and caused much more financial damage.

The DBA was very thankful for my resistance, and management somewhat magically found a solution to the application & edit problem.

Push back is very important sometimes.

[quote]
Many of these ten tips are great characteristics to select for in the DBA hiring process. If you’re a candidate, emphasize your caution and track record with uptime. If you’re a manager, ask candidates about how they handle these situations. We wrote a MySQL DBA hiring guide too.

[/quote]

6. Monitor important things

You should monitor your OS syslog and MySQL error log for starters. Also watch your slow query log for new activity, analyze it and send the reports along to devs. Monitor your partitions; you don’t ever want disks to fill up. Monitor load average, and have a check that the database login or some other simple transaction can succeed. You can even monitor your backups to make sure they complete without error. Use your judgement to decide what checks satisfy these requirements.
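Two of these checks, disk space and load average, are easy to sketch in Python with the standard library. The function names and thresholds below are hypothetical starting points rather than recommendations, and a real setup would feed results into whatever alerting you already use.

```python
import os
import shutil


def disk_alert(used_bytes: int, total_bytes: int, limit_pct: float = 90.0) -> bool:
    """True when the partition is at or above limit_pct full."""
    return used_bytes / total_bytes * 100.0 >= limit_pct


def load_alert(load_1min: float, cpu_count: int, per_cpu_limit: float = 2.0) -> bool:
    """True when the 1-minute load average exceeds per_cpu_limit per CPU."""
    return load_1min > cpu_count * per_cpu_limit


# Check the root partition of the machine this runs on.
usage = shutil.disk_usage("/")
print("disk alert:", disk_alert(usage.used, usage.total))

# os.getloadavg is only available on Unix-like systems.
if hasattr(os, "getloadavg"):
    load_1min, _, _ = os.getloadavg()
    print("load alert:", load_alert(load_1min, os.cpu_count() or 1))
```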

7. Use one or more slaves & checksum

MySQL slave databases are a great way to provide insurance. You can use a lagging slave to provide insurance against operator error, or one of those destructive commands we mentioned above. Have it lag a few hours behind so you’ll have that much insurance. At night this slave may be fresh enough to use for backups.

Also, since MySQL replication is often statement based, data can get out of sync over time. Those problems may or may not flag errors. So use a tool to compare your master and slave for data consistency. We wrote a howto on using checksums to do just that.
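To show the idea behind a checksum comparison, here is a toy Python sketch that hashes the rows of a table on each side and compares the results. The rows are hypothetical in-memory tuples. Dedicated tools work very differently in practice, chunking large tables and running against live servers, so treat this purely as an illustration of the principle.

```python
import hashlib


def table_checksum(rows) -> str:
    """Order-independent SHA-256 digest over a table's rows."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical table contents; the slave has drifted on one row.
master_rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]
slave_rows = [(3, "carol", 41), (1, "alice", 30), (2, "bob", 26)]

in_sync = table_checksum(master_rows) == table_checksum(slave_rows)
print("in sync" if in_sync else "master and slave differ")
```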

8. Be very careful of automatic failover

Automation is wonderful when it works. We dream of a data center that works like clockwork, with robots that never sleep. We can work towards this ideal, and in some cases get close. But it’s important to also understand that failure is by nature *not* what we predicted. The myriad ways that complex systems can fail boggle the mind, and surprise even seasoned veterans of operations. So maintain a healthy suspicion of this type of automation. Understand that if you automate things to happen in this crucial time, you can potentially put yourself in an even *more* compromised position than simply failing.

Sometimes monitoring, alerting, and manual intervention are the more prudent path. Your mileage may vary of course.

9. Be paranoid

It takes many years of doing ops to realize you can never be paranoid enough. Already checked that you’re on the right host, and about to execute some command? Quit the shell prompt and check again. Go back and ask the team if that table really needs to be dropped. Try to rephrase what you’re about to do in different words. Email out again to the team and wait some time before you pull the trigger. Check one more time that you have a fresh backup.

Delay that destructive command as long as you possibly can.
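One small guard along these lines is to make a script refuse to proceed unless the operator retypes the intended hostname and it matches the box the script is actually running on. The sketch below is a minimal version of that idea. The `safe_to_run` helper and the hostnames are hypothetical.

```python
import socket


def safe_to_run(expected_host: str, typed_host: str, actual_host: str = "") -> bool:
    """True only if the intended, typed and actual hostnames all agree."""
    actual = actual_host or socket.gethostname()
    return typed_host == expected_host == actual


# Hypothetical case: we meant staging and typed staging, but this shell is on prod.
# In a real script typed_host would come from input("Type the hostname: ").
ok = safe_to_run("db-staging-01", "db-staging-01", actual_host="db-prod-01")
print("proceed" if ok else "stop: hostname mismatch")
```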

10. Keep it simple

I know I know, we all want to use that new command or tool, or jump on the latest hardware and take it for a spin. We want to build beautiful architectures that perform great feats of magic. But the fewer moving parts, the fewer things that can go wrong. And in ops, your job is stability and availability. Can you avoid using multi-master replication and go with just basic master-slave replication in MySQL? That’s simpler. Can you have fewer schemas or fewer filter rules? Can you skip the complicated HA layer, and use monitoring and manual failover?

Hiring is a numbers game


On a recent Twitter chat (#hfchat) I posted some comments about hiring. Some folks were complaining that they had applied to various jobs, and not heard back.

I commented…

[quote]Apply for a job and don’t hear back, it’s nothing personal[/quote]

In today’s market, there are hundreds of job applicants for every position. Sad to say, but that means things become a blur after a while. There’s less chance to sift each candidate and find out who they really are or what they really know. It’s more about keywords, and buzzwords if you must, to get your foot in the door.

But there is a flip side to this coin, which I think many job seekers forget sometimes.

[quote]job seekers: apply to enough positions so that you forget to followup sometimes.[/quote]

Imagine that: you’ve applied to so many positions, and heard back from a bunch, that you take it a bit for granted that you’ll surely hear back from others.

We might also argue that to some degree, especially early on when you are building your reputation, this numbers game is at play in consulting too. The more people you get in front of the more you’ll practice honing your message. At the same time more people will find out about you, and talk about you. We have a consulting 101 guide we know you’ll enjoy.

If I were to offer a few other nuggets of advice I’d suggest:

o Hone your resume for keywords and search
o Test your LinkedIn profile – search those keywords
o Edit your cover letter to be short & punchy!
o Throw in some buzzwords – a little rockstar this and agile that!

Looking for specific advice for tech jobs? We wrote a hiring guide for a MySQL DBA. These are equally helpful to job candidates, and those who are interviewing them. Anyone know why operations & MySQL DBAs are so hard to find these days?

Why you should attend Percona Live 2012

What I loved about Percona Live 2011

Last year I was excited to go to Percona Live for the first time in NYC. I arrived just in time to hear Harrison Fisk from Facebook speak about some of the awesome tweaks they’re running with MySQL there. It’s not every day that you get to hear from top MySQL engineers how they’re using the technology and what their biggest challenges are. If they can make MySQL hum, so can the rest of us!

Afterward, outside in the foyer, I ran into all sorts of luminaries in the MySQL space. Percona folks like Peter Zaitsev & Vadim Tkachenko, plus other big names like Baron Schwartz, Harrison, and Ronald Bradford. I ran into people from firms like Yahoo, Google, Daniweb, Pythian, SkySQL & Palomino.

You might also like our Setup MySQL Replication with Hotbackups as well as How to deploy MySQL on Amazon EC2 servers articles.

What to expect at Percona Live NYC 2012

This year’s event next month features rockstar engineers from an incredible lineup of firms including Etsy, New Relic, Youtube, Paypal, Tumblr, SugarCRM, Square, and of course a few from Percona themselves. I promise you this, these talks won’t be salesy or in any way a waste of your time and money. They will be thoroughly technical talks, with cutting edge insights and advice from those in the trenches using the technology every day.

If I weren’t heading to Oracle Open World for the publishers seminar & MySQL Connect, I would most certainly be there. In fact I had originally been slated to talk about point-in-time recovery in MySQL. Oh well, I’m sure I’ll catch you at Percona Live in April 2013.

If you do decide to attend please enjoy a 15% discount with code “SeanHull”!

Looking to hire top MySQL talent? Check out our MySQL DBA Hiring Guide with advice for managers, recruiters, and candidates too! We also have an enduringly popular article about the mythical MySQL DBA and why they’re hard to find.

Also if you’ve read this far, please grab my newsletter scalable startups.

Upcoming for Scalable Startups

Just back from the Labor Day holiday, and ready to dive back in.

I thought this would be a great time to outline some of our upcoming topics so here goes…

1. Why Oracle usability sucks

– a rant about Oracle’s weak points

In the meantime take a peek at our piece on why we wrote the book on Oracle & Open Source. We ruminate on trends in the datacenter and take a stab at Oracle’s future.

2. Why relational databases don’t scale

– Is there any such thing as automatic scalability?
– What blocks scalability?
– Are NoSQL databases magic?

Also one of our articles that went viral – 5 things toxic to scalability

3. Eternal tension between dev & operations

– origin in different job roles & priorities
– balance found in each appreciating the other’s point of view
– hiring the best or building the right culture

You might enjoy a wildly popular piece we wrote a few months back How to hire a developer that doesn’t suck.

4. MySQL Query Tuning Cheatsheet

– SQL queries are hellish to tune, that we know.
– An outline of some of the common patterns that don’t work will help you identify and avoid them.

5. Differentiation in professional services

– A commodity does not stand out in the services business
– Differentiation is about personality, relationships & how you solve problems for the business

Also in the meantime take a look at our professional services 101 guide.
