The truth that is half right. True there are a lot of moving parts, and a lot to setup. The internet stack made up of Linux, Apache, MySQL & PHP. LAMP as it’s called, was built to be resilient, dynamic, and scalable. It’s essentially why Amazon works. Why what they’re doing is possible. Windows & .NET for example don’t scale well. Strange to see Oracle mating with them, but I digress…
Linux and LAMP that is built on top of it, are highly scalable and dynamic to begin with.
The answer is possibly, but 9 times out of 10 what really needs to happen is queries must be tuned.
In 17 years of consulting that is the single largest cause of scalability problems. Fix those queries and your problems are over.
The Elephant in the Room – Query Tuning
I was talking with a colleague today at AppNexus. He said, so should we do some of that work inside the application, instead of doing a huge UNION or a large JOIN? I said yes you can move work onto the application, but it makes the application more complex. On the flip side the webserver tier is easier to scale. So there are tradeoffs.
I said this:
By and large, if scalability is our goal, we should work to quiet the activity in the slow query log. This is an active project for developers & DBAs. Keep it quiet and your server will run well.
Yet I still talk to teams where this is mysterious. It’s unclear. There’s no conviction there. And that’s where I think DBAs are failing. Because this is our subject matter expertise, and if we haven’t convinced developer teams of this, we’re not working together enough. API teams aren’t separate from DBA and operations. Siloing technology departments is a killer…
As you roll out new code, if some queries show up, then those need attention. Tweak the code until the queries drop out. This is the primary project of scalability.
When should I think about upgrading hardware?
If your code is stable, but you’re seeing a steady line rising on load average of the server, *THEN* go up in hardware. Load average means cpu & disk are being taxed. The server can’t keep up.
I close with a final point. Devops means bring dev & ops together! Don’t silo them off in different wings. Communicate. DBAs it’s your job to educate Developers about scalability and help with query tuning. Devs, profile new SQL code, test with large datasets & for god sakes don’t use an ORM – it’s one of 5 things toxic to scalability. Run explain and be sure to index all the right columns.
Lots and lots of web applications need to page through information. From customer records, to the albums in your itunes collection. So as web developers and architects, it’s important that we do all this efficiently.
Start by looking at how you’re fetching information from your MySQL database. We’ve outlined three ways to do just that.
Of course such a solution would only work if you were paging by ID. If you page by name, it might get messier as there may be more than one person with the same name. If ID doesn’t work for your application, perhaps returning paged users by USERNAME might work. Those would be unique:
SELECT id, username
WHERE username > ‘firstname.lastname@example.org’
ORDER BY username LIMIT 10;
Paging queries can be slow with SQL as they often involve the OFFSET keyword which instructs the server you only want a subset. However it typically scans collects and then discards those rows first. With deferred join or by maintaining a place or position column you can avoid this, and speedup your database dramatically.
2. Try using a Deferred Join
This is an interesting trick. Suppose you have pages of customers. Each page displays ten customers. The query will use LIMIT to get ten records, and OFFSET to skip all the previous page results. When you get to the 100th page, it’s doing LIMIT 10 OFFSET 990. So the server has to go and read all those records, then discard them.
SELECT id, name, address, phone FROM customers ORDER BY name LIMIT 10 OFFSET 990;
MySQL is first scanning an index then retrieving rows in the table by primary key id. So it’s doing double lookups and so forth. Turns out you can make this faster with a tricky thing called a deferred join.
The inside piece just uses the primary key. An explain plan shows us “using index” which we love!
ORDER BY name
LIMIT 10 OFFSET 990;
Now combine this using an INNER JOIN to get the ten rows and data you want:
SELECT id, name, address, phone
INNER JOIN (
ORDER BY name
LIMIT 10 OFFSET 990)
AS my_results USING(id);
That’s pretty cool!
3. Maintain a Page or Place column
Another way to trick the optimizer from retrieving rows it doesn’t need is to maintain a column for the page, place or position. Yes you need to update that column whenever you (a) INSERT a row (b) DELETE a row ( c) move a row with UPDATE. This could get messy with page, but a straight place or position might work easier.
SELECT id, name, address, phone
WHERE page = 100
ORDER BY name;
Why does it do this? UNION is defined that way in SQL. Duplicates must be removed and this is an efficient way for the MySQL engine to remove them. Combine results, sort, remove duplicates and return the set.
Queries with UNION can be accelerated in two ways. Switch to UNION ALL or try to push ORDER BY, LIMIT and WHERE conditions inside each subquery. You’ll be glad you did!
What if we did UNION ALL? The result would look like this:
Here the WHERE clause works on this 11 record temp table:
But it would be much faster to move the WHERE inside each subquery like this:
(SELECT type, release FROM short_sleeve WHERE release >=2013)
(SELECT type, release FROM long_sleeve WHERE release >=2013);
That would be operating on a combined 3 record table. Faster to sort & remove duplicates. Smaller result sets cache better too, providing a pay forward dividend. That’s what performance optimization is all about!
Remember multi-million row sets in each part of this query will quickly illustrate the optimization. We’re using very small results to make visualizing easier.
You can also use this optimization for ORDER BY and for LIMIT conditions. By reducing the number of records returned by EACH PART of the UNION, you reduce the work that happens at the stage where they are all combined.
If you’re seeing some UNION queries in your slow query log, I suggest you try this optimization out and see if you can tweak
A relational database is the warehouse of your data. Your crown jewels. It’s your excel spreadsheet or filing cabinet writ large. You use them everyday and may not know it. Your smartphone stores it’s contact database in a relational database, most likely sqlite – the ever present but ever invisible embedded database platform. Your online bank at Citibank or Chase stores all your financial history, statements, contact info, personal data and so forth, all in a relational database.
organized around records
data points are columns in a table
relationships are enforced with constraints
indexing data brings hi-speed access
SQL is used to get data in and out of the db
triggers, views, stored procs & materialized views may also be supported
Like excel, relational databases are organized around records. A record is like a 3×5 card with a number of different data points on it. Say you have 3×5 cards for your addressbook. Each card holds one address, phone number, email, picture, notes and so forth. By organizing things nicely on cards, and for each card predictable fields such as first name, last name, birthday etc, you can then search on those data points. Want all the people in your addressbook with birthday of July 5th, no problem.
While relational databases have great advantages, they require a lot of work to get all of your information into neatly organized files. What’s more the method for getting things into and out of them – SQL is a quirky and not very friendly language. What’s more relational databases have trouble clustering, and scaling horizontally. NOSql database have made some headway in these departments, but at costs to consistency and reliability of data.
As servers continue to get larger, it becomes rarer that a single web-facing database really needs more than one single server. If it’s tuned right, that is. Going forward and looking to the future, the landscape will probably continue to be populated by a mix of traditional relational databases, new nosql type databases, key-value stores, and other new technologies yet to be dreamed up.
Queries are so-named because of the lovely language we call SQL – simplified query language. That’s a bit of sarcasm on my part, I’ve never found it particularly simple or elegant. Profiling them involves finding how they are spending their time, and what they are asking the server to do. In this way you can make them faster, and improve performance of the whole server, and thus your website.
At any rate queries ask the database for information. If they are on the simpler side, something like give me all the users whose name starts with “hu” for example, and last name is indexed, that will run very fast. The database will lookup in the index the last name field, and find the subset of ones starting with those letters, then go lookup the records by id and return them to you. Cache the index blocks, cache the data blocks. Great! However, say those are customers, and you want their calling cellphone calling history. Ok, now you have to join on another table, matching by some key, hold all those records in memory, shuffle them around, and so on.
So queries are effectively little jobs or bits of work you ask your database server to perform on your behalf. With websites you typically have hundreds of concurrently running sessions all sending their own little SQL jobs to the server to get processed, and records returned. And blip in the radar slows everyone down, so you want them to all run quickly.
That’s where profiling comes in. MySQL, Oracle, and SQL Server alike all have EXPLAIN type facilities for showing the plan with which the database will use to fetch your data. It shows indexes, sorting, joins, order, and so forth. All of this contributes to the overall execution time and resources used on the database server.
Object Relational Mappers or ORMs are a layer of software that sits between web developers and the database backend. For instance if you’re using Ruby as your web development language, you’ll interact with MySQL through an ORM layer called ActiveRecord. If you’re using Java, you may be fond of the ORM called Hibernate.
ORMs have been controversial because they expose two very different perspectives to software development. On the one hand we have developers who are tasked with building applications, fulfilling business requirements, and satisfying functional requirements in a finite amount of time. On the other hand we have operations teams which are tasked with managing resources, supporting applications, and maintaining uptime and availability.
Often these goals are opposing. As many in the devops movement have pointed out, these teams don’t always work together keeping common goals in mind. How does this play into the discussion of ORMs?
Relational databases are a technology developed in the 70’s that use an arcane language called SQL to move data in and out of them. Advocates of ORMs would argue rightly so, that SQL is cumbersome and difficult to write, and that having a layer of software which helps you in this task is a great benefit. To be sure it definitely helps the development effort, as software designers, architects and coders can focus more of their efforts on functional requirements and less on arcane minutiae of SQL.
Problems come when you bump up against scalability challenges. The operations team is often tasked with supporting performance requirements. Although this can often mean providing sufficient servers, disk, memory & cpu resources to support an application, it also means tuning the application. Adding hardware can bring you 2x or 5x improvement. Tuning an application can bring 10x or 100x improvement. Inevitably this involves query tuning.
That’s where ORMs become problematic, as they don’t promote tweaking of queries. They are a layer or buffer to keep query writing out of sight.
In our experience as performance and scalability experts for the past fifteen years, query tuning is the single biggest thing you can do to improve your web application. Furthermore some of the most challenging and troublesome applications we’ve been asked to tune have been built on top of ORMs like Hibernate.
SQL is a difficult acronym for a difficult language, but what it does is shuttle information into and out of your database in an organized manner. Your web applications and developers have to speak it, and your database – whether Oracle, MySQL, Postgres or some other will return information back using this computing dialect.
Since every movement on your website, from page to page (sessions) and purchase to purchase all involve interaction using these queries, writing them well can have a huge impact on your website performance. How big? We’ve fixed queries by adding indexes or rewriting them and seen improvements by as much as 100x. That’s converting pages that take ten seconds to ones that take 1/10 of a second. Be especially vigilant about those queries generated by Object Relational Mappers like Active Record, Ruby’s ORM layer.
There are a lot of forms of transportation, from walking to bike riding, motorcycles and cars to busses, trains and airplanes. Each mode of transport will get you from point a to point b, but one may be faster, or more comfortable and another more cost effective. It’s important to keep in mind when comparing databases like Oracle and MySQL that there are indeed a lot of feature differences, a lot of cultural differences, and a lot of cost differences. There are also a lot of impassioned people on both sides arguing at the tomfoolery of the other. Hopefully we can dispel some of the myths and discuss the topic fairly.
As a long time Oracle DBA turned MySQL expert, I’ve spent time with clients running both database engines and many migrating from one to the other. I can speak to many of the differences between the two environments. I’ll cover the following:
Query & Optimizer Limitations
Replication & HA Are Done Differently
Installation & Administration Simplicity
Watch Out – Triggers, Stored Procedures, Materialized Views & Snapshots