Category Archives: Scalability

3 Ways to Boost Cloud Scalability

Deploying in the Amazon cloud is touted as a great way to achieve high scalability while paying only for the computing power you use. How do you get the best scalability from the technology?

5 Ways to Boost MySQL Scalability

There are a lot of scalability challenges we see with clients over and over. The list could easily include 20, 50 or even 100 items, but we've narrowed it down to the five biggest issues we see.

1. Tune those queries

By far the biggest bang for your buck is query optimization. Queries can be functionally correct and meet business requirements without ever being stress tested for high traffic and high load. That's why we so often see clients hit growing pains and scalability challenges as their site becomes more popular. It also makes sense: it wouldn't necessarily be a good use of time to tune a query for some page off in a remote corner of your site that doesn't receive real-world traffic. So some amount of reactive tuning is common and appropriate.

Enable the slow query log and watch it. Use mk-query-digest, the great tool from Maatkit, to analyze the log. Also make sure the log_queries_not_using_indexes flag is set. Once you've found a heavy, resource-intensive query, optimize it! Use the EXPLAIN facility, use a profiler, look at index usage and create missing indexes, and understand how the query is joining and/or sorting.
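As a rough sketch of that workflow (the table, column and file names below are made up for illustration, and the variable names assume MySQL 5.1 or later):

  -- from a mysql prompt: turn on slow query logging, including unindexed queries
  SET GLOBAL slow_query_log = 1;
  SET GLOBAL long_query_time = 1;
  SET GLOBAL log_queries_not_using_indexes = 1;

  # from the shell: digest the log to surface the heaviest queries
  mk-query-digest /var/log/mysql/mysql-slow.log > slow-digest.txt

  -- back at the mysql prompt: explain the worst offender and check its index usage
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42\G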

Also: Why generalists are better at scaling the web

2. Employ Master-Master Replication

Master-master active-passive replication, otherwise known as circular replication, can be a boon for high availability, but also for scalability. That's because you immediately have a read-only slave for your application to hit as well. Many web applications exhibit an 80/20 split, where 80% of activity is reads or SELECTs and the remainder is INSERTs and UPDATEs. Configure your application to send read traffic to the slave, or rearchitect so this is possible. This type of horizontal scalability can then be extended further, adding additional read-only slaves to the infrastructure as necessary.

If you’re setting up replication for the first time, we recommend you do it using hotbackups. Here’s how.

Keep in mind that MySQL replication has a tendency to drift out of sync with the master, often silently. Data can get badly out of sync without throwing any errors! Be sure to bulletproof your setup with checksums.
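For reference, here is a minimal sketch of pointing a freshly restored slave at its master; the host, credentials and binlog coordinates are placeholders you would take from your hotbackup:

  -- on the slave, after restoring the hotbackup
  CHANGE MASTER TO
    MASTER_HOST='master.example.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=4;
  START SLAVE;

  -- confirm Slave_IO_Running, Slave_SQL_Running and Seconds_Behind_Master
  SHOW SLAVE STATUS\G

From there, run Maatkit's mk-table-checksum (or a similar checksum tool) on a schedule so silent drift shows up before your users notice it.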

Related: Why you can’t find a MySQL DBA

3. Use Your Memory

It sounds very basic and straightforward, yet the details are often overlooked. At minimum be sure to set these (a sample my.cnf sketch follows the list):

  • innodb_buffer_pool_size
  • key_buffer_size (MyISAM index caching)
  • query_cache_size – though beware of issues on large SMP boxes
  • thread_cache_size & table_open_cache (table_cache in older MySQL releases)
  • innodb_log_file_size & innodb_log_buffer_size
  • sort_buffer_size, join_buffer_size, read_buffer_size, read_rnd_buffer_size
  • tmp_table_size & max_heap_table_size
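Here's a hypothetical my.cnf fragment for a mostly-InnoDB server with, say, 16GB of RAM; every number below is a placeholder to benchmark against your own workload, not a recommendation:

  [mysqld]
  innodb_buffer_pool_size = 12G      # the big one on an InnoDB-heavy box
  key_buffer_size         = 256M     # only caches MyISAM indexes
  query_cache_size        = 64M      # consider 0 on large SMP boxes
  thread_cache_size       = 64
  table_open_cache        = 2048     # table_cache in older releases
  innodb_log_file_size    = 256M     # changing this needs a clean shutdown (and removing old ib_logfile* on older versions)
  innodb_log_buffer_size  = 16M
  sort_buffer_size        = 2M       # per-connection buffers: keep these modest
  join_buffer_size        = 2M
  read_buffer_size        = 1M
  read_rnd_buffer_size    = 2M
  tmp_table_size          = 64M
  max_heap_table_size     = 64M      # keep equal to tmp_table_size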

Read: Why Twitter made a shocking admission about their data centers in the IPO

4. RAID Your Disk I/O

What is underneath your database?  You don't know?  Please find out!  Are you using RAID 5?  That's a big performance hit: RAID 5 is slow for inserts and updates, and it is almost non-functional during a rebuild if you lose a disk, with very, very slow performance.  What should you use instead?  RAID 10, mirroring plus striping, with as many disks as you can fit in your server or RAID cabinet.  A database does a lot of disk I/O even if you have enough memory to hold the entire database.  Why?  Sorting requires rearranging rows, as do GROUP BY, joins, and so forth.  Plus the transaction log is disk I/O as well!

Are you running on EC2?  In that case EBS is already fault tolerant and redundant, so give your performance a boost by striping only (RAID 0) across a number of EBS volumes using Linux md software RAID.
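A hedged sketch of what that striping might look like; the device names and filesystem are examples only, and you'd size the volume count to your I/O needs:

  # stripe four EBS volumes into a single RAID 0 (striping-only) device
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  mkfs.xfs /dev/md0                  # xfs is just one common choice here
  mount /dev/md0 /var/lib/mysql      # or wherever your datadir lives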

Also check out our Intro to EC2 Cloud Deployments.

Also of interest: autoscaling MySQL on EC2.

Also: Why startups are trying to do without techops and failing

5. Tune Key Parameters

These additional parameters can also help a lot with performance.

innodb_flush_log_at_trx_commit=2

This speeds up inserts and updates dramatically by being a little lazy about flushing the InnoDB log buffer: the log is written at each commit but only flushed to disk about once per second, so you risk losing up to a second of transactions if the OS crashes.  Do your own research, but for most environments this setting is recommended.

innodb_file_per_table

InnoDB was developed, like Oracle, around the tablespace model for storage.  Unfortunately the default of a single shared tablespace turns out to be a performance bottleneck, with contention for file descriptors and so forth.  This setting makes InnoDB create a separate tablespace and underlying datafile for each table, just as MyISAM does.
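Both settings live in my.cnf; a minimal sketch:

  [mysqld]
  # write the log at each commit, flush to disk roughly once a second
  innodb_flush_log_at_trx_commit = 2
  # one tablespace and datafile per table instead of one shared ibdata file
  innodb_file_per_table

Keep in mind that innodb_file_per_table only affects tables created (or rebuilt with ALTER TABLE ... ENGINE=InnoDB) after the setting is enabled; existing tables stay in the shared tablespace until then.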

Read this: Why a four letter word still divides dev and ops

Made it to the end eh?!?! Grab our newsletter.

Migrating to the Cloud – Why and why not?

A lot of technical forums and discussions have highlighted the limitations of EC2 and how it loses on performance compared to physical servers of equal cost.  They argue that you can get much more hardware and bigger iron for the same money, so it seems foolhardy to turn to the cloud.  Why this mad rush to the cloud, then?  If all you're looking at is performance, it might indeed seem odd.  But another way of looking at it: if performance is not as good, then performance is clearly not the driving factor behind cloud adoption.

CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”

Another question, “Is it a good idea to deploy your database in the cloud?”  It depends!  Let’s take a look at some of the strengths and weaknesses, then you decide.

8 big strengths of the cloud

  1. Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
  2. Easier roll out of patches and upgrades
  3. Reduced operational headache – scripting and automation becomes central
  4. Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
  5. Low initial investment
  6. Auto-scaling – set thresholds and deploy new capacity automatically
  7. Easy compromise response – take the compromised server offline and spin up a new one
  8. Easy setup of dev, qa & test environments

Some challenges with deploying in the cloud

  1. Big cultural shift in how operations is done
  2. Lower SLAs and less reliable virtual servers – mitigate with automation
  3. No perimeter security – new model for managing & locking down servers
  4. Where is my data?  — concerns over compliance and privacy
  5. Variable disk performance – can be problematic for MySQL databases
  6. New procurement process can be a hurdle

Many of these challenges can be mitigated.  The promise of infrastructure deployed in the cloud is huge, so easing in with gradual adoption is perhaps the best option for many firms.  Mitigate the weaknesses of the cloud by:

  • Using encrypted filesystems and backups where necessary
  • Keeping offsite backups in-house or at an alternate cloud provider
  • Caching at every layer of your application stack to smooth out variable EBS performance
  • Employing configuration management & automation tools such as Puppet & Chef

Quora discussion – Why or why not to migrate to the cloud?

Semi-Synchronous Replication – What is it and why is it important?

Replication in MySQL allows you to copy and replay changes from your primary database to an alternate backup or slave database.  This facility in MySQL is asynchronous, which means changes are not applied on the slave at the moment they occur on the primary.  They could be applied a second later, or minutes later.  In fact, the secondary database can sometimes get bogged down by heavy load because transactions are applied there serially, while they execute in parallel sessions on production. You can find out how far behind the master you are with SHOW SLAVE STATUS, looking at:

Seconds_Behind_Master: 8

If you are sending SELECT or read traffic from your website to the slave database, you may serve stale reads.  For instance, if you comment on a blog post and refresh the page within 8 seconds on the server above, the page would not display the comment you just posted!

As it turns out, the Maatkit toolkit has a tool called mk-slave-prefetch which can help with slow slave performance.  Since most of the work of applying inserts, updates and deletes involves fetching the right rows, running a similar SELECT query ahead of the actual transaction warms up the caches and can speed things up dramatically; that may be enough for your needs.  Test it first and find out.

Semi-synchronous replication comes to the rescue if you really need this type of guarantee, but it comes at a cost.  You enable it on the master, then on the slave, and restart the slave.  Whenever the master commits a transaction, it blocks until one of two things happens: it either gets an acknowledgement from at least one slave that the transaction has been received and logged downstream (not necessarily applied yet), or it reaches the timeout threshold.

This type of arrangement may sound fine in theory, since the blocking would often last less than a second.  However, in the microscopic world of high-speed, high-transaction, high-traffic websites, this can be an eternity, and one that can slow the database down substantially.  So test first before assuming it's a solution that will help you.
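For reference, enabling it looks roughly like this on MySQL 5.5 or later, where semi-synchronous replication ships as a plugin; the 1000ms timeout is just an example value:

  -- on the master
  INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
  SET GLOBAL rpl_semi_sync_master_enabled = 1;
  SET GLOBAL rpl_semi_sync_master_timeout = 1000;   -- milliseconds before falling back to async

  -- on the slave
  INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
  SET GLOBAL rpl_semi_sync_slave_enabled = 1;
  STOP SLAVE IO_THREAD;
  START SLAVE IO_THREAD;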

Quora discussion of Semi-synchronous Replication

Sharding – What is it and why is it important?

Sharding is a way of partitioning your datastore to benefit from the computing power of more than one server.  For instance many web-facing databases get sharded on user_id, the unique serial number your application assigns to each user on the website.

Sharding can bring you the advantages of horizontal scalability by dividing up data into multiple backend databases.  This can bring tremendous speedups and performance improvements.
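As a trivial illustration of modulo-based shard routing (the shard count and user_id here are made up), the application computes which backend holds a given user's rows before it ever opens a connection:

  -- with four shards, user_id modulo 4 picks the backend database
  SELECT 100003 % 4 AS shard_number;   -- returns 3, so user 100003 lives on shard db3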

Sharding, however, has a number of important costs:

  • reduced availability
  • higher administrative complexity
  • greater application complexity

High availability is a goal of most web applications, as they aim for always-on, 24x7x365 availability.  By introducing more servers, you have more components that all have to work flawlessly.  If the expected downtime of any one backend database is half an hour per month and you shard across five servers, your expected downtime has now increased by a factor of five, to 2.5 hours per month.

Administrative complexity is an important consideration as well.  More databases mean more servers to back up, more complex recovery, more complex testing, more complex replication and more complex data integrity checking.

Since sharding spreads chunks of your data across various different servers, your application must take on the burden of deciding where each piece of data lives and fetching it from there.  In some cases the application must fall back to alternate decisions if it cannot find the data where it expects it.  All of this increases application complexity and is important to keep in mind.

Sean Hull asks on Quora – What is Sharding and why is it important?

Capacity Planning – What is it and why is it important?

Look at your website's current traffic patterns, pageviews or visits per day, and compare that to your server infrastructure. In a nutshell, your current capacity measures the ceiling your traffic could grow to while still being supported by your current servers. Think of it as the horsepower of your application stack – load balancer, caching server, webserver and database.

Capacity planning seeks to estimate when you will reach the limits of your current infrastructure, through load testing and stress testing. With traditional servers, you estimate how many months you will be comfortable with the currently provisioned servers, and plan to bring new ones online and into rotation before you reach that traffic ceiling.

Your reaction to capacity and seasonal traffic variations becomes much more nimble with cloud computing solutions, as you can script server spinups to match capacity and growth needs. In fact you can implement auto-scaling as well, setting rules and thresholds to bring additional capacity online – or offline – automatically as traffic dictates.

In order to do proper capacity planning, you need good data. Pageviews and visits per day can come from your analytics package, but you'll also need more detailed metrics on what your servers are doing over time. Packages like Cacti, Munin, Ganglia, OpenNMS or Zenoss can provide very useful data collection with very little overhead on the server. With these in place, you can view load average, memory and disk usage, and database or webserver threads, and correlate all of that data back to your application. What's more, with time-based data and graphs, you can compare those graphs against application change management and deployment data to determine how new code rollouts affect capacity requirements.

Sean Hull asks about Capacity Planning on Quora.

Scalability – What is it and why is it important?

Scaling comes in a few different flavors.  Vertical scaling involves growing the computing power of a single server by adding memory, faster or more CPUs, and/or faster disk I/O.

Horizontal scaling involves adding additional computing resources or servers in parallel and then load balancing across them.

Scalability refers to how readily an application facilitates such scaling.  With web applications, the middle tier, aka the webservers, is fairly easy to scale horizontally, and most enterprise-class applications already do this with commercial load balancers – either hardware or software.

Doing the same with the database tier, however, can be trickier.  Enter MySQL replication, which facilitates fairly painless horizontal scalability.  Build your application architecture with read-only transactions and write/update transactions segmented apart, and you can send the latter to one master database and the former to a handful of replicated slaves.  With a typical web application that is less than 10% writes and 90% reads, there is the potential to add as many as 5-10 servers horizontally and increase application throughput by as much as 500-1000%.

Sean Hull asks on Quora: What is scalability and why is it important?

5 Tips for Scalability

Your website is slow but you're not sure why.  You do know that it's impacting your business.  Are you losing customers to the competition? Here are five quick tips for achieving scalability.

1. Gather Intelligence

With any detective work you need information.  That’s where intelligence comes in.  If you don’t have the right data already, install monitoring and trending systems such as Cacti and Collectd.  That way you can look at where your systems have been and where they’re going.

2. Identify Bottlenecks

Put all that information to use in your investigation.  Use stress testing tools to hit areas of the application and identify which ones are most troublesome.  Some pages get hit A LOT, such as the login page, so slowness there is more serious than on one small report that gets hit by only a few users.  Work on the biggest culprits first to get the best bang for your buck.

3. Smooth Out the Wrinkles

Reconfigure your webservers to make more connections to your database, or spin up more servers.  On the database tier make sure you have fast RAIDed disk and lots of memory.  Tune the queries coming from your application, and look at possible upgrades to your servers.

4. Be Agile But Plan for the Future

Can your webserver tier scale horizontally?  It's pretty easy to add more servers under a load balancer.  How about your database?  Chances are that with a little work and some HA magic your database can scale out with more servers too, moving the bulk of SELECT operations to read-only copies of your primary server while letting the primary focus on transactions and data updates.  Be ready and tested so you know exactly how to add servers without impacting customers or the application.  Don't know how?  Look at the big guys like Facebook, and investigate how they're doing it.

5. A Going Concern

Most importantly, just like your business, your technology infrastructure is an ongoing work in progress.  Stay proactive with monitoring, analysis, trending, and vigilance.  Watch application changes, and filter for slow queries.  Have new or additional hardware dynamically at the ready for when you need it.