
5 Things Toxic to Scalability


Check out our follow-up post: 5 More Things Deadly to Scalability

If you're using MySQL, check out 5 Ways to Boost MySQL Scalability.

1. Object Relational Mappers

ORMs are popular among developers but not among performance experts.  Why is that?  Primarily because these two engineers experience a web application from entirely different perspectives.  One is building functionality, delivering features, and measuring success by how well the business requirements are met.  Performance and scalability are often low priorities at this stage.  ORMs allow developers to be much more productive, abstracting away the difficulties of writing SQL against the backend datastore and letting them concentrate on building features and functionality.


Scalability is about application, architecture and infrastructure design, and careful management of server components.

On the performance side the picture is a bit different.  By leaving query writing to an ORM, you end up with complex queries that the database cannot optimize well.  What's more, ORMs don't allow easy tweaking of queries, slowing down the tuning process further.
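To make this concrete, here is a sketch of the classic N+1 pattern (table and column names are hypothetical).  A naive ORM loop issues one query per row, while the hand-written equivalent does the same work in a single join the optimizer can plan:

-- what an ORM loop over users can generate: one round trip per row
SELECT * FROM users WHERE account_id = 42;
SELECT * FROM orders WHERE user_id = 101;
SELECT * FROM orders WHERE user_id = 102;
-- ...and so on, one query per user row

-- the hand-written equivalent: a single join
SELECT u.*, o.*
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE u.account_id = 42;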

Also: Is the difference between dev & ops a four-letter word?

2. Synchronous, Serial, Coupled or Locking Processes

Locking in a web application operates something like traffic lights in the real world.  Replacing a traffic light with a traffic circle often speeds up traffic dramatically.  That's because out in the country where there's very little traffic, no one is left waiting idly at a light for no reason.  What's more, even when there's a lot of traffic, a traffic circle keeps things flowing.  If you need locking, better to use InnoDB tables, as they offer granular row-level locking rather than the table-level locking of MyISAM tables.
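To illustrate the difference (table names are hypothetical): with InnoDB, a transaction can lock just the rows it touches, so other sessions keep flowing like the traffic circle, while MyISAM takes a table-level lock on every write, the traffic light:

-- InnoDB: only the row for account 7 is locked, other sessions proceed
START TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 7 FOR UPDATE;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 7;
COMMIT;

-- MyISAM: the same UPDATE would lock the entire accounts table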

Avoid things like semi-synchronous replication that will wait for a message from another node before allowing the code to continue.  Such waits can add up in a highly transactional web application with many thousands of concurrent sessions.

Avoid any type of two-phase commit mechanism, which we see in clustered databases quite often.  Multi-phase commit provides a serialization point so that multiple nodes can agree on what the data looks like, but such serialization points are toxic to scalability.  Better to use technologies that employ an eventually consistent algorithm.

Related: Is automation killing old-school operations?

3. One Copy of Your Database

Without replication, you rely on only one copy of your database.  In this configuration, all of your webservers must use a single backend datastore, which becomes a funnel or bottleneck.  It's like a highway under construction, forcing all the cars to squeeze into one lane.  It's sure to slow things down.  Better to build parallel roads to start with, and allow the application (aka the drivers) to choose alternate routes as schedule and itinerary dictate.

Using MySQL? Check out our howto: Easy Replication Setup with Hotbackups.

Read: Do managers underestimate operational cost?

4. Having No Metrics

Having no metrics in place is toxic to scalability because you can't visualize what is happening on your systems.  Without that visual cue, it is hard to get business units, developers and operations teams all on the same page about scalability issues.  If teams are having trouble grokking this, point out that these tools simply provide analytics for infrastructure.

There are tons of solutions too, many of them using SNMP and non-invasive.  Consider Cacti, Munin, OpenNMS, Ganglia and Zabbix, to name a few.  Metrics collection can include business metrics like user registrations, accounts or widgets sold.  And of course it should also include low-level system metrics such as CPU, memory, disk & network usage, as well as database-level activity: buffer pool, transaction log, locking, sorting, temp tables, and queries per second.
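If you want a feel for the raw data these tools collect from MySQL, a few quick examples from the client (these are standard MySQL status counters):

-- queries per second: sample this counter twice and divide by the interval
SHOW GLOBAL STATUS LIKE 'Questions';
-- buffer pool efficiency: reads satisfied from memory vs. from disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- temp tables spilling to disk, a common sign of unoptimized queries
SHOW GLOBAL STATUS LIKE 'Created_tmp%';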

Also: Are SQL Databases dead?

5. Lack of Feature Flags

Applications built without feature flags are much more difficult to degrade gracefully.  If your site gets bombarded by a spike in web traffic and you aren't magically able to scale and expand capacity, built-in feature flags give the operations team a way to dial down the load on the servers without the site going down.  This can buy you time while you scale your webservers and/or database tier, or even retrofit your application to allow multiple read and write databases.

Without these switches in place, you limit scalability and availability.
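One minimal sketch of how such switches might be stored, assuming a database-backed approach (the table and flag names are hypothetical; in practice you would cache these lookups so the flag check doesn't itself add load):

-- a table the operations team can flip at runtime
CREATE TABLE feature_flags (
  name    VARCHAR(64) PRIMARY KEY,
  enabled TINYINT NOT NULL DEFAULT 1
);

-- under heavy load, dial down an expensive feature
UPDATE feature_flags SET enabled = 0 WHERE name = 'related_stories';

-- the application checks the (cached) flag before rendering the feature
SELECT enabled FROM feature_flags WHERE name = 'related_stories';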

Also: Is high availability overrated? The myth of five nines…



5 Scalability Pitfalls to Avoid

1. Object Relational Mappers

Software development has always made use of libraries, off-the-shelf components shared between different projects.  These allow you to stand on the shoulders of others and build bigger things.  Frameworks do the same thing: they provide a context to build on.  Ruby on Rails, for example, provides a great starting framework for building web applications, managing sessions in an elegant way.


Top 3 Questions From Clients

1. This page or area of the website is very slow, why?

There are a lot of components that make up modern internet websites, and a lot of places to get stuck in the mud.  Website performance starts with the browser: what caching it is doing, its bandwidth to your server, what the webserver is doing (whether it caches, and how), whether the webserver has sufficient memory, then what the application code is doing, and lastly how it interacts with the backend database.


Cloud for Burst Capacity

One very strong case for cloud computing is that it can satisfy applications with seasonal traffic patterns.  One way to test the advantages of the cloud is through a hybrid approach.

Cloud infrastructure can be built completely through scripts.  You can spin up specific AMIs or machine images, automatically install and update packages, install your credentials, start up services, and you're running.

All of these steps can be performed in advance of your need at little cost.  Simply build and test.  When you're finished, shut down those instances.  What you walk away with is scripts.  What do we mean?

The power here is that you carry zero costs for that burst capacity until you need it.  You've already built the automation scripts and have them in place.  When your capacity planning warrants it, spin up additional compute power and watch your internet application scale horizontally.  Once your busy season is over, scale back and disable your usage until you need it again.


3 Ways to Boost Cloud Scalability

Deploying in the Amazon cloud is touted as a great way to achieve high scalability while paying only for the computing power you use. How do you get the best scalability from the technology?


5 Ways to Boost MySQL Scalability

There are a lot of scalability challenges we see with clients over and over. The list could easily include 20, 50 or even 100 items, but we've shortened it to the five biggest issues we see.

1. Tune those queries

By far the biggest bang for your buck is query optimization. Queries can be functionally correct and meet business requirements without ever being stress tested for high traffic and high load. This is why we often see clients with growing pains and scalability challenges as their site becomes more popular. This also makes sense. It wouldn't necessarily be a good use of time to tune a query for some page off in a remote corner of your site that doesn't receive real-world traffic. So some amount of reactive tuning is common and appropriate.

Enable the slow query log and watch it. Use mk-query-digest, the great tool from Maatkit, to analyze the log. Also make sure the log_queries_not_using_indexes flag is set.  Once you've found a heavy, resource-intensive query, optimize it!  Use the EXPLAIN facility, use a profiler, look at index usage and create missing indexes, and understand how it is joining and/or sorting.
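For example, in my.cnf (option names vary a bit between MySQL versions; the one-second threshold is illustrative):

slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1
log_queries_not_using_indexes = 1

Then take a query the digest surfaces and look at its plan (the query itself is hypothetical):

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;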

Also: Why generalists are better at scaling the web

2. Employ Master-Master Replication

Master-master active-passive replication, otherwise known as circular replication, can be a boon for high availability, but also for scalability.  That's because you immediately have a read-only slave for your application to hit as well.  Many web applications exhibit an 80/20 split: 80% of activity is reads or SELECTs, and the remainder INSERTs and UPDATEs.  Configure your application to send read traffic to the slave, or rearchitect so this is possible.  This type of horizontal scalability can then be extended further, adding additional read-only slaves to the infrastructure as necessary.
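As a rough sketch, the my.cnf settings that keep auto-increment keys from colliding between the two masters look like this (the server IDs are illustrative):

# on master A
server-id = 1
log-bin = mysql-bin
auto_increment_increment = 2
auto_increment_offset = 1

# on master B
server-id = 2
log-bin = mysql-bin
auto_increment_increment = 2
auto_increment_offset = 2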

If you’re setting up replication for the first time, we recommend you do it using hotbackups. Here’s how.

Keep in mind that MySQL replication has a tendency to drift from the master, often silently. Data can really get out of sync without throwing errors! Be sure to bulletproof your setup with checksums.
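For a quick manual spot check (table names are hypothetical, and the tables should be quiet while you compare), MySQL's built-in CHECKSUM TABLE can be run on the master and then the slave; Maatkit's mk-table-checksum automates this across an entire replication setup:

-- run on the master, then on the slave, and compare the results
CHECKSUM TABLE customers, orders;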

Related: Why you can’t find a MySQL DBA

3. Use Your Memory

It sounds very basic and straightforward, yet details are often overlooked.  At minimum be sure to set these (a sample configuration sketch follows the list):

  • innodb_buffer_pool_size
  • key_buffer_size (MyISAM index caching)
  • query_cache_size – though beware of issues on large SMP boxes
  • thread_cache & table_cache
  • innodb_log_file_size & innodb_log_buffer_size
  • sort_buffer_size, join_buffer_size, read_buffer_size, read_rnd_buffer_size
  • tmp_table_size & max_heap_table_size
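As a rough sketch only, a starting point for a dedicated InnoDB server with 16GB of RAM might look like the following.  Every value here is illustrative and must be tuned against your own workload and MySQL version:

innodb_buffer_pool_size = 12G
innodb_log_file_size = 256M
innodb_log_buffer_size = 8M
key_buffer_size = 64M            # keep small if you have little MyISAM data
query_cache_size = 64M           # beware contention on large SMP boxes
thread_cache_size = 16
table_cache = 2048
sort_buffer_size = 2M
join_buffer_size = 2M
read_buffer_size = 1M
read_rnd_buffer_size = 1M
tmp_table_size = 64M
max_heap_table_size = 64M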

Read: Why Twitter made a shocking admission about their data centers in the IPO

4. RAID Your Disk I/O

What is underneath your database?  You don't know?  Well please find out!  Are you using RAID 5?  That's a big performance hit.  RAID 5 is slow for inserts and updates, and it is almost non-functional during a rebuild if you lose a disk.  Very, very slow performance.  What should you use instead?  RAID 10, mirroring and striping, with as many disks as you can fit in your server or RAID cabinet.  A database does a lot of disk I/O even if you have enough memory to hold the entire database.  Why?  Sorting requires rearranging rows, as do GROUP BY, joins, and so forth.  Plus the transaction log is disk I/O as well!

Are you running on EC2?  In that case EBS is already fault tolerant and redundant.  So give your performance a boost by striping only (RAID 0) across a number of EBS volumes using Linux md software RAID.

Also check out our Intro to EC2 Cloud Deployments.

Also of interest: Autoscaling MySQL on EC2.

Also: Why startups are trying to do without techops and failing

5. Tune Key Parameters

These additional parameters can also help a lot with performance.

innodb_flush_log_at_trx_commit=2

This speeds up inserts & updates dramatically by being a little bit lazy about flushing the InnoDB log buffer; the tradeoff is that an operating system crash or power failure can lose up to about a second of committed transactions.  You can do more research yourself, but for most environments this setting is recommended.

innodb_file_per_table

InnoDB was developed, like Oracle, around the tablespace model for storage.  Unfortunately the default setting, a single shared tablespace, turns out to be a performance bottleneck, with contention for file descriptors and so forth.  This setting makes InnoDB create a separate tablespace and underlying datafile for each table, just like MyISAM does.

Read this: Why a four letter word still divides dev and ops



Migrating to the Cloud – Why and why not?

A lot of technical forums and discussions have highlighted the limitations of EC2, and how it loses on performance when compared to physical servers of equal cost.  They argue that you can get much more hardware and bigger iron for the same money.  So it then seems foolhardy to turn to the cloud.  Why this mad rush to the cloud, then?  Of course if all you're looking at is performance, it might seem odd indeed.  But another way of looking at it: if performance is not as good, performance is clearly not the driving factor behind cloud adoption.

CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”

Another question: "Is it a good idea to deploy your database in the cloud?"  It depends!  Let's take a look at some of the strengths and weaknesses, then you decide.

8 big strengths of the cloud

  1. Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
  2. Easier roll out of patches and upgrades
  3. Reduced operational headache – scripting and automation becomes central
  4. Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
  5. Low initial investment
  6. Auto-scaling – set thresholds and deploy new capacity automatically
  7. Easy compromise response – take the server offline and spin up a new one
  8. Easy setup of dev, qa & test environments

Some challenges with deploying in the cloud

  1. Big cultural shift in how operations is done
  2. Lower SLAs and less reliable virtual servers – mitigate with automation
  3. No perimeter security – new model for managing & locking down servers
  4. Where is my data? – concerns over compliance and privacy
  5. Variable disk performance – can be problematic for MySQL databases
  6. New procurement process can be a hurdle

Many of these challenges can be mitigated.  The promise of infrastructure deployed in the cloud is huge, so easing in with gradual adoption is perhaps the best option for many firms.  Mitigate the weaknesses of the cloud as follows:

  • Use encrypted filesystems and backups where necessary
  • Keep offsite backups in-house or at an alternate cloud provider
  • Mitigate variable EBS performance by caching at every layer of your application stack
  • Employ configuration management & automation tools such as Puppet & Chef

Quora discussion – Why or why not to migrate to the cloud?


Semi-Synchronous Replication – What is it and why is it important?

Replication in MySQL allows you to copy and replay changes from your primary database to an alternate backup or slave database.  In MySQL this is an asynchronous process, meaning changes are not applied on the slave at the same moment they commit on the primary.  It could happen a second later, or minutes later.  In fact, sometimes the secondary database can get bogged down by heavy load, because transactions are applied serially there while they execute in parallel sessions on production. You can find out how far behind the master you are with SHOW SLAVE STATUS, and look at:

Seconds_Behind_Master: 8

If you are sending SELECT or READ traffic from your website to the slave database, you may experience stale reads.  For instance, if you comment on a blog posting and refresh the page within 8 seconds on the server above, the page would not display the comment just posted!

As it turns out, the Maatkit toolkit has a tool called mk-slave-prefetch which can help with slow performance on the slave.  Since most of the work of applying inserts, updates and deletes involves fetching the right rows, running a similar SELECT query in advance of the actual transaction warms up the caches and can speed things up dramatically.  It may be enough for your needs; test it first and find out.

Semi-synchronous replication comes to the rescue if you really need this type of guarantee, but it comes at a cost.  You enable it on the master, then on the slave, and restart the slave.  Whenever the master commits a transaction, it will block until one of two things happens.  It must either get an acknowledgement from at least one slave that it has received and logged the transaction downstream, or it must reach the timeout threshold.
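For reference, enabling it on MySQL 5.5 or later looks roughly like this; the plugins ship with the server on Linux, and the timeout value (in milliseconds) is illustrative:

-- on the master
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 500;  -- fall back to async after 500ms

-- on each slave
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;  -- restart the IO thread so the setting takes effect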

This type of arrangement may sound fine in theory as such blocking would often be less than a second.  However in the microscopic world of high speed, high transaction, high traffic websites, this may be an eternity, and one which can slow the database down substantially.  So test first before assuming it’s a solution that will help you.

Quora discussion of Semi-synchronous Replication


Sharding – What is it and why is it important?

Sharding is a way of partitioning your datastore to benefit from the computing power of more than one server.  For instance, many web-facing databases are sharded on user_id, the unique serial number your application assigns to each user of the website.
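As a minimal illustration of the routing logic, assume four shards and simple modulo distribution (the shard count and table name are hypothetical):

-- the application picks the backend before it connects:
--   shard_number = user_id MOD 4, so user 1234 lives on shard 2
-- then it runs an ordinary query against that one backend:
SELECT * FROM users WHERE user_id = 1234;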

Sharding can bring you the advantages of horizontal scalability by dividing up data into multiple backend databases.  This can bring tremendous speedups and performance improvements.

Sharding, however, has a number of important costs.

  • reduced availability
  • higher administrative complexity
  • greater application complexity

High availability is a goal of most web applications, as they aim for always-on, 24×7×365 availability.  By introducing more servers, you have more components that must work flawlessly.  If the expected downtime of any one backend database is half an hour per month, and your application needs all five of its shards available to function, your expected downtime has now increased by a factor of five, to 2.5 hours per month.

Administrative complexity is an important consideration as well.  More databases means more servers to backup, more complex recovery, more complex testing, more complex replication and more complex data integrity checking.

Since sharding spreads chunks of your data across various servers, your application must accept the burden of deciding where the data lives, and fetching it from there.  In some cases the application must fall back to alternate decisions if it cannot find the data where it expects.  All of this increases application complexity and is important to keep in mind.

Sean Hull asks on Quora – What is Sharding and why is it important?


Capacity Planning – What is it and why is it important?

Look at your website's current traffic patterns, pageviews or visits per day, and compare that to your server infrastructure. In a nutshell, your current capacity measures the ceiling your traffic could grow to while still being supported by your current servers. Think of it as the horsepower of your application stack: load balancer, caching server, webserver and database.

Capacity planning seeks to estimate when you will reach the limits of your current infrastructure, by doing load testing and stress testing. With traditional servers, you estimate how many months you will be comfortable with currently provisioned servers, and plan to bring new ones online and into rotation before you reach that traffic ceiling.

Your reaction to capacity and seasonal traffic variations becomes much more nimble with cloud computing solutions, as you can script server spin-ups to match capacity and growth needs. In fact you can implement auto-scaling as well, setting rules and thresholds to bring additional capacity online – or offline – automatically as traffic dictates.

In order to do proper capacity planning, you need good data. Pageviews and visits per day can come from your analytics package, but you'll also need more detailed metrics on what your servers are doing over time. Packages like Cacti, Munin, Ganglia, OpenNMS or Zenoss can provide very useful data collection with very little overhead on the server. With these in place, you can view load average, memory & disk usage, and database or webserver threads, and correlate all that data back to your application. What's more, with time-based data and graphs you can compare capacity trends against change management and deployment records, to determine how new code rollouts affect capacity requirements.

Sean Hull asks about Capacity Planning on Quora.