If you’re like a lot of folks you’re building an application in AWS & using a NoSQL database for persistent data. Dynamodb fits the bill nicely. Little or no ops to worry about, at least in the traditional sense.
However there are knobs to turn & dials to set. Here are a few you should be thinking about.
1. You can replicate across regions
Dynamodb introduced a feature in 2015 called streams. If you come from the relational database world, you can think of streams like a transaction log. It captures before & after image of your data. Couple those with useful lambda functions, and you have triggers that can do anything you want.
Dynamodb automatically creates and manages an index on the primary key. But chances are that your application will read data based on other columns too. You can create secondary indexes on these other columns, reducing your data access patterns. Without an index Dynamodb would have to scan every row to find your data, but the index can dramatically reduce this, and making data retrieval faster too!
That’s right, if you thought NoSQL meant no SQL you were only half right. By loading your Dynamodb data into HDFS, you can allow elastic map reduce to have at it. And thus open the door to use HiveQL to query the data the way you wanted to in the first place.
Convoluted? Yes. But this is the brave new world of the cloud!
By default dynamo is partitioning your data behind the scenes. Because that’s what good distributed databases are supposed to do. It does so using the primary key to figure out where the data should go. And just like with Redshift you have option of also using sort key to help the optimizer figure out how to distribute the data. This is important. Going across those different instances brings a lot of latency costs that will surprise you.
I was sifting through the CTO school email list recently, and the discussion of performance tuning came up. One manager had posted asking how to organize sprints, and break down stories for the process.
“Agile is not right for fixing performance issues.”
I agree with him & here’s why.
1. Agile roadblocks
At a very high level, agile seeks to organize work around sprints of a few weeks, and sets of stories within those sprints. The assumption here is that you have a set of identified issues. With software development, you have features you’re building. With performance tuning, it’s all about investigation.
How long will it take to solve the crime? Very good question!
Are you seeing general site slowness? Is there a particular feature that loads extremely slowly? Or is there a report that runs forever? Whatever it is, you must first be able to reproduce it. If it’s general site slowness, identify when it is happening.
Once you’ve reproduced your problem, next you want to start digging. Looking at logfiles can help you find errors, such as timeouts. The database has a slow query log, which you’ll definitely want to review. Slow queries can be surfaced by new code deploys, or middleware in front of your database, such as an ORM.
If you find your logfiles aren’t enabled, it’s a good first step to turn them on. Also look at how you’re caching. The browser should be directed to cache, assets should be on CDN, a page cache should protect your application server, and an object cache in front of your database.
As you dig deeper into your problem, you’ll likely uncover the root of your scalability problem. Likely causes include synchronous, serial or locking processes & requests, object relational modelers, lack of caching or new code that has not been tuned well.
This is what I think of as the fun part. You’ve measured the issues, found the problem. Now it’s time to fix it. This is an exciting moment, to bring real benefit to the business. Eliminating a performance problem can feel like springtime at the end of a long cold winter!
Another thing their dashboard does is illustrate their infrastructure clearly.
I can’t count the number of startups I’ve worked at where there are extra services running, odd side utility boxes performing tasks, and general disorganization. In some cases engineering can’t tall you what one service or server does.
By outlining the architecture here, they create a living network diagram that everyone benefits from.
If you scroll to the very bottom of the dashboard, you have two metrics. Homepage load time, and their “questions” page. The homepage is a metric everyone can look at, as many customers will arrive at your site though this portal. The questions page will be different for everyone. But there will be some essential page or business process that it highlights.
By sifting down to just these two metrics, we focus on what’s most important. All of this computing power, all these servers & networks are all working together to bring the fastest page load times possible!
This performance page doesn’t just face the business. It also faces the customers. It lets them know how important speed is, and can underscore how serious the business takes it’s customers. Having an outage or a spike that’s slowing you down. Customers have some transparency into what’s happening.
The FCC proposal will threaten *ANY* business that uses the internet to reach it’s customers.
Any business? Quite a sweeping statement. Strikes fear into me that’s for sure… And if you read through the comments, the debate is equally fierce. One side says net neutrality is socialism! The other side says anyone against net neutrality is a shill for Comcast or Verizon! Battle lines drawn!
1. Are all businesses at risk?
Isn’t the idea that ETSY will perish overstated? Are they a high bandwidth company? Are they trying to stream video?
Is the entire Etsy community alarmed? Isn’t that a rather broad statement?
To be sure ending net neutrality will impact some businesses. Perhaps one reason VC’s like Fred Wilson are so concerned about Net Neutrality isn’t for the freedom of millions of internet users, but the threat to disruptive businesses, the startups that VC’s directly invest in.
Here again some of this debate seems overstated. I remember using the internet on a dialup modem. 300 baud, was about the speed at which you can type. Then along came 14.4, 28k and upward speeds climbed. All the while the internet was usable. Could I do all the things I can today, nope.
Even if these horrible Comcast’s & Verizon’s reduce speeds by 100 times, they will still be plenty fast for most internet users. Sure streaming video would be impacted, and yes streaming music would be impacted. But for end users, I would argue most would not be impacted. It is rather the disruptive startups & businesses that would be most impacted.
In the mid-nineties, before the dot-com bubble, there was a huge raging debate about even having commercial entities on the internet at all. Enlightened internet cognoscenti considered it an abomination.
But the real world pushed it’s nose in, and today we take as a given.
“Research from Google & Microsoft shows that delays of milliseconds result in fewer page views and fewer sales in both the short & long term”. Yep, that’s a fact. The research shows this. But what do we take away from that?
As a performance and scalability consultant I see a *TON* of websites that have huge delays, well over tiny millisecond ones that Google frets over. Internet startups struggle with performance every day.
What’s the irony? Slowdowns that Comcast or Verizon might introduce to end users pale in comparison with these larger systemic problems.
5. Any lessons from sites of New York Fashion Week?
I like the Pingdom speed test tool. I used it to track the speed of some of the websites & blogs that are big for NYFW. Here’s what I found:
What do you see? Take a look at the SIZE column. Notice something strange? The LARGEST sites, in terms of images, css & assets aren’t necessarily the SLOWEST! That’s a funny result if you consider net neutrality. If you think the network speed is the same for all websites, shouldn’t the smallest pages load fastest?
Not true at all. It’s a very simplistic way of viewing things. Fashionista.com for example is doing a ton of tuning behind the scenes. As you can see it is making their site far and away the fastest! Network bandwidth and net neutrality be damned!
Techops is operational excellence. It’s the handoff when code is complete. These are the folks who are up at 2am when a website is down. They manage servers, keep the pipes clean, and the hackers out. They also help plan for capacity needs, and may help with load testing too.
If you’re new to technology, imagine a movie set. Story writers (programmers) have already done their part. The producers (venture capital folks) have financed the project. The director (architect) is there trying to put the vision together. But the folks who manage everything on set, from sound guys, camera guys, lighting people, and all the coordination, this is operations. In web application deployments it is devops, sysops or techops.
Notice how phenomenally well Obama for America project was run. Like a finely tuned machine. Harper Reed and team pulled off one of the most data backed election campaigns in history.
That project used AWS cloud technologies to the fullest, from devops tools like Puppet and Asgard, collaboration tools like Campfire & Github, and superb monitoring & instrumentation tools NewRelic and Chartbeat.
Clearly Obama knows how to run an election. Something is drastically different with the healthcare.gov project. Too many cooks in the kitchen, perhaps?
Many popular news outlets covered the outage, but most pointed to “bugs”, which caused the outage. But when a site dies under load, while it’s working in test & Q/A, that’s a failure of load testing, and capacity planning.
Modern software projects take advantage of continuous integration & agile methods. That is they make small incremental changes. Developers build unit tests, and the code is always in a working state. There is no multi-month dev cycle, where your current software is in doubt.
Hayden James investigated in depth, and found healthcare.gov severely lacking. Again this is a huge failure in techops, sysops or devops. It’s not a bug, and not something the developers are responsible to deliver.
The internet stack is a complex infrastructure of interlocking components. An scalability engineer must be adept at Linux, plus webservers, caching servers, search servers, automation services, and relational databases on the backend. We think a generalist with a broad base of experience is most suited to the job of scalability engineer.
Lots and lots of web applications need to page through information. From customer records, to the albums in your itunes collection. So as web developers and architects, it’s important that we do all this efficiently.
Start by looking at how you’re fetching information from your MySQL database. We’ve outlined three ways to do just that.
Of course such a solution would only work if you were paging by ID. If you page by name, it might get messier as there may be more than one person with the same name. If ID doesn’t work for your application, perhaps returning paged users by USERNAME might work. Those would be unique:
SELECT id, username
WHERE username > ‘email@example.com’
ORDER BY username LIMIT 10;
Paging queries can be slow with SQL as they often involve the OFFSET keyword which instructs the server you only want a subset. However it typically scans collects and then discards those rows first. With deferred join or by maintaining a place or position column you can avoid this, and speedup your database dramatically.
2. Try using a Deferred Join
This is an interesting trick. Suppose you have pages of customers. Each page displays ten customers. The query will use LIMIT to get ten records, and OFFSET to skip all the previous page results. When you get to the 100th page, it’s doing LIMIT 10 OFFSET 990. So the server has to go and read all those records, then discard them.
SELECT id, name, address, phone FROM customers ORDER BY name LIMIT 10 OFFSET 990;
MySQL is first scanning an index then retrieving rows in the table by primary key id. So it’s doing double lookups and so forth. Turns out you can make this faster with a tricky thing called a deferred join.
The inside piece just uses the primary key. An explain plan shows us “using index” which we love!
ORDER BY name
LIMIT 10 OFFSET 990;
Now combine this using an INNER JOIN to get the ten rows and data you want:
SELECT id, name, address, phone
INNER JOIN (
ORDER BY name
LIMIT 10 OFFSET 990)
AS my_results USING(id);
That’s pretty cool!
3. Maintain a Page or Place column
Another way to trick the optimizer from retrieving rows it doesn’t need is to maintain a column for the page, place or position. Yes you need to update that column whenever you (a) INSERT a row (b) DELETE a row ( c) move a row with UPDATE. This could get messy with page, but a straight place or position might work easier.
SELECT id, name, address, phone
WHERE page = 100
ORDER BY name;
All servers use disk to store files. Operating system libraries, webserver & application code, and most importantly databases all use disk constantly.
So disk speed is crucial to server speed.
Disk speed is crucial for MySQL databases. It has been a real challenge in multi-tenant environments like Amazon’s EBS. The provisioned IOPS feature addresses this head on, allowing customers to lock in great MySQL database performance!
Since Amazon is a multi-tenant environment, other customers are using that same network, and hitting those same disks. So if your neighbors are seeing a lot of traffic to disk, your web application can slow down. Not good!
What is Provisioned IOPS
We’ll agree that it’s one of the worst branded features ever, but you should know about it and use it, especially for your MySQL databases.
Provisioned means that you’ll lock in performance in advance, and IOPS stands for I/O operations. Think of it as google juice for your cloud database servers!
Join 19k others and follow Sean Hull on twitter @hullsean.
1. Slow Disk I/O – RAID 5 – Multi-tenant EBS
Disk is the grounding of all your servers, and the base of their performance. True with larger and larger main memory, much is available in cache, a server still needs to constantly read from disk and flush things from memory. So it’s a very very important component to performance and scalability.
What’s wrong with Raid 5?
Raid 5 was designed to give you more space, using fewer disks. It’s often used in a server with few slots or because ops misunderstand how bad it will impact performance. On a database server it can be particularly bad.
All writes see a performance hit. What’s worse is if you lose a disk, the RAID though technically still on line, will perform SO SLOWLY as to be offline. And a rebuild takes many hours. Worse still is the risk to lose another drive during that rebuild. What if you have order a drive and it takes a couple of days?
RAID 10 is the solution. Mirror each set of two drives, then stripe over those. Even with only four slots available, it’s worth it. Good read performance, good write performance, and protection.
What the heck is multi-tentant?
In the cloud, you share servers, network & disk just like you do apartments in a building. Hence the name. Amazon’s EBS or elastic block storage, extends this metaphor, offering you the welcome flexibility of a storage network. But your bottleneck can be fighting with other tenants on that same network.
Default servers do have this problem, however AWS has addressed this serious problem with a little known but VERY VERY useful feature called Provisioned IOPS. It’s a technical name, but means you can lock in reliable disk I/O. Just what the scalability doctor ordered.
MySQL is good at a lot of things, but it’s not ideal for managing application queues. Do you have a table like JOBS in your database, with a status column including values like “queued”, “working”, and “completed”? If so you’re probably using the database to queue work in your application.
It’s not a great use of MySQL because of locking problems that come up, as well as the search and scan to find the next task.
Luckily their are great solutions for developers. RabbitMQ is a great queuing solution, as is Amazon’s SQS solution. What’s more as external services they’re easier to scale.
Scalability becomes key to your business, as you customer base grows. But it doesn’t have to be impossible. Disk I/O, caching, queuing and searching are all key areas where you can make a big dent, in a manageable way. Juggle your technical debt too, and you’re golden!
Oracle has full text search support, why shouldn’t we assume the same in MySQL? Well MySQL *does* have this, but in many versions only with the old MyISAM storage engine. It has it’s set of corruption problems, and isn’t really very performant.
Better to use a proven search solution like Apache Solr. It is built specifically for search, includes excellent library support for developers of most modern web languages and best of all is easy to SCALE! Just add more servers on your network, or distributed globally.
For folks interested in the bleeding edge, Fulltext is coming to Innodb crash safe & transactional storage engine in 5.6. That said you’re still probably better off going with an external solution like Solr or Sphinx and the MySQL Sphinx SE plugin.
Cache, cache, and cache some more. Your webservers should use a solid memcache or other object cache between them & the database. All those little result sets will sit in resident memory, waiting for future web pages that need them.
Next use a page cache such as varnish. This sits in front of the webserver, think of it as a mini-webserver that handles very simple pages, but in a very high speed way. Like a pack of motorbikes riding down an otherwise packed freeway, they speed up your webserver to do more complex work.
Browser caching is also important. But you can’t get at your customers browsers, or can you? Well not directly, but you can instruct them what things to cache. Do that with proper expires headers. Have your system administrator configure apache to support this.
Technical debt can bite. What is it? As you’re developing an new idea, you’ll build prototypes. As those get deployed to customers, change gets harder, and past things you glossed over because problems. One team leaves, and another inherits the application, and the problems multiple. Overtime you’re building your technical debt as your team spends more time supporting old code and fixing bugs, and less time building new features. At some point a rewrite of problem code becomes necessary.
A lot of firms come to us with a specific scalability problem. “Our user base is growing rapidly and the website is falling over!” Or they’re selling more widgets, “Our shopping cart is slowing down and we’re seeing users abandon their purchases”. These are real startup growing pains, so what to do?
We like to take a measured approach with these types of challenges, so we thought it would be helpful to run through a hypothetical scenario and see how we work.
First we talk on the phone, or meet face to face and discuss what’s happening. Do you have one page that’s problematic? Is the website slow during certain hours? Or are you seeing erratic behavior and can’t point to a single source?
From there we outline a course of action, based on:
o talking with team, devs & architects
o reviewing systems first hand
o identifying bottlenecks and trouble spots
This with this outline we’ll include an estimate of the number of work days it’ll take to complete. We’ll then send that back to you for review, exchange a deposit and set a start date.
2. Meet team & discuss architecture
Next we’ll meet the team and review the problems in more technical detail. If you’re in NYC we’ll probably make a stop into your offices and have a warm meet & greet. If you’re located further afield we can either meet over a skype call, or arrange for us to travel to your location for the start of the engagement.
3. Measure current throughput
In order to get a sense of the current state of the systems we’ll measure some system metrics. This could be load average or queries per second or other MySQL internal metrics. We’ll also look at some business metrics such as speed of an ecommerce checkout, or a speed test on a particularly slow page.
These metrics are designed to create a baseline of where things are before any changes are made.
[quote]Measuring both business and system metrics before and after changes, allow a rough ROI measurement to be done. This goes a long way towards justifying the expense of a performance review, current and future.[/quote]
4. Review systems, configurations & setups
Next we’ll jump on the various systems and review configurations. This includes webservers, caching servers and the database servers as necessary. We’ll review memory settings, important configurations, all the dials and switches.
Along with this we’ll also review development and architecture. Are you using Java with Hibernate a popular ORM? Or perhaps CakePHP? Are you writing custom SQL code? Are developers up to speed with EXPLAIN and query profiling? For that matter is code in version control?
Perhaps the most essential and useful part of an initial engagement is our overall findings and review report. We’ve found these are very valuable to firms as they speak to a lot of folks up and down the business hierarchy. They speak to management about high level architectural problems and structural or process related challenges. And they can speak well to developers and operations teams as they provide a third party birds eye view of day-to-day activities.
From here we’ll meet again. In particular we’ll review the actionable advice. Some changes will be low cost, requiring no downtime, while others might require a downtime window. Further medium term changes might require refactoring some code and deploying. Typically the larger longer term architecture changes will also be outlined.
Based on time & costs, we’ll decide together which changes are a priority. Obviously we’ll want to move on low hanging fruit first, and move forward from there.
Once we’ve decided which changes we’ll make, we’ll schedule downtime windows as needed and make the changes to systems. From there we’ll carefully observe everything for stability, and no adverse affects.
8. Measure throughput again
Based on the throughput measurements in #3 above, we’ll perform those same benchmarks again. We’ll check low level system metrics, along with higher level business & user based throughput. Both of these are important as they can provide different perspectives on changes made.
For example if the system metrics improve markedly, but the business or user metrics do not, we know are change had some affect on overall performance, but likely we did not identify the one which directly is causing the business slowdown.
9. Summarize findings & performance gain
In the most likely case they both improve markedly, and we can measure the improvements from our entire process of performance review.
This can be helpful and measuring overall return on investment for the engagement. ROI is obviously an important exercise as we want to know that the money is well spent.
10. Document solutions & recommendations
The last step is to document what we did and what we learned. This allows us to carry forward that knowledge and keep applying it to the development and operations process. This allows the business to continue adding value from the engagement even after it’s completed.