Did MySQL & Mongo have a beautiful baby called Aurora?

Amazon recently announced RDS Aurora, a new addition to its database-as-a-service offerings.

Here’s Mark Callaghan’s take on what’s happening under the hood, along with thoughts from Fusheng Han.

Amazon is uniquely positioned with RDS to take on offerings like Clustrix. So it’s definitely worth reading Dave Anselmi’s take on Aurora.

Join 28,000 others and follow Sean Hull on Twitter @hullsean.

1. Big availability gains

One of the big improvements Aurora offers is around availability. You can replicate with Aurora’s own mechanism, or fall back to traditional MySQL binlog replication. Aurora also keeps two copies of your data in each of three availability zones, for six copies in total.

All of this runs over Amazon’s SSD-backed storage network, which means it’ll be very fast indeed.
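
Here’s a minimal sketch of what managing that replication could look like with boto3, the AWS SDK for Python. The region, cluster, and instance identifiers are all placeholders, and it assumes an Aurora cluster already exists:

```python
import boto3  # pip install boto3

rds = boto3.client("rds", region_name="us-east-1")

# Add an Aurora replica to an existing cluster. Aurora replicas read from
# the cluster's shared storage volume, so there's no binlog to configure.
rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-replica-1",
    DBClusterIdentifier="my-aurora-cluster",
    Engine="aurora",
    DBInstanceClass="db.r3.large",
)

# List the cluster's members and their current roles.
cluster = rds.describe_db_clusters(
    DBClusterIdentifier="my-aurora-cluster"
)["DBClusters"][0]
for member in cluster["DBClusterMembers"]:
    role = "writer" if member["IsClusterWriter"] else "reader"
    print(member["DBInstanceIdentifier"], role)
```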

Read: What’s best RDS or MySQL? 10 Use Cases

2. SSD means 5x faster

The Amazon RDS Aurora FAQ claims it’ll be 5x faster than MySQL on equivalent hardware, by making use of Amazon’s proprietary SSD storage network. This will be a welcome improvement for anyone already running MySQL, whether on their own servers or on RDS.

Also: Is MySQL talent in short supply?

3. Failover automation

Unplanned failover takes just a few minutes. Here customers will really benefit from the automation Amazon has built around this process. Existing MySQL customers can do all of this themselves, of course, but it typically requires an operations team to anticipate & script the necessary steps.
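
For the curious, here’s roughly what triggering a failover and waiting for recovery looks like through the API, sketched with boto3. Identifiers are placeholders and error handling is omitted:

```python
import time

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Kick off a failover; Aurora promotes one of the replicas to primary.
rds.failover_db_cluster(DBClusterIdentifier="my-aurora-cluster")

# Poll until the cluster reports "available" again.
while True:
    status = rds.describe_db_clusters(
        DBClusterIdentifier="my-aurora-cluster"
    )["DBClusters"][0]["Status"]
    if status == "available":
        break
    time.sleep(15)
```

Note the script only requests the failover and waits; the promotion logic itself is Amazon’s problem, which is the whole point.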

Related: Will Oracle Kill MySQL?

4. Incremental backups & recovery

The new Aurora supports incremental backups & point-in-time recovery. Traditionally this is a fairly manual process. In my experience MySQL customers are either unaware of the feature, or uninterested in using it due to its complexity; the attitude is, restore last night’s backup and avoid the hassle.

I predict automation around this will be a big win for customers.
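
As a rough sketch of what that automation looks like from the API side, again using boto3 with placeholder names and timestamp. Note that Aurora restores into a new cluster rather than overwriting the old one:

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Restore the cluster's state as of a specific timestamp into a NEW
# cluster; the source cluster is left untouched.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="my-aurora-restored",
    SourceDBClusterIdentifier="my-aurora-cluster",
    RestoreToTime=datetime(2014, 11, 20, 3, 0, tzinfo=timezone.utc),
)

# The restored cluster starts with no instances; add one to connect to it.
rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-restored-1",
    DBClusterIdentifier="my-aurora-restored",
    Engine="aurora",
    DBInstanceClass="db.r3.large",
)
```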

Check out: Are SQL Databases dead?

5. Warm restarts

RDS Aurora separates the buffer cache from the MySQL process. Amazon has probably accomplished this by recoding parts of the stock MySQL kernel. What it means is that the cache can survive a database restart. Your database then starts with a warm cache, avoiding any service brownout.

I would expect this is a feature that looks great on paper, but one that customers will rarely benefit from.
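
If you want to check whether your workload actually benefits, a crude test is to time the same cache-sensitive query before and after a restart. A sketch using PyMySQL, with placeholder endpoint, credentials, and a hypothetical big_table:

```python
import time

import pymysql  # pip install pymysql

# Placeholder endpoint and credentials -- substitute your own.
ENDPOINT = "my-aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com"

def timed_scan():
    # Time a cache-sensitive query: a cold buffer cache reads from
    # storage, a warm one serves mostly from memory.
    conn = pymysql.connect(host=ENDPOINT, user="admin",
                           password="secret", database="test")
    with conn.cursor() as cur:
        start = time.monotonic()
        cur.execute("SELECT COUNT(*) FROM big_table")
        cur.fetchone()
        elapsed = time.monotonic() - start
    conn.close()
    return elapsed

if __name__ == "__main__":
    print(f"scan took {timed_scan():.2f}s")
```

Run it with a warm cache, restart the database, and run it again. On stock MySQL the second run is slow until the buffer pool refills; with Aurora’s survivable cache the two times should be much closer.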

See also: The Myth of Five Nines – Is high availability overrated?

Unanswered questions

The FAQ says point-in-time recovery up to the last five minutes. What happens to data in those last five minutes?

Presumably Aurora’s duplication & read replicas provide this additional protection.
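
Anurag from AWS fills in the details in the comments below: Aurora writes go to a 4/6 quorum across six copies in three AZs. Here’s a toy Python model of that arithmetic; the copy counts and quorum sizes come from his comment, the rest is purely illustrative:

```python
# Toy model of a 4/6 write quorum over six copies of the data.
COPIES = 6
WRITE_QUORUM = 4
READ_QUORUM = COPIES - WRITE_QUORUM + 1  # = 3, so reads overlap writes

for lost in range(COPIES + 1):
    surviving = COPIES - lost
    writes = "ok" if surviving >= WRITE_QUORUM else "LOST"
    reads = "ok" if surviving >= READ_QUORUM else "LOST"
    print(f"{lost} copies lost -> writes {writes}, reads {reads}")
```

Losing three copies breaks the write quorum while reads continue, and it takes a fourth loss to break reads, matching the failure tolerances described in the comments.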

If Amazon implemented Aurora as a new storage engine, doesn’t that mean new code?

As with anything your mileage may vary, but InnoDB has been in the wild for many years. It is widely deployed, and thus tested in a wide variety of environments. Aurora, by contrast, is a very new experiment.

Will real-world customers actually see 500% speedup?

Again your mileage may vary. Let’s wait & see!

Related: 5 Things toxic to scalability

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters.

  • Anurag@AWS

    For point in time recovery, we are talking about restoration from a backup made to S3. We try to ensure that backups to S3 are within 5 minutes of the current point in time. If you have to restore to a new volume, there is potential data loss. This exists with any secondary backup solution as there is a time lag from writes to primary storage.

    Our primary model for handling storage failures is to have quorum replication of all data within the database volume itself. Within the volume, we replicate writes to a 4/6 quorum with two copies of data in each of three AZs. For a primary storage failure, you’d need to lose three copies of data (inside the time window it takes us to repair it) to lose write availability and four copies of data to lose read availability. Our design point was to make this vanishingly rare for physical media failures.

    On the question of whether cache survivability is valuable, I guess the only thing I can say is that we implemented it based on feedback from our customers on the pain they have dealt with when having to restart the database for machine restarts or application deadlocks. Like most availability and durability features, it only matters when you need it. We take the position that reducing mean-time-to-failure and mean-time-to-recovery are both important.

    • Sean Hull

      Thx Anurag. When you say “database volume itself” do you mean at the storage layer, like DRBD or something similar?

      • Anurag@AWS

        By database volume, I mean the persistent storage associated with the database. We issue 6 async writes simultaneously and ack upwards when we receive 4 acks. This helps with both jitter and availability. There is a lot that makes sense with chain replication techniques – most notably, use of bandwidth – but my personal view is that they amplify jitter. We’re very focused on IO jitter since it tends to get directly reflected back into SQL execution time, which in turn often gets reflected into user experience.

        • Sean Hull

          Agreed, predictable disk I/O is *VERY* key to database performance. Even for databases where the entire dataset fits in memory, joins & sorts can quickly fill it up. Plus there’s the redo logging.

          With a traditional server, there aren’t all these multi-tenant challenges. It’s the first time I’ve heard the term “IO Jitter” but it makes perfect sense, Anurag. Thx for comments.

          • Anurag@AWS

            It’s not so much about multi-tenancy. It is more about the need to put persistent state across a network so it remains available in the case of a fault on the database server. People do this on-premise as well using network storage arrays. And, you also don’t want all the disks to sit on one storage box since it also can go down. Once you distribute data for fault-tolerance, network and storage node jitter happens so you want mitigation strategies.

            You’re right that they don’t exist (to the same degree) on a single box, but then you end up with all the issues around the blast radius of a failure…

          • Sean Hull

            Indeed.

    • Sean Hull

      And yes I agree warm restarts reduce mean-time-to-recovery for sure.

    • Matt Hurne

      What do you mean by “mean-time-to-failure”? Wouldn’t “reducing mean-time-to-failure” be undesirable?

      • Anurag@AWS

        Sorry, stated stupidly by me. We want to maximize MTTF and minimize MTTR. What I mean by MTTF is the time gap between failures.

        In the example above, we are heavily replicating within Aurora to reduce the need to go to S3 to restore from backup. And the survivable cache is a way to reduce MTTR on a restart (as is the removal of the need for crash recovery itself). These address different faults, to be sure – I was just responding to the thought that MTTR doesn’t matter just because failures are rare.

        • Sean Hull

          Understood.

        • Matt Hurne

          Yes, thanks for the clarification!

  • Небојша Камбер

    I don’t get it, how does this article relate to Mongo? :-/

    • Sean Hull

      perhaps artistic license, euphemism or colorful metaphor?