Is Zero downtime even possible on RDS?


Join 36,000 others and follow Sean Hull on twitter @hullsean.

Oh RDS, you offer such promise, but damn it if the devil isn’t always buried in the details.

Diving into a recent project, I’ve been looking at upgrading RDS MySQL. Major MySQL upgrades can be messy. Since the entire engine is rebuilt, query performance can change, syntax can break, and triggers & stored procedures can run into problems.

That’s not even getting into storage engines. Still have some tables on MyISAM? Beware.

My conclusion: if you want zero downtime, or even nearly zero, you’re going to want to roll your own MySQL on EC2 instances.

Read: Why high availability is so very hard to deliver

1. How long did that upgrade take?

The first thing I set out to do was upgrade a test instance. One of the first questions my client asked: how long did that take? “Ummm… you know, I can’t tell you exactly.” For an engineer this is the worst feeling. We live & die by finding answers. When your hands are tied, you really can’t say what’s going on behind the curtain.

While I’m sitting at the web dashboard, I feel like I’m trying to pick up a needle with thick leather gloves. Nothing to grasp. At one point the dashboard was still spinning, and I was curious what was happening. I logged out and back in again, and found the entire upgrade step had already completed. I think that alone added five minutes to perceived downtime.

Sure, I can look at the RDS instance log and tell you when RDS logged various events. But when did the machine go offline, and when did it return for users? That’s a harder question to answer.

Without command line access, I can’t monitor the process closely and minimize downtime. I can only give you a broad-brush idea of what’s happening.
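The closest you can get is bracketing the window from the RDS event log. Here’s a rough sketch — `aws rds describe-events` is the real CLI call, but the instance name and timestamps are made up for illustration:

```shell
# List recent events for the instance (instance name hypothetical):
#   aws rds describe-events --source-type db-instance \
#       --source-identifier mydbinstance --duration 120
#
# Then subtract the shutdown timestamp from the restart timestamp
# to estimate perceived downtime (GNU date; values illustrative):
offline="2015-06-01T10:02:11Z"   # e.g. a "DB instance shutdown" event
online="2015-06-01T10:13:45Z"    # e.g. a "DB instance restarted" event
start=$(date -u -d "$offline" +%s)
end=$(date -u -d "$online" +%s)
echo "perceived downtime: $(( end - start )) seconds"
```

Even this only tells you when RDS logged the events, not when connections actually started failing for your users.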

Also: RDS or MySQL 10 use cases

2. Did we need to restart the instance?

RDS insists on rebooting the instance itself every time it performs a “Modify” operation. Often restarting just the MySQL process would have been enough! This is like hunting squirrels with a bazooka. Definitely overkill.

As a DBA, it’s frustrating to watch the minutes spin by while your hands are tied. At some point I’m starting to wonder… Why am I even here?
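For comparison, many MySQL server variables are dynamic; with command-line access on your own instance you’d change them with no restart at all, never mind a full reboot (the value below is illustrative):

```sql
-- Dynamic variables take effect immediately, with no process
-- restart, let alone an instance reboot:
SET GLOBAL max_connections = 500;
```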

Related: Howto automate MySQL slow query analysis with Amazon RDS

3. EBS Snapshots are blunt instruments

RDS provides some protection against a failed upgrade. The process will automatically snapshot your volume before it begins. That’s great. If I spend

See also: Is Amazon RDS hard to manage

4. Even promoting a read-replica sucks

I also evaluated using a read-replica. Here you spin up a slave first, then upgrade *THAT* box to 5.6 ahead of your master. While your master is still sending data to the slave, downtime would in theory be minimal: put the master in read-only mode, wait a few seconds for the slave to catch up, switch the application to point at the slave, then promote it!
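With command-line access, that cutover is only a handful of statements. A sketch, assuming a 5.5 master replicating to an already-upgraded 5.6 slave:

```sql
-- On the master: stop taking writes.
SET GLOBAL read_only = ON;

-- On the slave: confirm it has caught up before switching over.
SHOW SLAVE STATUS\G          -- wait for Seconds_Behind_Master = 0

-- Then make the slave a standalone master:
STOP SLAVE;
RESET SLAVE ALL;             -- 5.5.16+; plain RESET SLAVE on older versions

-- Repoint the application, then enable writes on the new master:
SET GLOBAL read_only = OFF;
```

On RDS you can’t issue STOP SLAVE or RESET SLAVE yourself; the only path is the “promote read replica” action, with its forced reboot.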

All of that works well from the command line, since your instances don’t restart. But with RDS, the promotion takes over seven long minutes!

Read this: 5 Reasons to move data to Amazon Redshift

5. RDS can upgrade to MySQL 5.6!

MySQL 5.6 introduced a new timestamp datatype which allows for fractional seconds. Great feature, but it means the on-disk data structures are different. Uh oh!

If you’re replicating from MySQL 5.5 to 5.6, replication can break: rows flow out of the master in the old size and clash with the 5.6-formatted datafiles. Not good.

The solution requires running ALTER commands on the master beforehand, which in turn locks up tables. So promoting a read-replica turns out to be a non-starter for 5.5 to 5.6. It doesn’t really save much.
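For a sense of what that means, the ALTERs involved are full table rebuilds. Even a null ALTER like the one below (table name hypothetical) rewrites every row, and on 5.5 it holds a table lock for the duration:

```sql
-- Rebuild the table so its temporal columns get rewritten; this
-- locks the table while it runs (pre-5.6 ALTER is not online):
ALTER TABLE orders ENGINE=InnoDB;
```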

All of this devil in the details stuff is terrible when you don’t have command line access.

Read: Are SQL databases dead?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Also published on Medium.

  • Justin Swanhart

    There is no such thing as zero downtime. You might as well be a unicorn hunter.

    • Sean Hull

      hehe. Some of my best friends are unicorns, Justin!

    • Ovais Tariq

      Right on, there is no such thing as 0-downtime, RDS or no RDS.

      • Sean Hull

        Yep, that’s for sure.

I guess what I was getting at here is RDS makes it a lot harder to reduce downtime. It just creates more downtime for many ordinary maintenance tasks.

        • Ovais Tariq

I agree with you on that. I have not really been a fan of RDS for many different reasons. But it’s a good fit for small shops.

          • Sean Hull

            well said.

        • MySQL dummy

          Can you provide a bit more detail about “ordinary maintenance tasks” and how they are slow on RDS? A major version upgrade doesn’t seem to be “ordinary maintenance”, IMO.

          • Sean Hull

            Hi MD,

One example: “promoting a slave”. This is AWS terminology for stopping replication. When you do this, you have a standalone instance, hence the name. Stopping a slave & resetting it take about 2 seconds total. AWS will restart the instance, and from the dashboard the process typically takes 5-6 minutes. 180x slower.

Another example: changing from EBS to SSD. This can be done without a reboot too, yet AWS reboots the instance at both ends of the process, adding 10-12 minutes of downtime.

Another example: creating a slave. AWS does this using its native EBS snapshot technology, and you get a bit of semi-downtime as disk I/O slows to a crawl. With your own MySQL instance, you’d use the command line and a tool from Percona called xtrabackup, which lets you create a hot backup of the database with no noticeable downtime at all.


    • Dmitriy Royzenberg

      We are using the following approach to achieve zero-downtime on RDS

  • Someone

Having worked intensely with RDS for quite a while now, I completely agree, except that there are technically a couple of things that don’t require a restart, and the 5.6 upgrade path you describe only applies to instances created before about May 2014.

    Still, if you know anything about running MySQL then roll your own.

    • Sean Hull

Thx. Completely agree. The trouble is a lot of startups are choosing RDS thinking it’s an *easier* solution.

    • Sean Hull

      Yep. These were older RDS instances that encountered this problem.

  • Dmitriy Royzenberg

We are using a Master/Master setup on RDS that allows us to have zero-downtime maintenance. Here are details if you are interested.

    • Sean Hull

      Great post Dmitriy. Impressive. You found a very creative way to do zero downtime on RDS. I commented at the bottom of the medium post.

      My thinking is when we start pushing the boundaries of a technology, using it outside of a supported model, we increase risk. In the case of RDS, it’s an indication of exactly when you should start using roll-your-own MySQL.

      But as they say your mileage may vary. 🙂

      • Dmitriy Royzenberg


        Thank you for your feedback and paying attention to details! I will try to address your concerns.

First, I ABSOLUTELY agree with you that it is far from ideal to have an unsupported configuration. However, as with everything else, you need to weigh the pros and cons. We are not running the M/M configuration on a regular basis, only occasionally, a few times a year, when either AWS announces maintenance or we need to perform one ourselves (e.g. increasing the instance type without downtime).

The M/M is launched literally for a few hours, so we are not concerned with supporting it for the long run. Besides, MySQL M/M (Active/Passive) is a proven solution that numerous companies run on a daily basis, Facebook and the like included. The solution definitely gives us more of the pros of the RDS motorcycle already built by the “revolutionary” AWS RDS team (see how they changed MySQL with Aurora!) compared to building our own bicycle with limited DBA resources. All you need to know is how to navigate the AWS limitations, which essentially places us in control.

Second, regarding auto_increment_offset: this is a standard configuration for M/M to prevent PK conflicts, and it works fine as long as you know what you are doing. We ran M/M for years on Percona servers, and as long as it’s used in Active/Passive mode you are on safe ground. Otherwise, you need to consider Percona or MariaDB clusters, which use Galera replication conflict resolution, and which I don’t believe will work with RDS.

And finally, third: since we are using M/M for just a few hours, we do not configure it for read-only access, as during traffic switchover both M1 and M2 will be Active and taking traffic for a few minutes, as explained in the article. Normally you enable read-only mode when you have a DBA messing with the system over the long run, to avoid the kind of accidental user error you said may happen “eventually”.

Here the process is automated and the configuration is short-lived, designed specifically to perform maintenance without downtime, so we can take advantage of the full scope of RDS features.

        I hope I addressed some of your concerns. Please let me know if you have any other questions!

        • Sean Hull

          Cool stuff Dmitriy! Perhaps I’ll give the method a try and see how it goes.