Why does MySQL replication fail?

When considering active-active multi-master, you must consider it’s foundation technology. Although MySQL replication is straightforward to setup, it can fail in a myriad of ways. Most of those are known and well understood. We can solve them only if we use the technology in the standard way.

Click through to the end for multi-master solutions that work with MySQL.

Reason 2 – Replication is brittle to start with

- statement based – throw it over the fence architecture
- use non-deterministic functions causes unpredictable behavior
- no globally unique transaction id (like Oracle SCN – system change number)
- no built in checksums – pt-table-checksum is a must
- replication position syncing to index files prone to breaking
- temp tables disappear after restart
- row-based still a new code path – doesn’t support zero downtime
- even with row-based mysql can fall back to statement
- row-based does not include SQL in binlogs
- MySQL replica slaves die frequently, require re-clone from authoritative master

NEXT: Reason 3 – MySQL row-based replication has limitations

PREV: Reason 1 – auto increment settings create new problems

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

  • Pingback: 10 reaons active-active is hard and how to solve it

  • http://www.iheavy.com/blog/ Sean Hull

    Friend & colleague Morgan Tocker had some very good commentary. Thx Morgan!


    Hi Sean,

    I wanted to write to you personally so I don’t seem negative in comments. I’d love to see some MySQL 5.6 mentions :D Some of these features get a lot better.

    RE: http://www.iheavy.com/2013/04/16/why-does-mysql-replication-fail/

    * No globally unique transaction id (like Oracle SCN – system change number)
    –> There is from MySQL 5.6.

    * No built in checksums – pt-table-checksum is a must
    –> There is from 5.6, but pt-table-checksum solves a different need and is still required. You could change this point to “no built in consistency checking”.

    * temp tables disappear after restart
    –> This applies only to statement based. Row based doesn’t replicate temp tables. And you are correct, if you modify data via temp tables – don’t use statement based.

    * row-based still a new code path – doesn’t support zero downtime
    –> You might mean GTIDs? Row-based is a 5.1 feature, it’s very well supported: http://www.tocker.ca/2013/09/04/row-based-replication.html

    * even with row-based mysql can fall back to statement
    –> This is the case for DDL, but it’s fairly deterministic in how it works. I don’t recommend MIXED myself for this reason, because the behaviors are different (ROW: triggers execute on master, STATEMENT: triggers execute on slave).

    * row-based does not include SQL in binlogs
    –> Available in 5.6 as informational events.

    * MySQL replica slaves die frequently, require re-clone from authoritative master
    –> Fixed in 5.6. Slaves can be made crash-safe (Facebook runs crash safe for example, I also recommend it: http://www.tocker.ca/2013/06/19/deciding-whether-or-not-to-make-mysql-durable.html )

    RE: http://www.iheavy.com/2013/04/16/limitations-of-mysql-row-based-replication/

    * doesn’t write SQL to the binlogs – useful for troubleshooting
    –> Mentioned from previous post, can in 5.6.

    * harder to do point-in-time recovery without SQL in binlogs
    –> Will actually be better in 5.7 with idempotent replay: http://geek.rohitkalhans.com/2013/05/mysqlbinlog-idemmpotent-mode.html

    * harder to do online schema changes by switching masters
    –> The binlog_format is a session variable, pt-online-schema-change changes to STATEMENT itself for this to work. The specific issue is “not possible to do online schema change with a tiered replication architecture” since the slave in the middle breaks things.

    - Morgan