Is data your dirty little secret?

data comparison cloud

While I was fumbling for the dictionary to figure out what polyglot persistence was, the CTO had decided to build a warehouse on Redshift.

“Everybody’s moving data there.” He declared. I looked on quizzically.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

“That’s a very new database engine”, I chimed in. “Lets do some testing”.

And there began an adventurous ride into the bleeding edge of Amazon’s new data service offerings!

1. The data scientist comes crying

Are our transactional database we were using Amazon RDS for MySQL. It’s a great managed service that eliminates some of the headaches of doing it yourself. I wrote about thisRDS or MySQL Use Cases.

We needed some way to get data over to Redshift. We evaluated AWS Data Pipeline, but it wasn’t realtime enough. In a pinch we decided on a service called Flydata. After weeks of effort to get it setup & administered we had it running smoothly.

I since discovered some pipelining solutions dedicated to Redshift such as Alooma – modern data plumbing, RJMetrics pipeline and Domo. I *did not* manage to get Tungsten working. It supports redshift on paper, but has a lot of growing up to do.

Until one data when the data scientist shows up at my desk. “We have problems in our data on redshift.”. I look back confused. “Are you sure? Can you tell me where you’re seeing that?” I respond.

Also: When hosting data on Amazon turns bloodsport

2. Deleted data reappears in Redshift!

He sends me over some queries, that I rerun myself. I see extra data in Redshift too, data that had been deleted in MySQL. Strange. We dig deeper together trying to figure out what’s happening.

We find that the tables with extra data are child tables of a parent where data was deleted. Imagine Citibank deletes a customer, they also want to delete the records for monthly bills. Otherwise those will just be hanging around, and won’t match up anymore with a parent record for a customer. In real life Citibank probably doesn’t delete like this but it’s a helpful example.

The first thing I do is open a ticket with Flydata. After all we hadn’t gotten any errors logged. Things *must* be running correctly.

After highlighting the severity of the issue, we setup a conference call with Flydata. Digging further they discover the problem. Child table data can’t get deleted on Redshift, because it doesn’t support ON DELETE CASCADE. Wait what?

Turns out Flydata makes use of the MySQL transaction log to move data. In mysql to mysql replication this works fine because downstream you also have MySQL. It also implements on delete cascade so those child records will get cleaned up correctly. Since Redshift doesn’t have this, there’s no way for Flydata to instruct Redshift what to do. Again I said, wait what?

My surprise wasn’t that a new unproven technology like Redshift had a lot of holes & missing features. My surprise was that Flydata was just silently ignoring the problem. No logged messages to tell the customer about inconsistencies. At least?

Related: Is Amazon too big to fail?

3. The problem – comparing data

As you might imagine, this is a terrible way to find out about data problems. As the person tasked with moving data between these systems, eyes were on me. My thought was, we chose a service-based solution, so they’re manage data movement. If there’s a problem, they’ll surely alert us.

From there the conversation became, ok, how do we figure out where all these data differences are? Is it widespread or isolated to a few tables? Can we mitigate it with changed queries? Cleanup on a daily basis? These are some questions that’ll immediately come to mind.

To answer them we needed a way to compare table data across different databases. This is hard to do within a homogenous environment where server versions & datatypes are likely to be the same. It is much more complicated when you’re working across heterogenous systems.

Read: 5 Reasons to move data to amazon redshift

4. Build some way to spot check data

Although this still doesn’t seem a solved problem, there are some tools. One way is to perform checksums on tables & rows. These can then be compared to find differences.

This drew me to find Jason Friedman’s
table hash script on Github. It can work across MySQL, Postgres & redshift. Pretty cool stuff if you ask me.

One problem remains. Databases are always in flux. As such you may find discrepancies based on data that hasn’t been moved yet. Data that’s just changed in the last few minutes.

If you refresh data nightly, you may for example be able to stop a slave to compare data at an instant in time.

Also: Is Redshift outpacing hadoop as the warehouse for startups?

5. The mentality: treat data as a product & monitor

Solving tough problems like these is a work in progress. What it taught me is that:

You should own your data pipeline

This allows you to be vigilant about monitoring, throw errors if data is different, and ultimately treat data as a product. Owning the pipeline will mean you can build monitoring around stages, and automate spot checks on your data.

You won’t get it perfect, but you want to know when it isn’t.

Also: 5 core pieces of the Amazon puzzle to get your project off the ground

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

  • Alooma

    Thanks for the mention – we are here for any questions :)

    • Sean Hull

      You got it. Do you guys do any webinars or have youtube lectures online?