Five ways to get your data into Redshift

redshift data pipeline

Everybody is hot under the collar this data over Redshift. I heard one customer say, a query that took 10 Hours before now finishes in under a minute. Without modification. When businesses see 600 times speedup, that can change the way they do business.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options, RJ Metrics Pipeline. Setup a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.

rjmetrics table config screen

RJM uses a primary or unique key to figure out which rows have changed. Well that’s not completely true. Only if you’re using incremental refresh. If you’re using complete refresh, then it just selects all the data & replaces it each time.

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. xplenty

xplenty is really a lot more than just a sync tool. It’s a full featured ETL system. Want to avoid writing tons of python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well xplenty is made to allow building ETL systems without code.

xplenty main dashboard

It’s a bit complex to setup at first, but very full featured. It is the DIY developer or DBAs tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

alooma dashboard

Although it’s a bit more complex to setup than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen, means your data team can be very proactive. This is great for the business.

What’s more there is a python based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL, this is a way to do that. Also you can send notifications, or requeue events. All this means you get state of the art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat it’s own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It’s a very viable option. I wrote at length about that solution I’ll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose Tech Blog posted about their project which can move data from postgres to redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you’re using MySQL as source db you’ll need to change code. One thing I found in there was Perl & Sed commands to transform your source schema CREATE & ALTER statements into Redshift compatible ones. That in itself is worth a look.

Lambda to the rescue

The awslabs github team has put together a lambda-based redshift loader. This might be just what you need. Remember thought that’ll you’ll need to deliver your source data in CSV files to S3 on the regular. So you’ll need some method to dump it. But if you have that half of the equation, this is ideal.

Data Migration Serve or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters


Also published on Medium.

  • Akshay Iyengar

    Do you have any thoughts about SymmetricDS?

    • http://www.iheavy.com/blog/ Sean Hull

      I haven’t used it Akshay, but I just added it to my list. I’ll have to try it out.

      • Akshay Iyengar

        Sounds good. Thanks Sean! We’re looking to replicate MySQL data into Redshift. It is likely to be incremental, and we discovered that Data Pipeline might not really be the best tool for incrementals. Most other tools I’ve seen and the ones mentioned here are not free. SymmetricDS is open source, but looks like it needs a lot of config. Was just wondering if you had used it or heard of it. Thanks for the useful article, too!

        • http://www.iheavy.com/blog/ Sean Hull

          I tried to do it with Tungsten. Had a horrible time. Wrote all about it:
          http://www.iheavy.com/2015/06/09/replicate-mysql-to-amazon-redshift-with-tungsten-the-good-the-bad-the-ugly/

          You mention cost. Be careful when weighing service or license costs too heavily. The alternative open source or *FREE* solution comes with care & feeding costs that are very real. It means an engineer needs to be tasked with that, and can’t apply that time to other things. A real cost to the business that should be accounted for. To a business that has better things to do, going with a service like flydata, alooma or xplenty can reduce the headache so you can get on with other things.

          That said there are still really good reasons to go with an open source tool. Visibility & transparency, logging and overall end-to-end control. To me these are important, and might sway me towards something like SymmetricDS if it works as advertised.

          • Akshay Iyengar

            Yup! I’d tried to use Tungsten for a different purpose a year or so ago. Just getting it install and work right was a monumental effort. For our use case, we don’t really need real time binlog replication. It’s going to be an incremental every few months. Minimal maintenance and minimal costs are really the keys here.

          • http://www.iheavy.com/blog/ Sean Hull

            Let’s connect offline. (hullsean@gmail) I’d like to share notes because I plan to install symmetricDS myself.