Is Redshift Outpacing Hadoop as The Big Data Warehouse for Startups?

More and more startups are looking at Redshift as a cheaper & faster solution for big data & analytics.

Saggi Neumann posted a pretty good side-by-side comparison of Redshift & Hadoop and concluded they were tied based on your individual use case.

Meanwhile, Bitly Engineering concluded Redshift was much easier.

Why Should You Choose Redshift Outpacing Hadoop as The Big Data Warehouse for Startups?

Here are the reasons why Redshift is considered much easier for startups:

1. More Agile

One thing pointed out by the bitly blog post, which I’ve seen countless times, is the slow iteration cycle. Write your map-reduce job, run, test, debug, then run on your cluster. Wait for it to return and you might feel like you’re submitting a stack of punched cards. LOL Resolve the errors that come back and then rerun on your cluster. Over & over & over again.

With Redshift you’re writing SQL, so you’re iterating through syntax errors quickly. What’s more, since Redshift is a column-compressed database, you can do full table scans on columns without indexes.

What that means for you and me is that queries just run. And they run blazingly fast!

Also: When hosting data on Amazon turns bloodsport

2. Cheap

Redshift is pretty darn cheap.

Saggi’s article above quotes Redshift at $1000/TB/yr for reserved, and $3700/TB/yr for on-demand. This is compared with a Hadoop cluster at $5000/TB/yr.

But neither will come with the spitting distance of the old world of Oracle, where customers host big iron servers in their own data center, paying north of a million dollars between hardware & license costs. Amazon Cloud FTW!

Related: Did Dropbox have to fail?

3. Even Faster

Airbnb’s nerd’s blog has a post showing it costing 25% of a Hadoop cluster and getting a 5x performance boost. That’s pretty darn impressive!

Flydata has done benchmarks showing a 10x speedup.

Read: Are SQL Databases Dead?

4. SQL Toolchains

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

5. Limitations

O Data Loading

You load data into Redshift using the COPY command. This command reads flat files from S3 and dumps them into tables. It can be extremely fast if you do things in parallel. However, getting your data into those flat files is up to you.

There are a few solutions to this.

Amazon Data Pipeline

This is Amazon’s own toolchain, which allows you to move data from RDS & other Amazon-hosted data sources. The data pipeline does not move data in real-time but in batch. Also, it doesn’t take care of schema changes so you have to do that manually.

I mentioned it in my 5 reasons to move data to Amazon Redshift

Flydata Service

Flydata is a service with a monthly subscription that will connect to your RDS database, and move the data into Redshift. This seems like a no-brainer, and given the heft price tag of thousands per month, you’d expect it to cover your bases.

In my experience, there are a lot of problems & it still requires a lot of administration. When schema changes happen, those have to be carefully applied on Redshift. What’s more, there’s no silver bullet around the datatype differences.

Also: Some thoughts on 12-factor apps

Flydata also makes use of the binary logs to replicate your data. Anything that doesn’t show up in the binary logs is going to cause you trouble. That includes when you do sql_log_bin=0 in the session, an SQL statement includes a no logging hint. Also, watch out for replicate-ignore-db options in your my.cnf.

But it also will fail if you use ON DELETE CASCADE. That’s because these downstream changes happen via Constraints in MySQL. But… drumroll please, Redshift doesn’t support ON DELETE CASCADE. In our case the child tables ended up with extra rows, and some queries broke.

Scripts Such as Donors Choose Loader

Donors Choose has open-sourced their nightly Redshift loader script. It appears to reload all data each night. This will nicely sidestep the ON DELETE CASCADE problem. As you grow though you may quickly hit a point where you can’t load the entire data set each night.

Their script sources are from Postgres, though I’m interested to see if it can be modified for MySQL RDS.

Tried & Failed with Tungsten Replicator

Theoretically, a Tungsten replicator can do the above. What’s more, it seems like a tool custom-made for such a use case. I tried for over a month to troubleshoot. I worked closely with the team to iron out bugs. I wrote Wrestling with Bears or How I Tamed Tungsten Replicator for MySQL and then I wrote a second article Tungsten Replicator the Good the Bad & the Ugly. Ultimately, I did get some data moving between MySQL RDS & Redshift, however, certain data clogged the system & it wouldn’t work for any length of time.

Also: Secrets of a happy Amazon hacker or how to lock down your account with IAM and multi-factor authentication

Data Types & Character Sets

There are a few things here to keep in mind. Redshift counts bytes, so if in MySQL or some other database you had a varchar(5) it may be varchar(20) in Redshift. Even then I had cases where it still didn’t fit & I had to make the field bigger by 4.

I also ran into problems around string character encodings. According to the docs Redshift handles 4-byte UTF-8.

Redshift doesn’t support ARRAYs, BIT, BYTEA, DATE/TIME, ENUM, JSON, and a bunch of others. So don’t go into it expecting full Postgres support.

What you will get are multibyte characters, numeric, characters, DateTime, boolean, and some type conversion.

Also: Is the difference between dev & ops a four-letter word?

O Rebalancing

If and when you want to add nodes, expect some downtime. Yes theoretically the database is online while it’s shipping data to the new nodes & redistributing things, the latency can start to feel like an outage. What’s more, it can easily push into the hours to do.

Conclusion

In the ongoing debate about whether Amazon Redshift is outpacing Hadoop as the preferred big data warehouse for startups, the answer largely depends on individual requirements and priorities. However, Hadoop’s customizability and comprehensive ecosystem still hold value for certain use cases. We hope you’ve found this guide helpful in learning about them.