Category Archives: Data

Is there a new better way to build a data warehouse in 2016?

redshift warehouse

In the old days… the bygone days of 2005 🙂 That was when you’d pony up for an Oracle license, get the hardware, and build your warehouse. Somewhere along the way you crossed your fingers.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Today everybody wants to treat data as a product. And for good reason. Knowing how to better server your customers & iterate more quickly is essential in todays hypercompetitive startup world.

1. Amazon Redshift enters the fray

Recently I’ve been wondering why is everyone suddenly talking about Amazon Redshift?? I ask not because recruiters are experts at database technology & predicting the industry trends, but rather because they have their finger on the pulse of what firms are doing.

Amazon launched Redshift in early 2013 using ParAccel technology. Adoption has been quick. Customers who already have their data in the AWS ecosystem find the offering a perfect match for their data analytics needs. And with stories swirling around of 10 hour MySQL reports running in under 60 seconds on Redshift, it’s no wonder.

Also: Is AWS too complex for small dev teams?

2. Old method – select carefully

Ralph Kimball’s opus having fully digested, you set out to meet with stakeholders, and figure out what you were building.

Of course no one understood your questions, and business units & engineering teams spoke english & french. Months went by, and things devolved. Morale got squashed. Eventually out the other end something would be built, nobody would be happy, and eyeballs would roll over the dollars spent.

This model was known in the data warehousing world by the wonderful acronym ETL which is short for extract, transform & load. The transform part happens before you load it. So that your warehouse is a shining, trimmed & manicured copy of your data, ready for reporting.

Also: Is Amazon too big to fail?

3. Today – mirror everything & then build views

Today you’re more likely to see the ELT model employed. That is Extract, Load & Transform. A subtle change, with big differences. When you load first, you mirror all of your transactional data into your warehouse, then build views or new summary tables to fit your ongoing needs.

Customers are using tools like Looker & Tableau to layer on top of these ELT warehouses which are also have some intelligence around the transform piece. This makes the process more self serve for business units, and requires less back & forth between engineering & product teams. No more waiting a few days for a report to be built, because these non-technical teams can build for themselves.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

4. Pipeline services

So you’re going down the ELT path, but how do get your data into Redshift? I wrote Five ways to get data into Redshift to answer that question.

There are a number of service based offerings from the point & click Fivetran to the more full featured Alooma. And then RJ Metrics & Flydata also fit the bill. You may also want to build your own with xplenty that also has a lot of ELT ETL logic you can build without code. Pretty spiffy.

Read: Is aws a patient that needs constant medication?

5. Reporting databases

We’ll be covering a lot lot more in this space, so check back.

Related: Does Amazon eat it’s own dogfood?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Five ways to get your data into Redshift

redshift data pipeline

Everybody is hot under the collar this data over Redshift. I heard one customer say, a query that took 10 Hours before now finishes in under a minute. Without modification. When businesses see 600 times speedup, that can change the way they do business.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options, RJ Metrics Pipeline. Setup a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.

rjmetrics table config screen

RJM uses a primary or unique key to figure out which rows have changed. Well that’s not completely true. Only if you’re using incremental refresh. If you’re using complete refresh, then it just selects all the data & replaces it each time.

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. xplenty

xplenty is really a lot more than just a sync tool. It’s a full featured ETL system. Want to avoid writing tons of python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well xplenty is made to allow building ETL systems without code.

xplenty main dashboard

It’s a bit complex to setup at first, but very full featured. It is the DIY developer or DBAs tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

alooma dashboard

Although it’s a bit more complex to setup than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen, means your data team can be very proactive. This is great for the business.

What’s more there is a python based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL, this is a way to do that. Also you can send notifications, or requeue events. All this means you get state of the art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat it’s own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It’s a very viable option. I wrote at length about that solution I’ll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose Tech Blog posted about their project which can move data from postgres to redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you’re using MySQL as source db you’ll need to change code. One thing I found in there was Perl & Sed commands to transform your source schema CREATE & ALTER statements into Redshift compatible ones. That in itself is worth a look.

Lambda to the rescue

The awslabs github team has put together a lambda-based redshift loader. This might be just what you need. Remember thought that’ll you’ll need to deliver your source data in CSV files to S3 on the regular. So you’ll need some method to dump it. But if you have that half of the equation, this is ideal.

Data Migration Serve or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is data your dirty little secret?

data comparison cloud

While I was fumbling for the dictionary to figure out what polyglot persistence was, the CTO had decided to build a warehouse on Redshift.

“Everybody’s moving data there.” He declared. I looked on quizzically.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

“That’s a very new database engine”, I chimed in. “Lets do some testing”.

And there began an adventurous ride into the bleeding edge of Amazon’s new data service offerings!

1. The data scientist comes crying

Are our transactional database we were using Amazon RDS for MySQL. It’s a great managed service that eliminates some of the headaches of doing it yourself. I wrote about thisRDS or MySQL Use Cases.

We needed some way to get data over to Redshift. We evaluated AWS Data Pipeline, but it wasn’t realtime enough. In a pinch we decided on a service called Flydata. After weeks of effort to get it setup & administered we had it running smoothly.

I since discovered some pipelining solutions dedicated to Redshift such as Alooma – modern data plumbing, RJMetrics pipeline and Domo. I *did not* manage to get Tungsten working. It supports redshift on paper, but has a lot of growing up to do.

Until one data when the data scientist shows up at my desk. “We have problems in our data on redshift.”. I look back confused. “Are you sure? Can you tell me where you’re seeing that?” I respond.

Also: When hosting data on Amazon turns bloodsport

2. Deleted data reappears in Redshift!

He sends me over some queries, that I rerun myself. I see extra data in Redshift too, data that had been deleted in MySQL. Strange. We dig deeper together trying to figure out what’s happening.

We find that the tables with extra data are child tables of a parent where data was deleted. Imagine Citibank deletes a customer, they also want to delete the records for monthly bills. Otherwise those will just be hanging around, and won’t match up anymore with a parent record for a customer. In real life Citibank probably doesn’t delete like this but it’s a helpful example.

The first thing I do is open a ticket with Flydata. After all we hadn’t gotten any errors logged. Things *must* be running correctly.

After highlighting the severity of the issue, we setup a conference call with Flydata. Digging further they discover the problem. Child table data can’t get deleted on Redshift, because it doesn’t support ON DELETE CASCADE. Wait what?

Turns out Flydata makes use of the MySQL transaction log to move data. In mysql to mysql replication this works fine because downstream you also have MySQL. It also implements on delete cascade so those child records will get cleaned up correctly. Since Redshift doesn’t have this, there’s no way for Flydata to instruct Redshift what to do. Again I said, wait what?

My surprise wasn’t that a new unproven technology like Redshift had a lot of holes & missing features. My surprise was that Flydata was just silently ignoring the problem. No logged messages to tell the customer about inconsistencies. At least?

Related: Is Amazon too big to fail?

3. The problem – comparing data

As you might imagine, this is a terrible way to find out about data problems. As the person tasked with moving data between these systems, eyes were on me. My thought was, we chose a service-based solution, so they’re manage data movement. If there’s a problem, they’ll surely alert us.

From there the conversation became, ok, how do we figure out where all these data differences are? Is it widespread or isolated to a few tables? Can we mitigate it with changed queries? Cleanup on a daily basis? These are some questions that’ll immediately come to mind.

To answer them we needed a way to compare table data across different databases. This is hard to do within a homogenous environment where server versions & datatypes are likely to be the same. It is much more complicated when you’re working across heterogenous systems.

Read: 5 Reasons to move data to amazon redshift

4. Build some way to spot check data

Although this still doesn’t seem a solved problem, there are some tools. One way is to perform checksums on tables & rows. These can then be compared to find differences.

This drew me to find Jason Friedman’s
table hash script on Github. It can work across MySQL, Postgres & redshift. Pretty cool stuff if you ask me.

One problem remains. Databases are always in flux. As such you may find discrepancies based on data that hasn’t been moved yet. Data that’s just changed in the last few minutes.

If you refresh data nightly, you may for example be able to stop a slave to compare data at an instant in time.

Also: Is Redshift outpacing hadoop as the warehouse for startups?

5. The mentality: treat data as a product & monitor

Solving tough problems like these is a work in progress. What it taught me is that:

You should own your data pipeline

This allows you to be vigilant about monitoring, throw errors if data is different, and ultimately treat data as a product. Owning the pipeline will mean you can build monitoring around stages, and automate spot checks on your data.

You won’t get it perfect, but you want to know when it isn’t.

Also: 5 core pieces of the Amazon puzzle to get your project off the ground

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is Redshift outpacing Hadoop as the big data warehouse for startups?

redshift hadoop killer

More and more startups are looking at Redshift as a cheaper & faster solution for big data & analytics.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Saggi Neumann posted a pretty good side-by-side comparison of Redshift & Hadoop and concluded they were tied based on your individual use case.

Meanwhile Bitly engineering concluded Redshift was much easier.

1. More agile

One thing pointed out by the bitly blog post, which I’ve seen countless times, is the slow iteration cycle. Write your map-reduce job, run, test, debug, then run on your cluster. Wait for it to return and you might feel like you’re submitting a stack of punched cards. LOL Resolve the errors that come back and then rerun on your cluster. Over & over & over again.

With Redshift you’re writing SQL, so your iterating through syntax errors quickly. What’s more since Redshift is a column-compressed database, you can do full table scans on columns without indexes.

What that means for you and me is that queries just run. And they run blazingly fast!

Also: When hosting data on Amazon turns bloodsport

2. Cheap

Redshift is pretty darn cheap.

Saggi’s article above quotes Redshift at $1000/TB/yr for reserved, and $3700/TB/yr for on-demand. This compared with a hadoop cluster at $5000/TB/yr.

But neither will come with spitting distance of the old-world of Oracle, where customers host big iron servers in their own datacenter, paying north of a million dollars between hardware & license costs. Amazon cloud FTW!

Related: Did dropbox have to fail?

3. Even faster

Airbnb’s nerds blog has a post showing it costing 25% of a Hadoop cluster, and getting 5x performance boost. That’s pretty darn impressive!

Flydata has done benchmarks showing 10x speedup.

Read: Are SQL Databases dead?

4. SQL Toolchains

***

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

5. Limitations

o data loading

You load data into Redshift using the COPY command. This command reads flat files from S3 and dumps them into tables. It can be extremely fast if you do things in parallel. However getting your data into those flat files is up to you.

There are a few solutions to this.

– amazon data pipeline

This is Amazon’s own toolchain, which allows you to move data from RDS & other Amazon hosted data sources. Data pipeline does not move data realtime, but in batch. Also it doesn’t take care of schema changes so you have to do that manually.

I mentioned it in my 5 reasons to move data to Amazon Redshift

– Flydata service

Flydata is a service with a monthly subscription which will connect to your RDS database, and move the data into Redshift. This seems like a no brainer, and given the heft pricetag of thousands per month, you’d expect it to cover your bases.

In my experience there are a lot of problems & it still required a lot of administration. When schema changes happen, those have to be carefully applied on Redshift. What’s more there’s no silver bullet around the datatype differences.

Also: Some thoughts on 12 factor apps

Flydata also makes use of the binary logs to replicate your data. Anything that doesn’t show up in the binary logs is going to cause you trouble. That includes when you do sql_log_bin=0 in the session, an SQL statement includes a no logging hint. Also watch out for replicate-ignore-db options in your my.cnf. But it also will fail if you use ON DELETE CASCADE. That’s because these downstream changes happen via Constraint in MySQL. But… drumroll please, Redshift doesn’t support ON DELETE CASCADE. In our case the child tables ended up with extra rows, and some queries broke.

– scripts such as Donors choose loader

Donors Choose has open sourced their nightly Redshift loader script. It appears to reload all data each night. This will nicely sidestep the ON DELETE CASCADE problem. As you grow though you may quickly hit a point where you can’t load the entire data set each night.

Their script sources from Postgres, though I’m interested to see if it can be modified for MySQL RDS.

– Tried & failed with Tungsten replicator

Theoretically Tungsten replicator can do the above. What’s more it seems like a tool custom made for such a use case. I tried for over a month to troubleshoot. I worked closely with the team to iron out bugs. I wrote wrestling with bears or how I tamed Tungsten replicator for MySQL and then I wrote a second article Tungsten replicator the good the bad & the ugly. Ultimately I did get some data moving between MySQL RDS & Redshift, however certain data clogged the system & it wouldn’t work for any length of time.

Also: Secrets of a happy Amazon hacker or how to lock down your account with IAM and multi-factor authentication

o data types & character sets

There are a few things here to keep in mind. Redshift counts bytes, so if in mysql or some other database you had a varchar(5) it may be varchar(20) in Redshift. Even then I had cases where it still didn’t fit & I had to make the field bigger by 4.

I also ran into problems around string character encodings. According to the docs Redshift handles 4-byte UTF-8.

Redshift doesn’t support ARRAYs, BIT, BYTEA, DATE/TIME, ENUM, JSON and a bunch of others. So don’t go into it expecting full Postgres support.

What you will get are multibyte characters, numeric, character, datetime, boolean and some type conversion.

Also: Is the difference between dev & ops a four-letter word?

o rebalancing

If and when you want to add nodes, expect some downtime. Yes theoretically the database is online while it’s shipping data to the new nodes & redistributing things, the latency can start to feel like an outage. What’s more it can easily push into the hours to do.

Also: Is AWS enabling startups which enable AngelList Syndicates to boil the VC business?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Why I like Etsy’s site performance report

etsy code as craft

Etsy publishes a great tech blog titled Code As Craft.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

I was recently sifting through some of their newer posts & stumbled upon their Q2 2015 Site Performance Report. It’s really in-depth, though not impossibly technical. Here’s what I liked.

1. Transparency to business & public

Show real performance to customers

The first thing I thought while reading, is the strong show of transparency. The blog is public, so it’s not just an internally facing document that shares with the company, but sharing with the wider world. True, presented as a technical post it may only appeal to a segment of readers, but it’s great none the less.

Show real performance to non-technical business units

I think this kind of analysis & summary also provides transparency to the business itself. Product teams, business operations & sales teams can all view what’s happening. Where are there problems? What is being done to address them?

Also: When hosting data on Amazon turns bloodsport

2. Highlighting change

Added pagination to the cart

One thing that popped out, was the discussion of pagination changes, that impacted page load times in the shopping cart. Page load times in the shopping cart are particularly crucial, because that’s where customers can “abandon” an order out of frustration.

Illustrating performance impact to product decisions

When product is evaluating that new feature, and they can see how changes affect performance, it better *sells* what all those engineering resources are being used for.

Related: 5 reasons to move data to amazon redshift

3. Where we don’t have data

We can’t analyze what data we haven’t captured

The report highlights that data around the shopping cart is new. That’s great because it highlights what the value collecting data offers, by providing new insights that were not available previously. This also pushes for more metrics collection & analysis as the business begins to see the value of all of this gymnastics.

Read: Is Amazon too big to fail?

4. Product tradeoffs

The discussion around the shopping cart performance also illustrates how the business makes product decisions. The engineering team can only build & write so much code. Deciding to spend time on pagination, means time not spent on some other new feature. Which is more valuable? Selling new feature A in one corner of the product, that customers may spend real money on? Or speeding up page load times on page B?

Also: Is Apple betting against big data?

5. Cleaner data

At a Look & Tell event, I heard Lincoln Ritter talk about Data as a product to the business.

When you expose a performance report like this to the business, an iterative process begins to happen. The company gains insight from the report, makes better decisions, and thus can spend more energy time & resources on clean data. Cleaner data in term means better reports, which produce better decisions & so on.

Also: What is venue analytics & why is it important?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters