Category Archives: Data

How to interview an Amazon database expert


Amazon releases a new database offering every other day. It sure isn’t easy to keep up.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

Let's say you're hiring a devops engineer and want to suss out their database knowledge. Or maybe you're hiring a professional services firm or a freelance consultant. Whatever the case, you'll need to sift through candidates for the best people. Here's how.

Also: How to interview an AWS expert

What database does Amazon support for caching?

Caching is a popular way to speed up access to your backend database. Put Amazon's ElastiCache between your webserver and your database, and you can reduce load on your database by 90%. Nice!

The two engines Amazon supports are Memcached & Redis. Memcached is historically more popular. These days Redis seems a clear winner. It's faster, and can maintain your cached data between restarts. That will save you, I promise!

Also: Is AWS too complex for small dev teams?

How can I store big data in AWS?

Amazon’s data warehouse offering is called Redshift. I wrote Why is everyone suddenly talking about Redshift?. Why indeed!

When you’re doing large reports for your business intelligence team, you don’t want to bog down your backend relational database. Redshift is purpose built for this use case.

I've seen a report that took over 8 hours in MySQL return in under 60 seconds in Redshift!

A new offering is Amazon Redshift Spectrum. This tech is super cool. Load up all your data into S3, in standard CSV format. Then without even loading it into Redshift, you can query the S3 data directly. This is super useful. Firstly because S3 is 1/10th the price. But also because it allows you to stage your data before loading it into Redshift itself. Goodbye Google BigQuery! I talked about Spectrum here.
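To make that concrete, here's roughly what a Spectrum query over CSV files in S3 looks like. This is a minimal sketch with made-up names: you'd first need an external schema pointed at a data catalog, plus an IAM role that can read the bucket.

CREATE EXTERNAL TABLE spectrum.page_views (
  user_id    BIGINT,
  url        VARCHAR(2048),
  viewed_at  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/page_views/';

-- The rows never leave S3, but you can query them like any other table,
-- or join them against data already loaded into Redshift.
SELECT date_trunc('day', viewed_at) AS day, count(*) AS views
FROM spectrum.page_views
GROUP BY 1
ORDER BY 1;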

Related: Which engineering roles are in greatest demand?

What relational database options are there on Amazon?

Amazon supports a number of options through its Relational Database Service or RDS. These are managed databases, which means less work on your DBA's shoulders. It may also make upgrades slower and harder, with more downtime, but you get what you pay for.

There are a lot of platforms available. As you might guess, MySQL & Postgres are there. Great! Even better, you can use MariaDB if that's your favorite. You can also go with Aurora, which is Amazon's own home-brew drop-in replacement for MySQL that promises greater durability and some speedups.

If you’re a glutton for punishment, you can even get Oracle & SQL Server working on RDS. Very nice!

Read: Can on-demand consulting save startups time & money?

Does AWS have a NoSQL database solution?

If NoSQL is to your taste, Amazon has DynamoDB. I haven't seen a lot of large production applications using it, but the concerns I've read about it make a lot of sense. The way Amazon scales nodes & data I/O is bound to run into real performance problems.

That said it can be a great way to get you up and running quickly.

Read: Can on-demand consulting save startups time & money?

How do I do ETL & migrate data to AWS?

Let's be honest, Amazon wants to make this really easy. The quicker & simpler it is to get your data there, the more you'll buy!

Amazon's Database Migration Service or DMS allows you to configure your old database as a data source, then choose an Amazon db solution as the destination, then just turn on the spigot and pump your data in!

ETL is extract, transform and load, data warehouse terminology for slicing and dicing data before you load it into your warehouse. Many of today's warehouses are being built with the data lake model, because databases like Redshift have gotten so damn fast. That model means you stage all your source data as-is in your warehouse, then build views & summary tables as needed to speed up queries & reports. Even better, you might look at a tool like Xplenty.
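In practice that "build views & summary tables" step is usually just SQL layered over the raw staged tables. A hand-wavy sketch, with hypothetical schema and column names:

-- Source tables land untouched in a staging schema; views reshape them for reporting.
CREATE VIEW analytics.active_users AS
SELECT u.id,
       u.email,
       max(e.created_at) AS last_seen
FROM staging.users u
JOIN staging.events e ON e.user_id = u.id
WHERE e.created_at > dateadd(day, -30, current_date)
GROUP BY u.id, u.email;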

Amazon's new offering is called Glue. See also Five ways to get data into Amazon Redshift. This solution is purpose-built for creating a powerful data pipeline, complete with Python code to do transformations.

Read: Is data your dirty little secret?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

What does the fight between Palantir & the NYPD mean for your data?


In a recent BuzzFeed piece, the NYPD goes to the mat with Palantir over their data. It seems the NYPD has recently gotten cold feet.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

As they explored options, they found an alternative that might save them a boatload of money. They considered switching to an IBM alternative called Cobalt.

And I mean this is Silicon Valley, what could go wrong?

Related: Will SQL just die already?

Who owns your data?

In the case of Palantir, they claim to be an open system. And of course this is good marketing. Essential in fact to get the contract. Promise that it's easy to switch. Don't dig too deep into the technical details there. According to the article, a Palantir spokesperson claims:

“Palantir is an open platform. As with all our customers, their data & analysis are available to them at all times in an open & nonproprietary format.”

And that does appear to be true. What appears to be troubling the NYPD isn't that they can't get the analysis, for that's available to them in perpetuity, within the Palantir system. But getting access to how the analysis is done, well now that's the secret sauce. Palantir of course is not going to let go of that.

And that’s the devil in the details when you want to switch to a competing service.

Also: Top serverless interview questions for hiring aws lambda experts

Who owns the algorithms?

Although the NYPD can get their data into & out of the Palantir system easily, that’s just referring to the raw data. That’s the data they ingested in the first place, arrest records, license plate reads, parking tickets, stuff like that.

“This notion of how portable your data is when you engage in a contract with a platform is really, really complex, and hasn’t really been tested” – Tal Klein

Palantir's secret sauce, their intellectual property, is finding the needle in the haystack. What pieces of data are relevant, & how can they present the detectives with the right information at the right time?

Analysis *is* the algorithms. It’s the big data 64 million dollar question. Or in this case $3.5 million per year, as the contract is reported to be worth!

Related: Which engineering roles are in greatest demand?

The nature of software as a service

The web is bringing us great platforms, like the Google & Amazon clouds. It's bringing a myriad of AI solutions to our fingertips. Palantir is providing a push-button solution to those in need of insights, like the NYPD.

The Cobalt solution that IBM is offering goes the other way. Build it yourself, manage it, and crucially control it. And that’s the difference.

It remains to be seen how the rush to migrate the universe of computing to Amazon's own cloud will settle out. Right now they're in a growth phase, so it's all about lowering prices. But at some point their market muscle will mean they can go the Oracle route a la Larry Ellison. That's when customers start feeling the squeeze.

If the NYPD example is any indication, it could get ugly!

Read: Can on-demand consulting save startups time & money?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

A roughneck walk down database alley


I was just responding to some Disqus comments on a recent blog post. Admittedly it had a provocative title Will SQL databases just die already. What do you think?

Join 34,000 others and follow Sean Hull on twitter @hullsean.

A reader pointed out that some No-SQL databases do support joins. Huh? My face contorts in total confusion. How? Why?

For years SQL was misunderstood & unloved

Relational databases have been around for decades. 43 years according to the StackExchange article. That's a lot of years. I've spent a few years as a DBA, aka database administrator. The role can be distilled down to a herder of sorts. Keep all the data bits in the right boxes with the right labels. A digital librarian, who makes sure the books don't get lost.

Of course patrons don't always put books back where they should, and strict rules get put in place to avoid losing that one volume of Shakespeare in miles of shelves.

In the fast moving world of web + mobile, product is king, and agile rules the day. And anything that can make us more agile also wins. SQL, much maligned & misunderstood, was not one of those things.

Also: Top serverless interview questions for hiring aws lambda experts

NoSQL burst on the scene with much fanfare

With all that pressure, it was no wonder engineers thought, there must be a better way. Then along comes the NoSQL database. I mean just the name speaks volumes about the design goal.

We’ll sacrifice anything, just please don’t make me write SQL!

The promises…

1. Never have to deal with pesky SQL that we don’t understand!
2. Interact with the database like any other data structure in our code!
3. Be schemaless! Crotchety Database Administrators be damned!
4. Be distributed. Be everywhere consistent! Be indistinguishable from magic!
5. Always be fast.

In that rush into the abyss, we lost track of durability. And down the rabbithole we went!

Related: Which engineering roles are in greatest demand?

Relational databases tried to be key-value

Then I started hearing about crazy things, like MySQL providing a memcached plugin, so you could use it as a lightning-fast key-value store. You could sidestep that pesky SQL engine, and get right down to the bare metal. But why? Memcached & Redis were already doing that, and were purpose built for it. Why indeed?

I started to argue maybe we shouldn’t be muddying the waters. I mean stick with what you know!

Read: Can on-demand consulting save startups time & money?

War was won, success declared

Around this time, I think, was when MongoDB was declaring the war won. We had finally left SQL databases in the dustheap of history. It may or may not have inspired this popular youtube skit…

Also: 30 questions to ask a serverless fanboy

Meanwhile Hadoop is losing ground. BigQuery & Redshift both speak SQL

But then something funny started to happen. It seemed there was a backlash against MongoDB. A lot of customers were losing data. (Yep, that's what durability means, guys…) And the hype started reversing. Even the mighty Hadoop has been losing popularity of late. How long does it take to write an EMR job versus a SQL query? Let's be honest.

I asked myself, Is Hadoop losing ground to SQL warehouses like Redshift & BigQuery? I wonder.

Also: What can startups learn from the DYN DNS outage?

NoSQL databases are looking for JOINs?

Recently I bumped into some interesting blog comments & discussions about how OrientDB was trying to add joins to their product.

As certain relational databases try to become NoSQL databases, other NoSQL databases are trying to add more complex SQL, because, well, somehow their product is missing something.

Also: What can startups learn from the DYN DNS outage?

Engineering truth versus fashion

43 years is a lot of years. And when we drop all the fashion trends in tech, and the new database du jour, what do we find?

There is room for NoSQL databases. Yep. And they do certain things, and solve certain types of problems, well. But they're not general workhorses, nor can they slice and dice your data however you like. And when you get to that point in your project, you're going to want to ask interesting questions of your data.

And surely that’s where SQL excels. It ain’t going anywhere, folks!

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Will SQL just die already?

With tons of new NoSQL database offerings every day, developers & architects have a lot of options. Cassandra, MongoDB, CouchDB, DynamoDB & Firebase to name a few.

Join 33,000 others and follow Sean Hull on twitter @hullsean.

What’s more in the data warehouse space, you have Hadoop, which can churn through terabytes of data and get you results back before lunchtime!

So when I stumbled on this article SQL is 43 years old, I was intrigued.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But if later on you want to ask questions like, which users watched this video, which users are active, which users spent $100 in January? These questions may not be answerable, because NoSQL databases can't join those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?
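That "who spent $100 in January" question, for example, is a couple of lines of SQL once you can join tables. The table and column names here are hypothetical:

SELECT u.id, u.email, sum(o.amount) AS january_spend
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.created_at >= '2017-01-01'
  AND o.created_at <  '2017-02-01'
GROUP BY u.id, u.email
HAVING sum(o.amount) >= 100;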

Also: Top serverless interview questions for hiring aws lambda experts

Big Query, Redshift & even Hive speak SQL

I wrote that despite Hadoop's recent popularity, Redshift seems to be eating its lunch. And what would you know, surprise surprise, Amazon's newish data warehousing solution speaks SQL! What's more there's Apache Hive, which allows you to query Hadoop with, drumroll please… SQL!

BigQuery is the other major big data offering, from none other than Google. And it too uses SQL!

Related: Which engineering roles are in greatest demand?

Still dominant

If you look at Stack Overflow's developer survey, you'll see that SQL is the second most popular language. Why might that be? For one thing it's simple to learn. Simple enough that even business users can write basic requests, joining & aggregating data.

Read: Can on-demand consulting save startups time & money?

Rugged, Proven & Open

SQL, having been around so long, is a fairly open standard. Sure there are extensions of it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That's a win for everybody.

Also: 30 questions to ask a serverless fanboy

Business users can write it

Another underappreciated feature is that basic queries are easy to write. They don't require complex syntax like a Hadoop job, or your favorite imperative programming language. The queries are readable, almost English-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is Amazon about to disrupt your data warehouse?


Amazon is about to launch a product called Glue. As you can see below, this is the last piece in the data warehousing puzzle. With that in place, Amazon will own you! Or at least have push-button products to meet all of an enterprise's varying needs.

Even if you're a small startup, you can do big-shot enterprise data warehousing. That means everyone can use cutting-edge data-driven techniques for product & business decisions.

Join 33,000 others and follow Sean Hull on twitter @hullsean.

What is Redshift?

Redshift is like the OLAP databases of years past, the Oracles of the world, purpose-built for warehousing data. Obviously without the crazy licensing model Oracle was famous for. With Amazon you can get an enterprise-class data warehouse for modest hourly prices.

If my recent conversations with recruiters about Redshift demand are any indication, there’s been a sudden uptick in startups looking for redshift expertise.

Also: Top serverless interview questions for hiring aws lambda experts

What is Spectrum?

Spectrum is a very new extension of Redshift allowing you to access & query S3 file data directly. This means you can have petabytes of data accessible before you ever load it. You will still ETL and load portions of it, but with Spectrum you can access the offline data too.

In the old Oracle days this was called an EXTERNAL TABLE. I mention this only to say that Amazon isn’t doing anything that hasn’t been done before. Rather they’re bringing these advanced features within reach of everyday startups. That’s cool.

Related: Which engineering roles are in greatest demand?

What is Glue?

Glue is still in beta, but if the RE:Invent talk above is any indication, it’s set to disrupt an entire industry. Wow!

Glue first catalogs your data sources. What does this mean? It scans them & models their schemas.

It then generates sample Python ETL code. Modify it, or write your own. Share your code via Git. Or borrow other open source pieces that already address your specific ETL use case!

Lastly it includes a job scheduler which handles dependencies. Job A must be completed before B can run and so forth. Error handling & logging are also all included.

Since these are native Amazon services, of course they’re going to integrate with their dangerously fast Redshift warehouse.

Read: Can on-demand consulting save startups time & money?

What is serverless?

I’ve written about how to throw fastballs at a serverless fanboy and even how to hire a serverless expert. But really what is it?

Serverless means deploying functions directly into the cloud. No servers, no configuration. All the systems administration & automation is hidden. No more devops to argue with! Amazon’s own offering is called Lambda.

Also: 30 questions to ask a serverless fanboy

What is Quicksight?

Amazon's even jumped into the fray at the presentation layer. Quicksight is a BI tool along the lines of Mode, Domo, Looker or Tableau.

Now it’s possible to stay completely within the cozy Amazon ecosystem even for business insight and analytics.

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Key lessons from the Devops Handbook

I picked up a copy of the DevOps Handbook.

This is not a book about how to set up Amazon servers, how to use Git, CodePipeline or Jenkins. It's not about Chef or Ansible or other tools.

Join 33,000 others and follow Sean Hull on twitter @hullsean.

This is a book about processes & people. It’s about how & why automation & world-class infrastructure will make your business more agile, raise quality & increase productivity.

1. Infrastructure in version control

With technologies like Terraform and CloudFormation, the entire state of your infrastructure can be captured. That means you can manage it just like any other code.

Also: Myth of five nines – Why high availability is overrated

2. Pushbutton builds

You’ve heard it before. Automate your builds. That means putting everything in version control, from environment building scripts, to configs, artifacts & reference data. Once you can do that, you’re on your way to automating production deploys completely.

Related: 5 ways to move data to amazon redshift

3. Devs & Ops comingled

In the devops world, devs should learn about operations, infrastructure, performance & more. What’s more operations teams should work closely with devs.

Read: Why were dev & ops siloed job roles?

4. Servers as cattle not pets

In the old days, we logged into servers & provided personal care & feeding. We treated them like pets.

In the new world of devops, we should treat servers like cattle. When it begins to fail, take it out back and shoot it. (tbh i don’t love the analogy, but it carries some meaning…)

Also: Are SQL databases dead?

5. Open to learnings & failures

Organizations that are open to failures, without playing the blame game, learn quicker & recover from problems faster.

Also: Is Amazon too big to fail?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is there a new better way to build a data warehouse in 2016?

redshift warehouse

In the old days… the bygone days of 2005 🙂 That was when you’d pony up for an Oracle license, get the hardware, and build your warehouse. Somewhere along the way you crossed your fingers.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Today everybody wants to treat data as a product. And for good reason. Knowing how to better serve your customers & iterate more quickly is essential in today's hypercompetitive startup world.

1. Amazon Redshift enters the fray

Recently I’ve been wondering why is everyone suddenly talking about Amazon Redshift?? I ask not because recruiters are experts at database technology & predicting the industry trends, but rather because they have their finger on the pulse of what firms are doing.

Amazon launched Redshift in early 2013 using ParAccel technology. Adoption has been quick. Customers who already have their data in the AWS ecosystem find the offering a perfect match for their data analytics needs. And with stories swirling around of 10 hour MySQL reports running in under 60 seconds on Redshift, it’s no wonder.

Also: Is AWS too complex for small dev teams?

2. Old method – select carefully

Having fully digested Ralph Kimball's opus, you set out to meet with stakeholders, and figure out what you were building.

Of course no one understood your questions, and business units & engineering teams spoke English & French. Months went by, and things devolved. Morale got squashed. Eventually out the other end something would be built, nobody would be happy, and eyeballs would roll over the dollars spent.

This model was known in the data warehousing world by the wonderful acronym ETL which is short for extract, transform & load. The transform part happens before you load it. So that your warehouse is a shining, trimmed & manicured copy of your data, ready for reporting.

Also: Is Amazon too big to fail?

3. Today – mirror everything & then build views

Today you’re more likely to see the ELT model employed. That is Extract, Load & Transform. A subtle change, with big differences. When you load first, you mirror all of your transactional data into your warehouse, then build views or new summary tables to fit your ongoing needs.

Customers are using tools like Looker & Tableau to layer on top of these ELT warehouses, tools which also have some intelligence around the transform piece. This makes the process more self-serve for business units, and requires less back & forth between engineering & product teams. No more waiting a few days for a report to be built, because these non-technical teams can build for themselves.
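The transform-after-load step often boils down to summary tables built straight from the mirrored data. A rough sketch, with illustrative names:

CREATE TABLE analytics.daily_revenue AS
SELECT date_trunc('day', o.created_at)  AS day,
       sum(o.amount)                    AS revenue,
       count(DISTINCT o.user_id)        AS paying_customers
FROM staging.orders o
GROUP BY 1;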

Also: When hosting data on Amazon turns bloodsport?

Read: Is data your dirty little secret?

4. Pipeline services

So you're going down the ELT path, but how do you get your data into Redshift? I wrote Five ways to get data into Redshift to answer that question.

There are a number of service-based offerings, from the point & click Fivetran to the more full-featured Alooma. And then RJ Metrics & Flydata also fit the bill. You may also want to build your own with Xplenty, which has a lot of ELT/ETL logic you can build without code. Pretty spiffy.

Read: Is aws a patient that needs constant medication?

5. Reporting databases

We'll be covering a lot more in this space, so check back.

Related: Does Amazon eat it’s own dogfood?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Five ways to get your data into Redshift

redshift data pipeline

Everybody is hot under the collar these days over Redshift. I heard one customer say a query that took 10 hours before now finishes in under a minute. Without modification. When businesses see a 600x speedup, that can change the way they do business.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options is RJ Metrics Pipeline. Set up a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That's good. You'll also be given the GRANT statements to create a user in MySQL for RJM.

rjmetrics table config screen

RJM uses a primary or unique key to figure out which rows have changed. Well that’s not completely true. Only if you’re using incremental refresh. If you’re using complete refresh, then it just selects all the data & replaces it each time.
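Conceptually an incremental refresh is just a high-water-mark query against the source. This isn't RJM's actual implementation, only a sketch of the idea with made-up table and column names:

-- Pull only rows changed since the last successful sync.
SELECT id, email, plan, updated_at
FROM users
WHERE updated_at > '2017-06-01 00:00:00'   -- high-water mark saved from the previous run
ORDER BY updated_at;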

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. xplenty

xplenty is really a lot more than just a sync tool. It's a full-featured ETL system. Want to avoid writing tons of Python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well xplenty is made to allow building ETL systems without code.

xplenty main dashboard

It's a bit complex to set up at first, but very full featured. It is the DIY developer's or DBA's tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport?

Read: Is data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

alooma dashboard

Although it’s a bit more complex to setup than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen, means your data team can be very proactive. This is great for the business.

What's more there is a Python-based Code Engine which allows you to write bits of code that transform data in the pipeline. That's huge! Want to do some simple ETL? This is a way to do that. Also you can send notifications, or requeue events. All this means you get a state of the art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat it’s own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It's a very viable option. I wrote at length about that solution, so I'll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose Tech Blog posted about their project which can move data from Postgres to Redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you're using MySQL as the source db, you'll need to change the code. One thing I found in there was Perl & sed commands to transform your source schema CREATE & ALTER statements into Redshift-compatible ones. That in itself is worth a look.
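To give a flavor of what those scripts translate, a MySQL definition and its Redshift-friendly equivalent might differ along these lines. This is only an illustration, not output from their code:

-- MySQL source definition
CREATE TABLE users (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  email       VARCHAR(255),
  created_at  DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Rewritten for Redshift
CREATE TABLE users (
  id          BIGINT NOT NULL,
  email       VARCHAR(1020),   -- widened: Redshift counts bytes, not characters
  created_at  TIMESTAMP,
  PRIMARY KEY (id)             -- informational only; Redshift doesn't enforce it
);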

Lambda to the rescue

The awslabs github team has put together a lambda-based redshift loader. This might be just what you need. Remember though that you'll need to deliver your source data in CSV files to S3 on the regular. So you'll need some method to dump it. But if you have that half of the equation, this is ideal.

Database Migration Service or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is data your dirty little secret?

data comparison cloud

While I was fumbling for the dictionary to figure out what polyglot persistence was, the CTO had decided to build a warehouse on Redshift.

“Everybody’s moving data there.” He declared. I looked on quizzically.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

“That’s a very new database engine”, I chimed in. “Lets do some testing”.

And there began an adventurous ride into the bleeding edge of Amazon’s new data service offerings!

1. The data scientist comes crying

For our transactional database we were using Amazon RDS for MySQL. It's a great managed service that eliminates some of the headaches of doing it yourself. I wrote about this in RDS or MySQL Use Cases.

We needed some way to get data over to Redshift. We evaluated AWS Data Pipeline, but it wasn't realtime enough. In a pinch we decided on a service called Flydata. After weeks of effort to get it set up & administered, we had it running smoothly.

I have since discovered some pipelining solutions dedicated to Redshift, such as Alooma – modern data plumbing, RJMetrics pipeline and Domo. I *did not* manage to get Tungsten working. It supports Redshift on paper, but has a lot of growing up to do.

Until one day the data scientist shows up at my desk. "We have problems in our data on Redshift." I look back confused. "Are you sure? Can you tell me where you're seeing that?" I respond.

Also: When hosting data on Amazon turns bloodsport

2. Deleted data reappears in Redshift!

He sends me over some queries, that I rerun myself. I see extra data in Redshift too, data that had been deleted in MySQL. Strange. We dig deeper together trying to figure out what’s happening.

We find that the tables with extra data are child tables of a parent where data was deleted. Imagine Citibank deletes a customer, they also want to delete the records for monthly bills. Otherwise those will just be hanging around, and won’t match up anymore with a parent record for a customer. In real life Citibank probably doesn’t delete like this but it’s a helpful example.

The first thing I do is open a ticket with Flydata. After all we hadn’t gotten any errors logged. Things *must* be running correctly.

After highlighting the severity of the issue, we set up a conference call with Flydata. Digging further they discover the problem. Child table data can't get deleted on Redshift, because it doesn't support ON DELETE CASCADE. Wait what?

Turns out Flydata makes use of the MySQL binary log to move data. In MySQL-to-MySQL replication this works fine, because downstream you also have MySQL, which implements ON DELETE CASCADE, so those child records get cleaned up correctly. Since Redshift doesn't have this, there's no way for Flydata to instruct Redshift what to do. Again I said, wait what?
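In schema terms the Citibank-style example looks something like this (hypothetical tables). The DELETE on the parent is what shows up in the log; the child rows vanish via the constraint, which Redshift simply doesn't have:

CREATE TABLE customers (
  id  INT PRIMARY KEY
);

CREATE TABLE monthly_bills (
  id           INT PRIMARY KEY,
  customer_id  INT,
  amount       DECIMAL(10,2),
  FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE
);

-- On MySQL this removes the customer AND their bills.
-- Replayed against Redshift, only the customers row goes away.
DELETE FROM customers WHERE id = 42;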

My surprise wasn't that a new unproven technology like Redshift had a lot of holes & missing features. My surprise was that Flydata was just silently ignoring the problem. Not even a logged message to tell the customer about inconsistencies.

Related: Is Amazon too big to fail?

3. The problem – comparing data

As you might imagine, this is a terrible way to find out about data problems. As the person tasked with moving data between these systems, eyes were on me. My thought was, we chose a service-based solution, so they'd manage data movement. If there's a problem, they'll surely alert us.

From there the conversation became, ok, how do we figure out where all these data differences are? Is it widespread or isolated to a few tables? Can we mitigate it with changed queries? Cleanup on a daily basis? These are some questions that’ll immediately come to mind.

To answer them we needed a way to compare table data across different databases. This is hard to do even within a homogeneous environment, where server versions & datatypes are likely to be the same. It is much more complicated when you're working across heterogeneous systems.

Read: 5 Reasons to move data to amazon redshift

4. Build some way to spot check data

Although this still doesn’t seem a solved problem, there are some tools. One way is to perform checksums on tables & rows. These can then be compared to find differences.

This drew me to find Jason Friedman's table hash script on GitHub. It can work across MySQL, Postgres & Redshift. Pretty cool stuff if you ask me.
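The basic idea behind that kind of comparison is a per-table checksum computed on each side and then compared. Something like the sketch below, though NULL handling, type casts & character sets make it fiddly in practice, which is exactly why a dedicated script helps. Table and column names are illustrative:

-- On MySQL
SELECT count(*) AS row_count,
       sum(conv(substring(md5(concat_ws('|', id, email, status)), 1, 8), 16, 10)) AS tbl_checksum
FROM users;

-- On Redshift
SELECT count(*) AS row_count,
       sum(strtol(substring(md5(cast(id AS varchar) || '|' || email || '|' || status), 1, 8), 16)) AS tbl_checksum
FROM users;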

One problem remains. Databases are always in flux. As such you may find discrepancies based on data that hasn’t been moved yet. Data that’s just changed in the last few minutes.

If you refresh data nightly, you may for example be able to stop a slave to compare data at an instant in time.

Also: Is Redshift outpacing hadoop as the warehouse for startups?

5. The mentality: treat data as a product & monitor

Solving tough problems like these is a work in progress. What it taught me is that:

You should own your data pipeline

This allows you to be vigilant about monitoring, throw errors if data is different, and ultimately treat data as a product. Owning the pipeline will mean you can build monitoring around stages, and automate spot checks on your data.

You won’t get it perfect, but you want to know when it isn’t.

Also: 5 core pieces of the Amazon puzzle to get your project off the ground

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is Redshift outpacing Hadoop as the big data warehouse for startups?

redshift hadoop killer

More and more startups are looking at Redshift as a cheaper & faster solution for big data & analytics.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Saggi Neumann posted a pretty good side-by-side comparison of Redshift & Hadoop and concluded they were tied based on your individual use case.

Meanwhile Bitly engineering concluded Redshift was much easier.

1. More agile

One thing pointed out by the bitly blog post, which I’ve seen countless times, is the slow iteration cycle. Write your map-reduce job, run, test, debug, then run on your cluster. Wait for it to return and you might feel like you’re submitting a stack of punched cards. LOL Resolve the errors that come back and then rerun on your cluster. Over & over & over again.

With Redshift you're writing SQL, so you're iterating through syntax errors quickly. What's more, since Redshift is a column-compressed database, you can do full table scans on columns without indexes.

What that means for you and me is that queries just run. And they run blazingly fast!

Also: When hosting data on Amazon turns bloodsport

2. Cheap

Redshift is pretty darn cheap.

Saggi's article above quotes Redshift at $1000/TB/yr for reserved instances, and $3700/TB/yr for on-demand. This compares with a Hadoop cluster at $5000/TB/yr.

But neither will come within spitting distance of the old world of Oracle, where customers host big iron servers in their own datacenter, paying north of a million dollars between hardware & license costs. Amazon cloud FTW!

Related: Did dropbox have to fail?

3. Even faster

Airbnb's nerds blog has a post showing it costing 25% of a Hadoop cluster, and getting a 5x performance boost. That's pretty darn impressive!

Flydata has done benchmarks showing 10x speedup.

Read: Are SQL Databases dead?

4. SQL Toolchains

***

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

5. Limitations

o data loading

You load data into Redshift using the COPY command. This command reads flat files from S3 and dumps them into tables. It can be extremely fast if you do things in parallel. However getting your data into those flat files is up to you.
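The COPY itself is the easy part. A typical invocation looks like this, with bucket, prefix and IAM role as placeholders:

-- Splitting the input across several gzipped files lets the slices load in parallel.
COPY staging.events
FROM 's3://my-bucket/events/2017-06-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV GZIP
TIMEFORMAT 'auto'
REGION 'us-east-1';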

There are a few solutions to this.

– amazon data pipeline

This is Amazon's own toolchain, which allows you to move data from RDS & other Amazon hosted data sources. Data Pipeline does not move data in realtime, but in batch. Also it doesn't take care of schema changes, so you have to do that manually.

I mentioned it in my 5 reasons to move data to Amazon Redshift

– Flydata service

Flydata is a service with a monthly subscription which will connect to your RDS database, and move the data into Redshift. This seems like a no-brainer, and given the hefty price tag of thousands per month, you'd expect it to cover your bases.

In my experience there are a lot of problems & it still required a lot of administration. When schema changes happen, those have to be carefully applied on Redshift. What’s more there’s no silver bullet around the datatype differences.

Also: Some thoughts on 12 factor apps

Flydata also makes use of the binary logs to replicate your data. Anything that doesn't show up in the binary logs is going to cause you trouble. That includes when you set sql_log_bin=0 in the session, or when an SQL statement includes a no-logging hint. Also watch out for replicate-ignore-db options in your my.cnf. But it also will fail if you use ON DELETE CASCADE. That's because those downstream changes happen via constraint in MySQL. But… drumroll please, Redshift doesn't support ON DELETE CASCADE. In our case the child tables ended up with extra rows, and some queries broke.

– scripts such as Donors choose loader

Donors Choose has open sourced their nightly Redshift loader script. It appears to reload all data each night. This will nicely sidestep the ON DELETE CASCADE problem. As you grow though you may quickly hit a point where you can’t load the entire data set each night.

Their script sources from Postgres, though I’m interested to see if it can be modified for MySQL RDS.

– Tried & failed with Tungsten replicator

Theoretically Tungsten replicator can do the above. What’s more it seems like a tool custom made for such a use case. I tried for over a month to troubleshoot. I worked closely with the team to iron out bugs. I wrote wrestling with bears or how I tamed Tungsten replicator for MySQL and then I wrote a second article Tungsten replicator the good the bad & the ugly. Ultimately I did get some data moving between MySQL RDS & Redshift, however certain data clogged the system & it wouldn’t work for any length of time.

Also: Secrets of a happy Amazon hacker or how to lock down your account with IAM and multi-factor authentication

o data types & character sets

There are a few things here to keep in mind. Redshift counts bytes, so if in MySQL or some other database you had a varchar(5), it may need to be a varchar(20) in Redshift. Even then I had cases where it still didn't fit & I had to make the field bigger by 4.
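So a column defined in MySQL like the first comment below typically needs a wider definition on the Redshift side. Illustrative only:

-- MySQL:    name VARCHAR(5)    -- 5 characters, multibyte allowed
-- Redshift: up to 4 bytes per UTF-8 character, so size by bytes instead
CREATE TABLE products (
  id    BIGINT,
  name  VARCHAR(20)
);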

I also ran into problems around string character encodings. According to the docs Redshift handles 4-byte UTF-8.

Redshift doesn’t support ARRAYs, BIT, BYTEA, DATE/TIME, ENUM, JSON and a bunch of others. So don’t go into it expecting full Postgres support.

What you will get are multibyte characters, numeric, character, datetime, boolean and some type conversion.

Also: Is the difference between dev & ops a four-letter word?

o rebalancing

If and when you want to add nodes, expect some downtime. Yes, theoretically the database is online while it's shipping data to the new nodes & redistributing things, but the latency can start to feel like an outage. What's more, it can easily push into the hours to do.

Also: Is AWS enabling startups which enable AngelList Syndicates to boil the VC business?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters