Is there a new better way to build a data warehouse in 2016?

In the old days… the bygone days of 2005 :) That was when you’d pony up for an Oracle license, get the hardware, and build your warehouse. Somewhere along the way you crossed your fingers.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Today everybody wants to treat data as a product. And for good reason. Knowing how to better serve your customers & iterate more quickly is essential in today’s hypercompetitive startup world.

1. Amazon Redshift enters the fray

Recently I’ve been wondering: why is everyone suddenly talking about Amazon Redshift? Recruiters keep bringing it up, and I take notice not because they are experts at database technology or at predicting industry trends, but because they have their finger on the pulse of what firms are doing.

Amazon launched Redshift in early 2013 using ParAccel technology. Adoption has been quick. Customers who already have their data in the AWS ecosystem find the offering a perfect match for their data analytics needs. And with stories swirling around of 10 hour MySQL reports running in under 60 seconds on Redshift, it’s no wonder.

Also: Is AWS too complex for small dev teams?

2. Old method – select carefully

Having fully digested Ralph Kimball’s opus, you set out to meet with stakeholders and figure out what you were building.

Of course no one understood your questions, and the business units & engineering teams might as well have been speaking English & French. Months went by, and things devolved. Morale got squashed. Eventually something would come out the other end, nobody would be happy, and eyes would roll over the dollars spent.

This model was known in the data warehousing world by the wonderful acronym ETL, which is short for extract, transform & load. The transform part happens before you load the data, so that your warehouse is a shining, trimmed & manicured copy of your data, ready for reporting.

Also: Is Amazon too big to fail?

3. Today – mirror everything & then build views

Today you’re more likely to see the ELT model employed. That is Extract, Load & Transform. A subtle change, with big differences. When you load first, you mirror all of your transactional data into your warehouse, then build views or new summary tables to fit your ongoing needs.
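To make the load-first idea concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the warehouse, and the `orders` table, its columns and the `revenue_by_customer` view are all made-up names for illustration. The point is only the order of operations: raw rows go in untouched, and the transform lives in a view built afterwards.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse here (an assumption for the demo).
# Table, column and view names are hypothetical.
conn = sqlite3.connect(":memory:")

# Load: mirror the transactional rows as-is, with no cleanup on the way in.
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, status INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, 0, 25.0), (2, 10, 1, 0.0), (3, 11, 0, 40.0)],
)

# Transform: done afterwards, inside the warehouse, as a view.
conn.execute(
    """
    CREATE VIEW revenue_by_customer AS
    SELECT customer_id, SUM(amount) AS revenue, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    """
)

rows = conn.execute(
    "SELECT customer_id, revenue, order_count "
    "FROM revenue_by_customer ORDER BY customer_id"
).fetchall()
print(rows)
```

If reporting needs change later, you add or replace a view. The mirrored data does not need to be reloaded.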

Customers are using tools like Looker & Tableau to layer on top of these ELT warehouses. These tools also have some intelligence around the transform piece. This makes the process more self-serve for business units, and requires less back & forth between engineering & product teams. No more waiting a few days for a report to be built, because these non-technical teams can build it for themselves.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

4. Pipeline services

So you’re going down the ELT path, but how do you get your data into Redshift? I wrote Five ways to get data into Redshift to answer that question.

There are a number of service based offerings, from the point & click Fivetran to the more full featured Alooma. RJ Metrics & Flydata also fit the bill. You may also want to build your own with xplenty, which has a lot of ETL logic you can build without code. Pretty spiffy.

Read: Is aws a patient that needs constant medication?

5. Reporting databases

We’ll be covering a lot more in this space, so check back.

Related: Does Amazon eat it’s own dogfood?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Are career promotions like marriage… appealing until your first divorce?

I was recently flipping through an interesting email list. It’s aimed at tech leaders, managers & startup entrepreneurs. An HR team lead posted asking about “promotion paths” for engineers.

“While I have an intuitive grasp of what engineers at those different levels look like, I’m having trouble making those concrete.”

It struck me how antiquated the whole “career ladder” concept is. Work one job for 20-30 years. It feels like the fairytale of dating that leads safely to marriage. It all seems like a wonderful plan until it fizzles out, employees get jaded, they start seeing the real money being paid elsewhere, and begin looking around.

1. Talent in short supply

I’m not a CTO.  I should preface with that bit.  I’m a consultant.  That said I’ve worked in the tech industry for 20 years, so I have a bit of an opinion here.

I go to meetups, startup industry & pitch events, and they’re all like a feeding frenzy. There are more companies hiring now than I remember back in 1998 & 1999. It’s just crazy.

AngelList says 18,000 companies are hiring right now. What about Made In NYC? That shows 735 jobs. And of course there’s Y Combinator’s “Who is hiring?” thread for April 2016, which posts every other month. It has 720 comments as of this writing.

Also: Why I don’t work with recruiters

2. Are salary jumps always larger through external promotion?

I’ve seen a pattern repeated over & over. An outside firm offers more money & grabs the talent, or the talent gets restless, starts looking & finds they get a bigger bump in salary by leaving than through internal promotion.

I don’t know why this is, but it seems almost universal that salary jumps are larger from outside firms than internally through promotion.

Also: Why devops talent is so hard to find

3. Building a better ladder

There are great posts on engineering ladders like this one from Neo and also this one from RTR. Also take a look at this one at Artsy. And of course somebody has to go and put theirs up on github. :)

All the titles & internal shuffling in the world aren’t going to hide industry pay for long.  When an employee gets wise to their career & the skills marketplace, they’ll eventually learn that title does not equal compensation.

Related: How to hire a developer that doesn’t suck?

4. Building a better culture

In a pricey city like New York, the only thing that seems a counterweight to this is phenomenal culture, the chance to build something cool & being surrounded by coworkers you love. To be sure, if you bounce around you get less of this. Companies like Etsy come to mind. According to Glassdoor, companies like Airbnb, Hubspot & Facebook also fit the bill.

Read: 8 questions to ask an aws expert

5. Surge pricing for engineers?

As an alternative to better ladders & promotions, perhaps what Uber did for taxi driving would make sense for hiring engineers too. Let the freelancing phenomenon grow even bigger!

Perhaps we need surge pricing for engineers. That way the very best really do get rewarded the most. Let the marketplace work its magic.

Also: When you have to take the fall

Are engineering orgs like Google so different from sales driven ones like Oracle?

Over the years I’ve worked with over 100 different organizations. After two decades in the industry you see a lot of things. Some businesses are more engineering heavy, while others are more sales driven.

So this past week, I was somewhat surprised because I met with two very different organizations, and the contrast stood out dramatically to me. Pando Daily called it the Clash of Cultures.

I wonder, will we ever learn from each other?

1. On Monday I met with CloudOne

I’m choosing a fictional name here, but the meeting was real. We met over lunch to discuss how we might work together. Their org has been around for years, has a phenomenal track record, and they are strongly sales oriented.

Some observations:

o They’re hungry. They pushed for client lists & sniffed for leads.
o They’re margin oriented. They had a clear idea of where their strong suit was and what types of customers they wanted to work with, because they had a clear idea of their margins.
o They understand the industry well, much better than I did.
o They could certainly talk circles around me in terms of industry categories & verticals.
o They glossed over technical details
o They made broad generalizations & mixed up facts at times

Also: Beware the sales wolf in sheep suits

2. On Thursday I met with DataOne

Here again I’m choosing a fictional name. We met over dinner to discuss my opinions of the market and also if I might have any venture leads or could make introductions.

Some Observations I came away with:

o Their company is all engineering.
o They’re intimately focused on coding & building the product.
o They downplayed product limitations & seemed somewhat out of touch with the customer.
o They seemed to be feeling around in the dark for investors
o They seemed to have a weak network

Related: When you have to take the fall

3. Org experience: LearnOne

One of my past customers, again under a fictional name here, was also an incredibly sales heavy organization.

Some Observations:

o Their monthly standups felt like a sporting huddle.
o Lots of rah rah rah & high fives
o They were extremely sales driven, growing rapidly
o They had tremendous problems around engineering.
o They seemed to be boxing wayyy above their weight class.

Read: 5 Things I learned from David Maister about trust & advising clients

4. Cross-cultural studies

As a consultant I find this all fascinating. It often seems like this cultural style is driven from the top. The big movers are the ones who shape the organization.

I think of Google as an incredible example of an engineering driven organization. Hiring top people is all about math & problem solving, with little emphasis on personality. Meanwhile their products lack UI polish, but are functionally accurate & always fast.

Contrast that with Oracle, which sends in a heavy armament of perfect suits to close a deal, negotiates softly until your firm is locked in, then jacks up the license fees until you bleed. Meanwhile, although the product is a sturdy technical construction, it’s the marketing that is truly smooth & polished.

Also: Why is devops talent in short supply?

5. The takeaway

A winning team needs both. I’m obviously born of the engineering camp, but I agree with Ben Horowitz that the new enterprise customer is much like the old enterprise customer. And yes sales matters more than ever before.

At the same time the engineering team needs to carry equal weight, and decisions for both teams need to be framed as tradeoffs for the other.

Also: Five ways to build an analytics database with Amazon Redshift

Five ways to get your data into Redshift

Everybody is hot under the collar these days over Redshift. I heard one customer say a query that took 10 hours before now finishes in under a minute, without modification. When businesses see a 600x speedup, that can change the way they do business.

What’s more, Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options is RJ Metrics Pipeline. Set up a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.

[Screenshot: RJ Metrics table config screen]

RJM uses a primary or unique key to figure out which rows have changed. Well, that’s not completely true: it only does so if you’re using incremental refresh. If you’re using complete refresh, it just selects all the data & replaces it each time.
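The difference between the two refresh modes can be sketched in a few lines of Python. This is not RJM’s actual implementation, which isn’t documented here. It is one simple way to do key-based syncing, with the target modeled as a dict keyed on a hypothetical `id` primary key.

```python
def complete_refresh(source_rows):
    """Rebuild the target from scratch: select everything and replace it."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_refresh(target, source_rows):
    """Upsert only rows that are new or differ from the target copy.

    Rows are matched on the primary key "id" (a hypothetical column name).
    Returns how many rows were written. Deletes in the source are not handled.
    """
    written = 0
    for row in source_rows:
        key = row["id"]
        if target.get(key) != row:
            target[key] = dict(row)
            written += 1
    return written


# First sync copies everything.
source = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
target = complete_refresh(source)

# Later, one row changes and one row is added in the source.
source[1] = {"id": 2, "email": "b2@example.com"}
source.append({"id": 3, "email": "c@example.com"})

written = incremental_refresh(target, source)
print(written, sorted(target))
```

Without a primary or unique key there is nothing to match rows on, which is why the key matters for incremental mode but not for complete refresh.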

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. xplenty

xplenty is really a lot more than just a sync tool. It’s a full featured ETL system. Want to avoid writing tons of python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well xplenty is made to allow building ETL systems without code.
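For a sense of the hand-written jobs a tool like this replaces, here is a small sketch of that kind of transform in plain Python. The field names (`id`, `plan`, `amount`) and the "unknown" fallback are assumptions for illustration, and none of this is xplenty’s API.

```python
# Hypothetical mapping from the flag stored in the source db to a readable label.
PLAN_LABELS = {0: "paid", 1: "free"}


def transform_row(row):
    """Convert datatypes and decode flag values for one source row."""
    return {
        "id": int(row["id"]),                               # text -> integer
        "plan": PLAN_LABELS.get(row["plan"], "unknown"),    # 0/1 -> label
        "amount": float(row["amount"]),                     # text -> number
    }


raw = {"id": "7", "plan": 0, "amount": "19.99"}
clean = transform_row(raw)
print(clean)
```

Multiply that by every table and every quirky column, and the appeal of building it without code becomes clear.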

[Screenshot: xplenty main dashboard]

It’s a bit complex to set up at first, but very full featured. It is the DIY developer’s or DBA’s tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

[Screenshot: Alooma dashboard]

Although it’s a bit more complex to set up than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen means your data team can be very proactive. This is great for the business.

What’s more, there is a Python-based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL? This is a way to do that. You can also send notifications or requeue events. All this means you get a state of the art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat it’s own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It’s a very viable option. I wrote at length about that solution, so I’ll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose Tech Blog posted about their project, which can move data from Postgres to Redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also, if you’re using MySQL as the source db you’ll need to change the code. One thing I found in there was Perl & sed commands to transform your source schema CREATE & ALTER statements into Redshift compatible ones. That in itself is worth a look.

Lambda to the rescue

The awslabs GitHub team has put together a Lambda-based Redshift loader. This might be just what you need. Remember though that you’ll need to deliver your source data in CSV files to S3 on the regular, so you’ll need some method to dump it. But if you have that half of the equation, this is ideal.
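The dumping half can be quite small. Below is a rough sketch that writes a table out as CSV using only the standard library. SQLite stands in for the source database, the `users` table is made up, and the upload to S3 is left out. Whether you want the header row depends on how your load is configured.

```python
import csv
import io
import sqlite3


def dump_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, header row first.

    `table` is assumed to be a trusted name, not user input.
    Returns the number of data rows written.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Demo against an in-memory SQLite db standing in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com")],
)

buf = io.StringIO()
n = dump_table_to_csv(conn, "users", buf)
print(n)
print(buf.getvalue())
```

In practice you would write to a file rather than a `StringIO`, then push that file to the S3 location the loader watches.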

Database Migration Service or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Locking down cloud systems from disgruntled engineers

I worked at a customer last year, on a short term assignment. A brilliant engineer had built their infrastructure, automated deployments, and managed all the systems. Sadly, despite all the sleepless nights and dedication, they hadn’t managed to build up good rapport with management.

I’ve seen this happen so many times, and I do find it a bit sad. Here’s an engineer who’s working his butt off and really wants the company to succeed. He really cares about the systems. But he doesn’t connect well with people, and is often dismissive or disrespectful, or talks down to people like they’re stupid. All of that burns bridges, and there are a lot of bad feelings between all parties.

So how do you manage the exit process? Here’s a battery of recommendations for changing credentials & logins so that systems can’t be accessed anymore.

1. Lock out API access

You can do this by removing the administrator role or any other role their IAM user might have. That way you keep the account around *just in case*. This will also prevent them from doing anything on the console, but you can see if they attempt any logins.
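As a rough sketch of what that could look like in code, the function below detaches a user’s managed policies and deactivates their access keys while leaving the account in place. It expects a boto3 IAM client to be passed in and uses the client’s `list_attached_user_policies`, `detach_user_policy`, `list_access_keys` and `update_access_key` calls. The user name is a placeholder, and this sketch does not cover inline policies, group memberships or paginated results, so treat it as a starting point rather than a complete lockout.

```python
def lock_out_iam_user(iam, user_name):
    """Strip permissions and API keys from an IAM user, but keep the user.

    `iam` is expected to be a boto3 IAM client, e.g. boto3.client("iam").
    Returns (detached_policy_arns, deactivated_key_ids).
    """
    detached = []
    attached = iam.list_attached_user_policies(UserName=user_name)
    for policy in attached["AttachedPolicies"]:
        iam.detach_user_policy(UserName=user_name, PolicyArn=policy["PolicyArn"])
        detached.append(policy["PolicyArn"])

    deactivated = []
    keys = iam.list_access_keys(UserName=user_name)
    for key in keys["AccessKeyMetadata"]:
        # Inactive keys can be switched back on later, unlike deleted ones.
        iam.update_access_key(
            UserName=user_name,
            AccessKeyId=key["AccessKeyId"],
            Status="Inactive",
        )
        deactivated.append(key["AccessKeyId"])

    return detached, deactivated


# Usage (hypothetical user name), with boto3 installed and credentials configured:
#   import boto3
#   lock_out_iam_user(boto3.client("iam"), "departing.engineer")
```

Deactivating rather than deleting keys fits the *just in case* approach: nothing is destroyed, it simply stops working.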

Also: Is AWS too complex for small dev teams?

2. Lock out of servers

They may have the private keys for various servers in your environment. So to lock them out, scan through all the security groups, and make sure their whitelisted IPs are gone.
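Scanning by hand gets tedious with many groups, so here is a small helper sketch. It takes a list of security groups in the shape that boto3’s `describe_security_groups` returns under `"SecurityGroups"` and reports every ingress rule that still allows one of the given CIDRs. The group IDs and addresses in the demo are made up, and IPv6 ranges are not checked.

```python
def find_rules_for_cidrs(security_groups, cidrs):
    """Return ingress rules that still whitelist any of the given CIDRs.

    Each match is (group_id, protocol, from_port, to_port, cidr).
    """
    wanted = set(cidrs)
    matches = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            for ip_range in perm.get("IpRanges", []):
                if ip_range.get("CidrIp") in wanted:
                    matches.append((
                        group["GroupId"],
                        perm.get("IpProtocol"),
                        perm.get("FromPort"),
                        perm.get("ToPort"),
                        ip_range["CidrIp"],
                    ))
    return matches


# Made-up sample data in the same shape as the API response.
groups = [
    {
        "GroupId": "sg-11111111",
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "203.0.113.7/32"}, {"CidrIp": "198.51.100.0/24"}]},
        ],
    },
    {
        "GroupId": "sg-22222222",
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        ],
    },
]

hits = find_rules_for_cidrs(groups, ["203.0.113.7/32"])
print(hits)
```

Each hit is a rule to revoke. An empty list means that address is no longer whitelisted anywhere in the groups you passed in.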

Are you using a bastion box for access? That’s ideal because then you only have one access point. Eliminate their login and audit access there. Then you’ve covered your bases.

Related: Does Amazon eat it’s own dogfood?

3. Update deployment keys

At one of my customers the outgoing op had set up many moving parts & automated & orchestrated all the deployment processes beautifully. However he also used his personal GitHub key inside Jenkins. So when it went to deploy, it used those credentials to get the code from GitHub. Oops.

We ended up creating a company GitHub account, then updating Jenkins with those credentials. There were of course other places in the Capistrano bits that also needed to be reviewed.

Read: Is aws a patient that needs constant medication?

4. Dashboard logins

Monitoring with NewRelic or Nagios? Perhaps you have a centralized dashboard for your internal apps? Or you’re using Slack? Audit each of these too, and remove or disable the outgoing engineer’s logins.

Also: Is Amazon too big to fail?

5. Non-key based logins

Have some servers outside of AWS in a traditional datacenter? Or even servers in AWS that are using usernames & passwords? Be sure to audit the full list of systems, and change passwords or disable accounts for the outgoing sysop.

Also: When hosting data on Amazon turns bloodsport?
