Tag Archives: etl

Is there a new better way to build a data warehouse in 2016?

redshift warehouse

In the old days… the bygone days of 2005 🙂 That was when you’d pony up for an Oracle license, get the hardware, and build your warehouse. Somewhere along the way you crossed your fingers.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Today everybody wants to treat data as a product. And for good reason. Knowing how to better server your customers & iterate more quickly is essential in todays hypercompetitive startup world.

1. Amazon Redshift enters the fray

Recently I’ve been wondering why is everyone suddenly talking about Amazon Redshift?? I ask not because recruiters are experts at database technology & predicting the industry trends, but rather because they have their finger on the pulse of what firms are doing.

Amazon launched Redshift in early 2013 using ParAccel technology. Adoption has been quick. Customers who already have their data in the AWS ecosystem find the offering a perfect match for their data analytics needs. And with stories swirling around of 10 hour MySQL reports running in under 60 seconds on Redshift, it’s no wonder.

Also: Is AWS too complex for small dev teams?

2. Old method – select carefully

Ralph Kimball’s opus having fully digested, you set out to meet with stakeholders, and figure out what you were building.

Of course no one understood your questions, and business units & engineering teams spoke english & french. Months went by, and things devolved. Morale got squashed. Eventually out the other end something would be built, nobody would be happy, and eyeballs would roll over the dollars spent.

This model was known in the data warehousing world by the wonderful acronym ETL which is short for extract, transform & load. The transform part happens before you load it. So that your warehouse is a shining, trimmed & manicured copy of your data, ready for reporting.

Also: Is Amazon too big to fail?

3. Today – mirror everything & then build views

Today you’re more likely to see the ELT model employed. That is Extract, Load & Transform. A subtle change, with big differences. When you load first, you mirror all of your transactional data into your warehouse, then build views or new summary tables to fit your ongoing needs.

Customers are using tools like Looker & Tableau to layer on top of these ELT warehouses which are also have some intelligence around the transform piece. This makes the process more self serve for business units, and requires less back & forth between engineering & product teams. No more waiting a few days for a report to be built, because these non-technical teams can build for themselves.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

4. Pipeline services

So you’re going down the ELT path, but how do get your data into Redshift? I wrote Five ways to get data into Redshift to answer that question.

There are a number of service based offerings from the point & click Fivetran to the more full featured Alooma. And then RJ Metrics & Flydata also fit the bill. You may also want to build your own with xplenty that also has a lot of ELT ETL logic you can build without code. Pretty spiffy.

Read: Is aws a patient that needs constant medication?

5. Reporting databases

We’ll be covering a lot lot more in this space, so check back.

Related: Does Amazon eat it’s own dogfood?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Extract Transform & Load – What is it and why is it important?

So-called ETL relates to moving data from external sources into and out of relational databases or data warehouses.


Source systems may store data in an infinite variety of formats.  Extracting involves getting that data into common files for moving to the destination system.  CSV file also known as comma separated values is named because each of the records is stored as one line in the file, and fields are separated by commas, and often surrounded by quotes as well.  In MySQL INTO OUTFILE syntax can perform this function.  If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and create a .mysql script to then run with the mysql shell.  In Oracle you would use the spool command in SQL*Plus the command line shell.  Spool sends subsequent output from the screen also to a file.


This step involves modifying the extracted data in preparation for moving it into the target database server.  It may involve sweeping out blank records, or rearranging columns, or breaking files into smaller subsets of data.  You might also map values differently for instance if one column in the source database was gender with values M/F you might transform those to the strings “Male” and “Female” if that is more useful for your target database server.  Or you might transform those to numerical values, for instance Male & Female might be 0/1 in your target database.

Although I myriad of high level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experience System Administrator is familiar with.  Those include grep & sed which operate on regular expressions and can perform data transformation at lightening speed.  Then there is sort which can sort data and send the results to stdout or the file of your choosing.  Other tools include wc – word count, cut which can remove columns and so forth.

This final step involves moving the data into the database server, and it’s final target tables.  For instance in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat file dataloader.

Quora discussion by Sean Hull – What is ETL?