Tag Archives: redshift

How to build an operational datastore on AWS with S3 & Redshift


You’re building your data warehouse, and getting data into Redshift. You’ve got your ETL pipeline running, and presentation layer talking to the warehouse. Great.

But how to get access to that source data? Wouldn’t it be nice if that was close by too?

Join 35,000 others and follow Sean Hull on twitter @hullsean.

It may be you have 10-zillion rows of source data and don’t want or need to get all of that into Redshift and keep it there. But it would be nice to have access to it when you do.

Enter EXTERNAL tables, aka Spectrum. Now you can keep all your raw data in S3, an in place operational datastore of data before it’s been reworked and transformed. Use SQL to access it right where it sits.

Get all the advantages of lifecycle management in S3, and don’t pay all the redshift costs for data you don’t need all the time. Cool!

Let’s see how it works.

What is an EXTERNAL table?

Spectrum is Amazon’s rebranding of an old database technology called EXTERNAL TABLES. Oracle popularized this work back in the Oracle 9i days, allowing you to essentially map a CSV file that sits outside the database proper. This means you can query all that juicy data sitting in flat files. Cool!

Athena allows you to query this stuff as a service, native to AWS. Spectrum allows you to create those external tables inside of Redshift.


Give Redshift permissions

Go into IAM and create a new role called “SeanSpectrumRole”. Assign the policy AmazonS3ReadOnlyAccess. It looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    }
  ]
}

If you’re using the dashboard you just pick the policy from the named list. However if you’re using CloudFormation, you’ll use the code above.

Now navigate your aws console to the Redshift dashboard, click clusters, and click the checkbox for your cluster. Probably there’s only one.

Now click the “Manage IAM Roles” button, and a dialog should pop up. Select the role you created earlier, SeanSpectrumRole. Then click “Apply Changes”.

The beauty of the AWS world is that servers themselves can have API permissions. In this case we gave the redshift cluster or server itself, access to S3 for our use below!
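If you script the role instead of clicking through the console, the role also needs a trust policy so the Redshift service can assume it. Here’s a minimal sketch in Python that builds both documents as plain JSON. The function names are made up for this example, and the read-only actions (s3:Get* and s3:List*) are my assumption of what a read-only S3 policy grants, so check them against the managed policy in your own account.

```python
import json


def redshift_trust_policy():
    """Trust policy that lets the Redshift service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_only_policy(resource="*"):
    """Permissions policy granting read-only S3 access (assumed action list)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print both documents so they can be pasted into a template.
    print(json.dumps(redshift_trust_policy(), indent=2))
    print(json.dumps(s3_read_only_policy(), indent=2))
```

Passing a bucket ARN instead of "*" as the resource narrows the grant to the one bucket Spectrum needs to read.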


Create your spectrum schema

First you must create a spectrum schema. Here’s the syntax:

create external schema spectrum_schema
from data catalog
database 'sean'
region 'us-east-1'
iam_role 'arn:aws:iam::999999999999:role/SeanSpectrumRole';


Upload your data to S3 bucket

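Here we create an S3 bucket called sean_spectrum, then upload one CSV file named sean_numbers.txt. (Underscores in bucket names worked in us-east-1 when this was written; current S3 naming rules don’t allow them, so use hyphens for new buckets.)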

$ aws s3api create-bucket --bucket sean_spectrum --region us-east-1
"Location": "/sean_spectrum"
$ cd spectrum/
$ cat sean_numbers.txt
$ aws s3 cp sean_numbers.txt s3://sean_spectrum/
upload: ./sean_numbers.txt to s3://sean_spectrum/sean_numbers.txt
$ aws s3 ls s3://sean_spectrum/
2017-05-18 20:28:41 193 sean_numbers.txt

Note the names. The external table doesn’t point at the single file sean_numbers.txt. It points at the location s3://sean_spectrum/, and all files inside that directory will be queried. So make sure they have consistent formats!
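Since every file under the location gets queried, it’s worth checking formats before you upload. Here’s a minimal sketch that scans a local directory and reports how many columns each file has. The function names and the four-column expectation are assumptions for this example, and it only checks column counts, not data types.

```python
import csv
from pathlib import Path


def column_counts(directory, delimiter=","):
    """Map each file name in `directory` to the set of column counts it contains."""
    counts = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            rows = csv.reader(handle, delimiter=delimiter)
            # Blank lines are skipped; an empty file yields an empty set.
            counts[path.name] = {len(row) for row in rows if row}
    return counts


def is_consistent(directory, expected_columns, delimiter=","):
    """True only if every file has exactly `expected_columns` fields on every row."""
    return all(
        found == {expected_columns}
        for found in column_counts(directory, delimiter).values()
    )
```

For the example in this post you would call is_consistent("spectrum/", 4) before running aws s3 cp.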


Create & query your external table

Here’s how you create your external table. Note this is just a map to data. The data is still stored in S3. It is not brought into Redshift except to slice, dice & present.

mydb=# create external table spectrum_schema.sean_numbers(
id int,
fname string,
lname string,
phone string)
row format delimited
fields terminated by ','
stored as textfile
location 's3://sean_spectrum/';

Here’s how you query it:

mydb=# select * from spectrum_schema.sean_numbers order by id;
id | fname | lname | phone
12 | Scarlet | Witch | 23-35-47-555
15 | Wonder | Woman | 999-324-7878
21 | Dr. | Who | 44-22-55-77-88
32 | Quick | Silver | 22-33-77-99
35 | Bat | Man | 317-222-4777
75 | Marvel | Girl | 222-333-9595
99 | Storm | Cloud | 367-399-6767

Cool. We reordered data read from an S3 file!!!

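At the time of writing you can’t create a regular view over a Redshift table *AND* an S3 external table, but you can query them together. (Redshift later added late-binding views, created WITH NO SCHEMA BINDING, which can reference external tables.)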

So for example if I have a table in redshift with addresses, I can join them together:

mydb=# select a.id, a.fname, a.lname, a.phone, b.address from spectrum_schema.sean_numbers a, sean_addresses b
where a.id = b.id order by a.id;

id | fname | lname | phone | address
12 | Scarlet | Witch | 23-35-47-555 | 10 main st
15 | Wonder | Woman | 999-324-7878 | 25 center st
21 | Dr. | Who | 44-22-55-77-88 | 32 broadway
32 | Quick | Silver | 22-33-77-99 | 1 first st
35 | Bat | Man | 317-222-4777 | 99 west st
75 | Marvel | Girl | 222-333-9595 | 66 East Ave
99 | Storm | Cloud | 367-399-6767 | 50 North st


Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Will SQL just die already?

With tons of new NoSQL database offerings every day, developers & architects have a lot of options. Cassandra, MongoDB, CouchDB, DynamoDB & Firebase to name a few.


What’s more, in the data warehouse space you have Hadoop, which can churn through terabytes of data and get you results back before lunchtime!

So when I stumbled on this article SQL is 43 years old, I was intrigued.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But what if later on you want to ask questions like: which users watched this video, which users are active, which users spent $100 in January? These questions may be hard to answer, because most NoSQL stores can’t join those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?
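To make that concrete, here’s a minimal sketch using Python’s built-in sqlite3 module. The users and purchases tables, their columns and the sample rows are all made up for illustration. The point is only that a join plus an aggregate answers a question nobody planned for when the tables were designed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchases (user_id INTEGER, amount REAL, purchased_on TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO purchases VALUES
    (1, 60.0, '2017-01-05'),
    (1, 55.0, '2017-01-20'),
    (2, 30.0, '2017-01-11'),
    (3, 120.0, '2017-02-02');
""")

# Which users spent $100 or more in January?
big_spenders = conn.execute("""
    SELECT u.name, SUM(p.amount) AS total
    FROM users u
    JOIN purchases p ON p.user_id = u.id
    WHERE p.purchased_on >= '2017-01-01' AND p.purchased_on < '2017-02-01'
    GROUP BY u.id, u.name
    HAVING SUM(p.amount) >= 100
    ORDER BY u.name
""").fetchall()

print(big_spenders)  # [('alice', 115.0)]
```

Carol spent more overall, but in February, so only alice comes back.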


Big Query, Redshift & even Hive speak SQL

I wrote that despite the recent popularity of Hadoop, Redshift seems to be eating its lunch. And what would you know, surprise surprise, Amazon’s newish data warehousing solution speaks SQL! What’s more, there’s Apache Hive, which allows you to query Hadoop with, drumroll please… SQL!

BigQuery is the other major big data offering, from none other than Google. And it too uses SQL!


Still dominant

If you look at Stack Overflow’s developer survey, you’ll see that SQL is the second most popular language. Why might that be? For one thing it’s simple to learn. Simple enough that even business users can write basic requests that join & aggregate data.


Rugged, Proven & Open

SQL, having been around so long, is a fairly open standard. Sure there are extensions of it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That’s a win for everybody.


Business users can write it

Another underappreciated feature, though, is that basic queries are easy to write. They don’t require complex syntax like a Hadoop job, or your favorite imperative programming language. The queries are readable, almost English-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!


Is Amazon about to disrupt your data warehouse?


Amazon is about to launch a product called Glue. As you can see below, this is the last piece in the data warehousing puzzle. With that in place, Amazon will own you! Or at least have push-button products to meet all of an enterprise’s varying needs.

Even if you’re a small startup, you can do big-shot big enterprise data warehousing. That means everyone can use cutting edge data driven techniques for product & business decisions.


What is Redshift

Redshift is like the OLAP databases of years past, the Oracles of the world, purpose-built for warehousing data. Obviously without the crazy licensing model Oracle was famous for. With Amazon you can get an enterprise-class data warehouse for modest hourly prices.

If my recent conversations with recruiters about Redshift demand are any indication, there’s been a sudden uptick in startups looking for redshift expertise.


What is Spectrum?

Spectrum is a very new extension of Redshift allowing you to access & query S3 file data directly. This means you can have petabytes of data that you can access pre-load time. So you will ETL and load portions of it, but with Spectrum you can still access the offline data too.

In the old Oracle days this was called an EXTERNAL TABLE. I mention this only to say that Amazon isn’t doing anything that hasn’t been done before. Rather they’re bringing these advanced features within reach of everyday startups. That’s cool.


What is glue?

Glue is still in beta, but if the re:Invent talk above is any indication, it’s set to disrupt an entire industry. Wow!

Glue first catalogs your data sources. That means it scans them & models their schemas.

It then generates sample python ETL code. Modify it, or write your own. Share your code on Git. Or borrow other open source pieces, that already address your specific ETL use case!

Lastly it includes a job scheduler which handles dependencies. Job A must be completed before B can run and so forth. Error handling & logging are also all included.
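The “job A must finish before job B” idea is just a dependency graph. Here’s a minimal sketch of how a scheduler could work out a valid run order, using graphlib from the Python standard library (3.9+). This is not Glue’s API, and the job names are invented for the example.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return job names in an order that respects the dependencies.

    `dependencies` maps each job to the set of jobs that must finish first.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical ETL jobs: extract, then transform, then load.
jobs = {
    "load_warehouse": {"transform_orders", "transform_users"},
    "transform_orders": {"extract_orders"},
    "transform_users": {"extract_users"},
}

order = run_order(jobs)
print(order)
```

Several orders are valid here, so the exact list may vary, but every extract job always lands before its transform, and the load comes last.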

Since these are native Amazon services, of course they’re going to integrate with their dangerously fast Redshift warehouse.


What is serverless?

I’ve written about how to throw fastballs at a serverless fanboy and even how to hire a serverless expert. But really what is it?

Serverless means deploying functions directly into the cloud. No servers, no configuration. All the systems administration & automation is hidden. No more devops to argue with! Amazon’s own offering is called Lambda.


What is Quicksight?

Amazon’s even jumped into the fray at the presentation layer. QuickSight is a BI tool along the lines of Mode, Domo, Looker or Tableau.

Now it’s possible to stay completely within the cozy Amazon ecosystem even for business insight and analytics.


What products & improvements are new on AWS?

Amazon is releasing new products & services to its global cloud compute network at a rate that has all of our heads spinning.


Here’s new stuff worth mentioning around databases & data.


1. ETL with Glue

Moving data from your transactional MySQL or Aurora database to your reporting database isn’t always easy.

In the past you could use a service like xplenty or Alooma.

Now Amazon themselves are getting into the ETL game, providing a new service called Glue.


2. Query S3 with Athena

Chances are if you’re using AWS for anything, you’ve got data in S3. And wouldn’t it be nice to pick that apart and dig through it, where it sits?

Oracle had a feature called “external tables” and MySQL had something similar. Now Amazon is offering that natively within its own cloud universe. Under the hood Athena runs on Presto, but don’t worry how they did it, because it’s been packaged into a nice easy service for your use!


3. Business Intelligence with QuickSight

If you’re a data driven startup, and who isn’t these days, you’re going to have a business unit building reports. Tableau or Looker may be in your wheelhouse.

Amazon is obviously seeing the opportunity here, and competing with their own partners. Check out Amazon Quicksight for details.


4. Expanded RDS

RDS is obviously a very popular offering. And even though zero downtime is very hard to achieve with RDS, you’ll save plenty on DBAs and admins you don’t have to hire!

If you hadn’t heard, there is now MariaDB support. And with it, there’s a migration path from MySQL to MariaDB as well.

Using MariaDB may bring you performance advantages & improvements. But RDS may mitigate this by productizing & standardizing things.

You can also now move encrypted snapshots across regions. In my view this isn’t really a new feature, but rather fixing something that was broken before. The previous limitation was really more a symptom of their global network of data centers, than any built feature per se.


5. Expanded Redshift

As I’ve blogged before, everybody is excited about Redshift these days.

Amazon has introduced some new features.

o better loading of sorted data

This is done behind the scenes to load data quickly, and keep it stored efficiently. No more vacuuming after a big load!

o user & database rate limiting

Limit connections on a per user or per database level. Useful!

o storage estimates on analyze

When you perform the analyze command, you can get storage information so it’s easier to decide datatypes & compression type. Nifty!


5 tech challenges I’m thinking about today


Technical operations & startup tech are experiencing an incredible upheaval which is bringing a lot of great things.


Here are some of the questions it raises for me.

1. Are we adopting Docker without enough consideration?

Container deployments are accelerating at a blistering pace. I was reading Julian Dunn recently, and he had an interesting critical post Are container deployments like an oncoming train?

He argues that we should be wary of a few trends. One is taking legacy applications and blindly containerizing them. Now we can keep them alive forever. 🙂 He also argues that there is a tendency for folks who aren’t particularly technical or qualified to start evangelizing it everywhere. A balm for every ailment!


2. Is Redshift supplanting hadoop & spark for startup analytics?

In a recent blog post I asked Is Redshift outpacing hadoop as the big data warehouse for startups.

On the one hand this is exciting. Speed & agility are always good, right? But what of more Amazon & vendor lock-in?


3. Does devops automation make all of operations a software development exercise?

I asked this question a while back on my blog. Is automation killing old-school operations?

Automation suites like Chef & Puppet are very valuable, in enabling the administration of fleets of servers in the cloud. They’re essential. But there’s some risk in moving further away from the bare metal, that we might weaken our everyday tuning & troubleshooting skills that are essential to technical operations.


4. Is the cloud encouraging the old pattern of throwing hardware at the problem?

Want to scale your application? Forget tighter code. Don’t worry about tuning SQL queries that could be made 1000x faster. We’re in the cloud. Just scale out!

That’s right, with virtualization we can elastically scale anything. Infinitely. 🙂

I’ve argued that throwing hardware at the problem is like kicking the can down the road. Eventually you have to pay your technical debt & tune your application.


5. Is Amazon disrupting venture capital itself?

I’m no expert on the VC business. But Ben Thompson & James Allworth surely are. And they suggested that because of AWS, startups can set up their software for pennies.

This resonates loud & clear for me. Why? Because in the 90’s I remember startups needing major venture money to buy Sun hardware & Oracle licenses to get going. A half million easy.

They asked Is Amazon Web Services enabling AngelList syndicates to disrupt the Venture capital business? That’s a pretty interesting perspective. It would be ironic if all of this disruption that VCs bring to entrenched businesses began to unravel their own business!


Is Redshift outpacing Hadoop as the big data warehouse for startups?


More and more startups are looking at Redshift as a cheaper & faster solution for big data & analytics.


Saggi Neumann posted a pretty good side-by-side comparison of Redshift & Hadoop and concluded that the winner depends on your individual use case.

Meanwhile Bitly engineering concluded Redshift was much easier.

1. More agile

One thing pointed out by the bitly blog post, which I’ve seen countless times, is the slow iteration cycle. Write your map-reduce job, run, test, debug, then run on your cluster. Wait for it to return and you might feel like you’re submitting a stack of punched cards. LOL Resolve the errors that come back and then rerun on your cluster. Over & over & over again.

With Redshift you’re writing SQL, so you’re iterating through syntax errors quickly. What’s more, since Redshift is a columnar, compressed database, you can do full table scans on columns without indexes.

What that means for you and me is that queries just run. And they run blazingly fast!
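Why does column storage make those scans cheap? Here’s a toy sketch, not Redshift’s actual storage engine. It pivots some made-up rows into one list per column, so an aggregate only touches the column it needs, and then run-length encodes a repetitive column to show why columns compress so well.

```python
from itertools import groupby

# Hypothetical row-oriented data.
rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "US", "amount": 20},
    {"id": 3, "country": "US", "amount": 30},
    {"id": 4, "country": "DE", "amount": 40},
    {"id": 5, "country": "DE", "amount": 50},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# The aggregate reads only the "amount" column, never id or country.
total = sum(columns["amount"])

# Repetitive values shrink to a couple of pairs.
encoded = run_length_encode(columns["country"])

print(total)    # 150
print(encoded)  # [('US', 3), ('DE', 2)]
```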


2. Cheap

Redshift is pretty darn cheap.

Saggi’s article above quotes Redshift at $1000/TB/yr for reserved, and $3700/TB/yr for on-demand. This compared with a hadoop cluster at $5000/TB/yr.
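To see what those per-terabyte figures mean for a real warehouse, here’s the arithmetic as a tiny sketch. The prices are the ones quoted above, while the 10 TB warehouse size is just an assumption for the example.

```python
# Prices per terabyte per year, as quoted from Saggi's article.
PRICE_PER_TB_YEAR = {
    "redshift_reserved": 1000,
    "redshift_on_demand": 3700,
    "hadoop": 5000,
}


def yearly_cost(terabytes):
    """Yearly cost in dollars for each option at the given data size."""
    return {name: price * terabytes for name, price in PRICE_PER_TB_YEAR.items()}


costs = yearly_cost(10)  # hypothetical 10 TB warehouse
print(costs)
# {'redshift_reserved': 10000, 'redshift_on_demand': 37000, 'hadoop': 50000}
```

At these prices reserved Redshift comes in at a fifth of the Hadoop cluster’s cost, whatever the data size.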

But neither will come within spitting distance of the old world of Oracle, where customers host big iron servers in their own datacenter, paying north of a million dollars between hardware & license costs. Amazon cloud FTW!


3. Even faster

Airbnb’s nerds blog has a post showing it costing 25% of a Hadoop cluster, and getting a 5x performance boost. That’s pretty darn impressive!

Flydata has done benchmarks showing 10x speedup.


4. SQL Toolchains



5. Limitations

o data loading

You load data into Redshift using the COPY command. This command reads flat files from S3 and dumps them into tables. It can be extremely fast if you do things in parallel. However getting your data into those flat files is up to you.
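Doing things in parallel mostly means giving COPY several files under one S3 prefix instead of one big file. Here’s a minimal sketch of that prep work. The helper names, the table name, the prefix and the role ARN are all placeholders I made up, so adapt the COPY options to your own file format.

```python
def split_round_robin(lines, parts):
    """Deal lines out across `parts` chunks so the files end up roughly equal."""
    chunks = [[] for _ in range(parts)]
    for index, line in enumerate(lines):
        chunks[index % parts].append(line)
    return chunks


def copy_statement(table, s3_prefix, iam_role):
    """Build a COPY statement that loads every file sharing the S3 prefix."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' DELIMITER ',';"
    )


if __name__ == "__main__":
    # Hypothetical input: five CSV lines split into two files.
    chunks = split_round_robin(["a", "b", "c", "d", "e"], 2)
    print(chunks)  # [['a', 'c', 'e'], ['b', 'd']]
    print(copy_statement(
        "sean_addresses",
        "s3://my-load-bucket/addresses/part_",
        "arn:aws:iam::999999999999:role/MyLoadRole",
    ))
```

Write each chunk to its own file (part_00, part_01 and so on), upload them under the prefix, and a single COPY picks them all up.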

There are a few solutions to this.

– amazon data pipeline

This is Amazon’s own toolchain, which allows you to move data from RDS & other Amazon hosted data sources. Data pipeline does not move data realtime, but in batch. Also it doesn’t take care of schema changes so you have to do that manually.

I mentioned it in my 5 reasons to move data to Amazon Redshift

– Flydata service

Flydata is a service with a monthly subscription which will connect to your RDS database, and move the data into Redshift. This seems like a no brainer, and given the hefty price tag of thousands per month, you’d expect it to cover your bases.

In my experience there are a lot of problems & it still required a lot of administration. When schema changes happen, those have to be carefully applied on Redshift. What’s more there’s no silver bullet around the datatype differences.


Flydata also makes use of the binary logs to replicate your data. Anything that doesn’t show up in the binary logs is going to cause you trouble. That includes when you set sql_log_bin=0 in the session, or when an SQL statement includes a no-logging hint. Also watch out for replicate-ignore-db options in your my.cnf. It will also fail if you use ON DELETE CASCADE. That’s because these downstream deletes are carried out by the foreign key constraint inside MySQL, so they never show up in the binary logs as changes of their own. And… drumroll please, Redshift doesn’t support ON DELETE CASCADE. In our case the child tables ended up with extra rows, and some queries broke.
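One way to catch those extra child rows is an anti-join that looks for children whose parent is gone. Here’s a minimal sketch using sqlite3 so it runs anywhere. The orders and order_items tables are invented for the example, but the query itself is plain SQL that should carry over to Redshift with your own table names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY);
CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER);
-- Order 2 was deleted upstream, but its items were never removed here.
INSERT INTO orders VALUES (1), (3);
INSERT INTO order_items VALUES (10, 1), (11, 2), (12, 2), (13, 3);
""")

# Child rows with no matching parent are the orphans.
orphans = conn.execute("""
    SELECT c.id
    FROM order_items c
    LEFT JOIN orders p ON p.id = c.order_id
    WHERE p.id IS NULL
    ORDER BY c.id
""").fetchall()

print(orphans)  # [(11,), (12,)]
```

Running a check like this after each sync tells you quickly whether cascaded deletes are being dropped.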

– scripts such as Donors choose loader

Donors Choose has open sourced their nightly Redshift loader script. It appears to reload all data each night. This will nicely sidestep the ON DELETE CASCADE problem. As you grow though you may quickly hit a point where you can’t load the entire data set each night.

Their script sources from Postgres, though I’m interested to see if it can be modified for MySQL RDS.

– Tried & failed with Tungsten replicator

Theoretically Tungsten replicator can do the above. What’s more it seems like a tool custom made for such a use case. I tried for over a month to troubleshoot. I worked closely with the team to iron out bugs. I wrote wrestling with bears or how I tamed Tungsten replicator for MySQL and then I wrote a second article Tungsten replicator the good the bad & the ugly. Ultimately I did get some data moving between MySQL RDS & Redshift, however certain data clogged the system & it wouldn’t work for any length of time.


o data types & character sets

There are a few things here to keep in mind. Redshift counts bytes, not characters, so if in MySQL or some other database you had a varchar(5), it may need to be varchar(20) in Redshift. Even then I had cases where it still didn’t fit & I had to make the field bigger by 4.
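The varchar(5) to varchar(20) jump comes from UTF-8 using up to four bytes per character. Here’s a small sketch that measures the byte length of sample values so you can size a column from real data. The function name and the sample strings are made up for the example.

```python
def redshift_varchar_size(values, headroom=0):
    """Smallest byte length that fits every value once encoded as UTF-8."""
    longest = max((len(value.encode("utf-8")) for value in values), default=0)
    return longest + headroom


# Every sample is five characters long, but the byte counts differ.
samples = ["hello", "héllo", "日本語のテ", "😀😀😀😀😀"]

for text in samples:
    print(len(text), len(text.encode("utf-8")))
# 5 5
# 5 6
# 5 15
# 5 20

print(redshift_varchar_size(samples))  # 20
```

Run it over a sample of the source column and add some headroom, rather than multiplying every width by four blindly.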

I also ran into problems around string character encodings. According to the docs Redshift handles 4-byte UTF-8.

Redshift doesn’t support ARRAYs, BIT, BYTEA, some DATE/TIME types (INTERVAL, for one), ENUM, JSON and a bunch of others. So don’t go into it expecting full Postgres support.

What you will get are multibyte characters, numeric, character, datetime, boolean and some type conversion.


o rebalancing

If and when you want to add nodes, expect some downtime. Yes, theoretically the database is online while it’s shipping data to the new nodes & redistributing things, but the latency can start to feel like an outage. What’s more, it can easily push into the hours to do.


Replicate MySQL to Amazon Redshift with Tungsten: The good, the bad & the ugly


Heterogeneous replication involves moving data from one database platform to another. This is a complicated endeavor because datatypes, date & time formats, and a whole lot more tend to differ across platforms. In fact it’s so complex many enterprises simply employ a commercial solution to take away the drudgery.


Enter Tungsten, which supports these types of deployments on platforms such as PostgreSQL, MongoDB, Oracle, Redshift & Vertica. With custom-built appliers the field is infinite!

With that I’ve set out to get things working with Amazon Redshift. If you’re still struggling with the basics check out Wrestling with bears or how I tamed Tungsten Replicator.

1. Connect to redshift

The first thing you’ll need to do is allow your Tungsten boxes to reach Redshift. Seems obvious, but when you’re juggling all these apples & oranges for the first time, it may slip your mind.

Configure your AWS security group to allow tungsten boxes

Get the external IP address of your tungsten box. If it’s in DNS, ping will print the resolved address even if the ping itself gets no reply.

$ ping tungsten01.mydomain.net

Add to your Redshift security config. I created a special group called Tungsten and added the two tungsten boxes by IP address. That’s because these machines were on a different AWS account. If they’re on the same account, you could allow the entire EC2 group, and be done.

Install psql client

The best way I found to test the connection was psql. Install that:

$ apt-get install postgresql-client

Verify your connection:

$ psql -p 5439 -h my-dw.aaaa.us-east-1.redshift.amazonaws.com --username=root -d dwh
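If psql just hangs, it helps to separate “the security group is blocking me” from “my credentials are wrong”. Here’s a minimal sketch that only checks whether the port is reachable. The function name is made up, and 5439 is the port used in the psql command above.

```python
import socket


def can_connect(host, port=5439, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# Example (replace the host with your own cluster endpoint):
# can_connect("my-dw.aaaa.us-east-1.redshift.amazonaws.com")
```

A False result from the Tungsten box points at the security group or network path rather than at Redshift users or passwords.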


2. Configure S3 access

Tungsten uses S3 heavily to move data into Redshift.

(I outlined this previously in 5 Reasons to move data to Amazon Redshift.)

Install s3tools package

Tungsten uses the s3cmd to interface with the Amazon S3 API. Let’s install that:

$ apt-get install s3cmd

Now edit the .s3cfg file of the tungsten user. Change these two settings:

access_key = AAAAAAA
secret_key = BBBBBBB

Lastly edit the tungsten /opt/continuent/share/s3-config-redshift.json. There are four parameters.

{
"awsS3Path" : "s3://tungstenbucket",
"awsAccessKey" : "AAAAAAA",
"awsSecretKey" : "BBBBBBB",
"cleanUpS3Files" : "false"
}
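A typo in this file is easy to make and hard to spot once the applier is running. Here’s a minimal sketch of a sanity check you could run first. It is not part of Tungsten. The function name is made up, and the four required keys are simply the ones listed above.

```python
import json

# The four parameters named in this post.
REQUIRED_KEYS = {"awsS3Path", "awsAccessKey", "awsSecretKey", "cleanUpS3Files"}


def check_s3_config(text):
    """Return a list of problems found in the config text (empty means OK)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    path = config.get("awsS3Path", "")
    if path and not str(path).startswith("s3://"):
        problems.append("awsS3Path should start with s3://")
    return problems
```

Read the file with open(...).read(), pass the text in, and fix whatever comes back before bringing the replicator online.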


3. Create tables on Redshift

In a heterogeneous environment, that is, where source and destination databases are different platforms, Tungsten cannot create tables for you.

It will however, give you a helping hand in the process. Enter the ddlscan tool, which scans the CREATE TABLE statements on your source database, and generates them for your target platform.

For each table in source database, there will be a stage table in Redshift:

$ ddlscan jdbc:mysql://localhost:3306/test -user sync -db test -template ddl-mysql-redshift-staging.vm > test_stage.sql

$ cat test_stage.sql
SQL generated on Thu Jun 04 20:06:45 UTC 2015 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://tungsten01.mydomain.net:3306/test?jdbcCompliantTruncation=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDateType=false
user = sync
dbName = test


DROP TABLE test.stage_xxx_sean;
CREATE TABLE test.stage_xxx_sean
(
tungsten_opcode CHAR(2),
tungsten_seqno INT,
tungsten_row_id INT,
tungsten_commit_timestamp TIMESTAMP,
c1 VARCHAR(256) /* VARCHAR(64) */,
id INT,
PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
);

And also a base table in redshift:

$ ddlscan jdbc:mysql://localhost:3306/test -user sync -db test -template ddl-mysql-redshift.vm > test.sql

$ cat test.sql
SQL generated on Thu Jun 04 20:06:51 UTC 2015 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://tungsten01.mydomain.net:3306/test?jdbcCompliantTruncation=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDateType=false
user = sync
dbName = test


DROP TABLE test.sean;
CREATE TABLE test.sean
c1 VARCHAR(256) /* VARCHAR(64) */,
id INT,

Lastly apply those scripts to your redshift database:

$ psql
dwh# \i test_stage.sql
dwh# \i test.sql


4. Troubleshoot applier

Encountered “Delimiter Not Found” issue

This issue was mysterious and remains so a bit. What I did to fix it:

I had an issue with the path, but fixed that:

  "awsS3Path" : "s3://tungstenbucket",

It was causing an interim bucket to be created. But that did not solve things.

Ok. So I hacked this a bit.

Can anyone help me troubleshoot what happened & why?

A. I skipped transactions

I brought the applier back online with this command.

trepctl -service redshift online -skip-seqno 1,1-100

B. I did lots of inserts & deletes on MySQL

I then did about 200 of these:

mysql> insert into test.sean values ('hi there', 20);
mysql> delete from test.sean where id = 20;

C. Now seeing data

dwh=# select * from test.sean;
                 c1                  | id 
 working......                       | 25
 hello sean i have an exclamation !! | 27
 hello sean i came from mysql        | 26
(3 rows)

I also set cleanUpS3Files to false. Now I’m seeing files like this:

So that indicates all those INSERTs followed by DELETEs cleaned things up.


5. Test data & table changes

B. Tested INSERT

At first the CSV files were getting cleaned up by Tungsten. I added this option to the s3-config-redshift.json file:

"cleanUpS3Files" : "false",

Then the files are kept around so we can review them. An insert record shows up in S3 like this:

"I","417","1","2015-06-05 17:44:35.000","tungsten new csv file? ","33",null

C. Tested DELETE

A DELETE record shows up in S3 like this:

"D","419","1","2015-06-05 17:45:48.000",null,"26",null

D. Tested UPDATE

An UPDATE record shows up in S3 like this:

"D","420","1","2015-06-05 17:48:55.000",null,"31",null
"I","420","2","2015-06-05 17:48:55.000","changed message text for redshift+tungsten update","31",null
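Reading those records, an UPDATE is simply a delete followed by an insert with the same sequence number. Here’s a minimal sketch that replays staged records against an in-memory table keyed by id. It is my own model of what the records mean, not how Tungsten’s batch script applies them. The function names and the starting table contents are invented for the example.

```python
import csv
import io


def parse_staging(text):
    """Parse staged CSV rows into (opcode, seqno, row_id, c1, id) tuples."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        opcode, seqno, row_id, _timestamp, c1, key = row[:6]
        # Simplification: the csv module can't tell a bare null from the
        # quoted string "null", so both are treated as None here.
        value = None if c1 == "null" else c1
        records.append((opcode, int(seqno), int(row_id), value, int(key)))
    return records


def apply_staging(table, records):
    """Replay records in (seqno, row_id) order: D deletes by id, I inserts."""
    for opcode, _seqno, _row_id, c1, key in sorted(records, key=lambda r: (r[1], r[2])):
        if opcode == "D":
            table.pop(key, None)
        elif opcode == "I":
            table[key] = c1
    return table


staged = (
    '"D","420","1","2015-06-05 17:48:55.000",null,"31",null\n'
    '"I","420","2","2015-06-05 17:48:55.000","changed message text for redshift+tungsten update","31",null\n'
    '"D","419","1","2015-06-05 17:45:48.000",null,"26",null\n'
)

# Hypothetical starting state of the base table, as {id: c1}.
table = {26: "hello sean i came from mysql", 31: "old text"}
apply_staging(table, parse_staging(staged))
print(table)  # {31: 'changed message text for redshift+tungsten update'}
```

Row 26 is deleted, and row 31 is deleted and reinserted with the new text, which is the UPDATE.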


As mentioned previously, ALTER TABLE is *NOT* supported. However after doing the ALTER, the applier does *NOT* go offline. Also there are no errors. That’s because Tungsten does not support these and will filter them in a heterogeneous environment.

The applier *DOES* go offline, after you try a new INSERT. That’s because it gets a new record for INSERT that doesn’t match.

“trepctl status” shows the following:

pendingExceptionMessage: CSV loading failed: schema=test table=sean CSV file=/tmp/staging/redshift/staging0/test-sean-413.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Load into table ‘stage_xxx_sean’ failed. Check ‘stl_load_errors’ system table for details. (../../tungsten-replicator//samples/scripts/batch/redshift.js#145)

To fix it, I added the new column to both the base table and the staging table on Redshift:

redshift# alter table test.sean add column c3 integer default null;

redshift# alter table test.stage_xxx_sean add column c3 integer default null;

Then I brought the applier back online:

$ trepctl -service redshift online

Then check the status. It should say ONLINE for state.

$ trepctl status
Processing status command...
---- -----
appliedLastEventId : mysqld-bin.000022:0000000000000566;-1
appliedLastSeqno : 424
appliedLatency : 300585.739
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : redshift
currentEventId : NONE
currentTimeMillis : 1433878195573
dataServerHost : my-dw.aaaa.us-east-1.redshift.amazonaws.com
extensions :
host : my-dw.aaaa.us-east-1.redshift.amazonaws.com
latestEpochNumber : 0
masterConnectUri : thl://tungsten01.mydomain.net:2112/
masterListenUri : null
maximumStoredSeqNo : 424
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://tungsten01.mydomain.net:2112/
relativeLatency : 304511.573
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : redshift
serviceType : local
simpleServiceName : redshift
siteName : default
sourceId : my-dw.aaaa.us-east-1.redshift.amazonaws.com
state : ONLINE
timeInStateSeconds : 351940.007
timezone : GMT
transitioningTo :
uptimeSeconds : 600921.759
useSSLConnection : false
version : Tungsten Replicator 4.0.0 build 18
Finished status command...

Lastly, let’s see what’s in the table, fire up the postgresql shell and take a look:

dwh=# select * from test.sean;
c1 | id | c3
working...... | 25 |
hello sean i have an exclamation !! | 27 |
hello will i break? | 30 |
some more records | 32 |
tungsten new csv file? | 33 |
another tungsten csv file? | 34 |
changed message text for redshift+tungsten update | 31 |
(7 rows)


Business Agility at AWS re:Invent


Although I couldn’t be in Vegas to attend re:Invent, there is so much online it’s almost better than being at the conference. From an ongoing live stream of keynotes and sessions, to an archived collection on YouTube.

The big wins

You may have heard of all the great things that Amazon or cloud computing can do, but I thought Andy Jassy summarized these nicely in these six points.

1. Replace capex with opex
2. Lower total cost of ownership
3. No guessing about capacity
4. Encourage agility & innovation
5. Differentiation
6. Global from the start


By far the biggest announcement at the show is Amazon’s new Redshift product. It is a fully managed data warehouse solution that scales to petabytes in its cloud. Currently there are two business intelligence tools that are supported, namely Jaspersoft and MicroStrategy.

In 2003 Amazon was a 5 billion dollar company. Today AWS adds as much infrastructure capacity as that whole company ran on, every day, to its availability zones!

Reduced prices by 25% for S3

As a lot of folks know, Amazon has always been about cheaper prices. That model has been disruptive in the book selling industry, and in a huge way in the infrastructure and datacenter industry. As more customers sign up, economies of scale mean they can offer the same hardware & services for lower prices.

With that they’re announcing lower prices for S3 by a whopping 25%. To me this speaks to their continuing push to dominate the market by driving prices downward.

Amazon’s Channel on Youtube

If you weren’t able to attend the conference, or want to recap some highlights you might have missed, they have put up a great AWS Channel on Youtube.

Some of the speakers include Sharon Chiarella, VP, Mechanical Turk, Glenn Hazard, CEO, Xceedium, Todd Barr, CMO of Alfresco, Bright Fulton, Operations for Swipely, Colin Percival, FreeBSD Developer, Ted Dunning, Chief Application Architect of MapR Technologies, James Broberg, CTO & Founder of MetaCDN, Mitchell Garnaat, Sr. Engineer, David Etue, Vice President, SafeNet, and Mike Culver, Sr. Consultant, to name just a few.
