Tag Archives: redshift

5 tech challenges I’m thinking about today

Technical operations & startup tech are experiencing an incredible upheaval which is bringing a lot of great things.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Here are some of the questions it raises for me.

1. Are we adopting Docker without enough consideration?

Container deployments are accelerating at a blistering pace. I was reading Julian Dunn recently, and he had an interesting, critical post: Are container deployments like an oncoming train?

He argues that we should be wary of a few trends. One is taking legacy applications and blindly containerizing them. Now we can keep them alive forever. :) He also argues that folks who aren’t particularly technical or qualified tend to start evangelizing it everywhere. A balm for every ailment!

Also: Is Amazon too big to fail?

2. Is Redshift supplanting hadoop & spark for startup analytics?

In a recent blog post I asked: Is Redshift outpacing Hadoop as the big data warehouse for startups?

On the one hand this is exciting. Speed & agility are always good, right? But what of more Amazon & vendor lock-in?

Related: Did Dropbox have to fail?

3. Does devops automation make all of operations a software development exercise?

I asked this question a while back on my blog. Is automation killing old-school operations?

Automation suites like Chef & Puppet are very valuable in enabling the administration of fleets of servers in the cloud. They’re essential. But there’s some risk that in moving further away from the bare metal, we might weaken the everyday tuning & troubleshooting skills that are essential to technical operations.

Read: When hosting data on Amazon turns bloodsport?

4. Is the cloud encouraging the old pattern of throwing hardware at the problem?

Want to scale your application? Forget tighter code. Don’t worry about tuning SQL queries that could be made 1000x faster. We’re in the cloud. Just scale out!

That’s right, with virtualization we can elastically scale anything. Infinitely. :)

I’ve argued that throwing hardware at the problem is like kicking the can down the road. Eventually you have to pay your technical debt & tune your application.

Also: Are SQL databases dead?

5. Is Amazon disrupting venture capital itself?

I’m no expert on the VC business. But Ben Thompson & James Allworth surely are. And they suggested that because of AWS, startups can set up their software for pennies.

This resonates loud & clear for me. Why? Because in the 90’s I remember startups needing major venture money to buy Sun hardware & Oracle licenses to get going. A half million easy.

They asked: Is Amazon Web Services enabling AngelList syndicates to disrupt the venture capital business? That’s a pretty interesting perspective. It would be ironic if all of this disruption that VCs bring to entrenched businesses began to unravel their own business!

Also: Are we fast approaching cloud-mageddon?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is Redshift outpacing Hadoop as the big data warehouse for startups?

More and more startups are looking at Redshift as a cheaper & faster solution for big data & analytics.

Saggi Neumann posted a pretty good side-by-side comparison of Redshift & Hadoop and concluded it’s a tie, depending on your individual use case.

Meanwhile Bitly engineering concluded Redshift was much easier.

1. More agile

One thing pointed out by the bitly blog post, which I’ve seen countless times, is the slow iteration cycle. Write your map-reduce job, run, test, debug, then run on your cluster. Wait for it to return and you might feel like you’re submitting a stack of punched cards. LOL Resolve the errors that come back and then rerun on your cluster. Over & over & over again.

With Redshift you’re writing SQL, so you’re iterating through syntax errors quickly. What’s more, since Redshift is a column-compressed database, you can do full table scans on columns without indexes.

What that means for you and me is that queries just run. And they run blazingly fast!

Also: When hosting data on Amazon turns bloodsport

2. Cheap

Redshift is pretty darn cheap.

Saggi’s article above quotes Redshift at $1000/TB/yr for reserved, and $3700/TB/yr for on-demand. That compares with a Hadoop cluster at $5000/TB/yr.

But neither will come within spitting distance of the old world of Oracle, where customers host big iron servers in their own datacenter, paying north of a million dollars between hardware & license costs. Amazon cloud FTW!
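
To put the per-terabyte figures quoted above in perspective, here is a quick back-of-the-envelope sketch in Python. The rates are the ones from Saggi’s comparison; the 10 TB warehouse size is purely an assumption for illustration.

```python
# Annual cost per TB, as quoted in the comparison above.
RATES_PER_TB_YEAR = {
    "redshift_reserved": 1000,
    "redshift_on_demand": 3700,
    "hadoop": 5000,
}


def annual_cost(platform: str, terabytes: float) -> float:
    """Return the yearly cost in dollars for a warehouse of the given size."""
    return RATES_PER_TB_YEAR[platform] * terabytes


if __name__ == "__main__":
    size_tb = 10  # hypothetical warehouse size, not a figure from the article
    for platform in RATES_PER_TB_YEAR:
        print(f"{platform}: ${annual_cost(platform, size_tb):,.0f}/yr")
```

These are list-price comparisons only; they leave out the people and time it takes to run either platform.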

Related: Did dropbox have to fail?

3. Even faster

Airbnb’s nerds blog has a post showing Redshift costing 25% of a Hadoop cluster, and getting a 5x performance boost. That’s pretty darn impressive!

Flydata has done benchmarks showing 10x speedup.

Read: Are SQL Databases dead?

4. SQL Toolchains

***

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

5. Limitations

o data loading

You load data into Redshift using the COPY command. This command reads flat files from S3 and dumps them into tables. It can be extremely fast if you do things in parallel. However getting your data into those flat files is up to you.
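
As a rough sketch of what such a load looks like, the Python snippet below only assembles the text of a COPY statement; it does not connect to anything. The table name, bucket path and key values are placeholders I made up, and you would run the resulting SQL through whatever Postgres client you use.

```python
def build_copy_statement(table: str, s3_path: str,
                         access_key: str, secret_key: str) -> str:
    """Assemble a Redshift COPY statement that loads CSV files from S3.

    COPY treats s3_path as a key prefix and loads every matching file,
    which is what allows the load to be spread across files in parallel.
    """
    credentials = (
        f"aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}"
    )
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"CREDENTIALS '{credentials}' "
        "CSV;"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_copy_statement("test.sean", "s3://mybucket/sean/",
                               "AAAAAAA", "BBBBBBB"))
```

Splitting the export into several files under one prefix is the usual way to get the parallel speedup mentioned above.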

There are a few solutions to this.

– amazon data pipeline

This is Amazon’s own toolchain, which allows you to move data from RDS & other Amazon hosted data sources. Data pipeline does not move data realtime, but in batch. Also it doesn’t take care of schema changes so you have to do that manually.

I mentioned it in my 5 reasons to move data to Amazon Redshift

– Flydata service

Flydata is a service with a monthly subscription which will connect to your RDS database, and move the data into Redshift. This seems like a no-brainer, and given the hefty price tag of thousands per month, you’d expect it to cover your bases.

In my experience there are a lot of problems & it still required a lot of administration. When schema changes happen, those have to be carefully applied on Redshift. What’s more there’s no silver bullet around the datatype differences.

Also: Some thoughts on 12 factor apps

Flydata also makes use of the binary logs to replicate your data. Anything that doesn’t show up in the binary logs is going to cause you trouble. That includes setting sql_log_bin=0 in the session, or an SQL statement that includes a no-logging hint. Also watch out for replicate-ignore-db options in your my.cnf. It will also fail if you use ON DELETE CASCADE. That’s because these downstream changes happen via the constraint inside MySQL, so they aren’t logged as changes of their own. And… drumroll please, Redshift doesn’t support ON DELETE CASCADE. In our case the child tables ended up with extra rows, and some queries broke.
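
One way to spot that kind of drift is an anti-join that looks for child rows whose parent is gone. The sketch below uses Python’s built-in sqlite3 as a stand-in for the warehouse, with hypothetical parent and child tables; the same SELECT should translate to Redshift with little change, though I haven’t run it there.

```python
import sqlite3

# Anti-join: child rows that point at a parent id that no longer exists.
ORPHAN_QUERY = """
SELECT c.id
FROM child c
LEFT JOIN parent p ON p.id = c.parent_id
WHERE p.id IS NULL
ORDER BY c.id
"""


def find_orphans(conn: sqlite3.Connection) -> list:
    """Return ids of child rows whose parent row is missing."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory copy that missed a cascaded delete (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE parent (id INTEGER PRIMARY KEY);
        CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER);
        INSERT INTO parent VALUES (1), (2);
        INSERT INTO child VALUES (10, 1), (11, 2), (12, 2);
        -- The parent delete arrived, the cascaded child deletes did not.
        DELETE FROM parent WHERE id = 2;
    """)
    return conn


if __name__ == "__main__":
    print(find_orphans(demo_connection()))
```

Running a check like this after each load at least tells you when the child tables have picked up extra rows.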

– scripts such as Donors choose loader

Donors Choose has open sourced their nightly Redshift loader script. It appears to reload all data each night. This will nicely sidestep the ON DELETE CASCADE problem. As you grow though you may quickly hit a point where you can’t load the entire data set each night.

Their script sources from Postgres, though I’m interested to see if it can be modified for MySQL RDS.

– Tried & failed with Tungsten replicator

Theoretically Tungsten replicator can do the above. What’s more it seems like a tool custom made for such a use case. I tried for over a month to troubleshoot. I worked closely with the team to iron out bugs. I wrote wrestling with bears or how I tamed Tungsten replicator for MySQL and then I wrote a second article Tungsten replicator the good the bad & the ugly. Ultimately I did get some data moving between MySQL RDS & Redshift, however certain data clogged the system & it wouldn’t work for any length of time.

Also: Secrets of a happy Amazon hacker or how to lock down your account with IAM and multi-factor authentication

o data types & character sets

There are a few things here to keep in mind. Redshift counts bytes, so if in mysql or some other database you had a varchar(5) it may be varchar(20) in Redshift. Even then I had cases where it still didn’t fit & I had to make the field bigger by 4.
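
A quick way to see the difference between characters and bytes is to measure a few strings in Python. The helper below is only a sketch of how you might size a column from sample data; the sample strings are made up.

```python
def utf8_bytes(value: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def required_varchar(values) -> int:
    """Smallest byte-counted VARCHAR length that fits every sample value."""
    return max(utf8_bytes(v) for v in values)


if __name__ == "__main__":
    # Every sample is exactly 5 characters long.
    samples = [
        "hello",             # ASCII: 1 byte per character
        "h\u00e9llo",        # one accented letter: 2 bytes
        "\u65e5\u672c\u8a9e\u30c6\u30ad",  # Japanese: 3 bytes per character
        "\U0001F41F" * 5,    # emoji: 4 bytes per character
    ]
    for s in samples:
        print(f"{len(s)} chars -> {utf8_bytes(s)} bytes")
    print("VARCHAR length needed:", required_varchar(samples))
```

The worst case of 4 bytes per character is where the varchar(5) to varchar(20) jump comes from.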

I also ran into problems around string character encodings. According to the docs Redshift handles 4-byte UTF-8.

Redshift doesn’t support ARRAYs, BIT, BYTEA, some of the DATE/TIME types, ENUM, JSON and a bunch of others. So don’t go into it expecting full Postgres support.

What you will get are multibyte characters, numeric, character, datetime, boolean and some type conversion.

Also: Is the difference between dev & ops a four-letter word?

o rebalancing

If and when you want to add nodes, expect some downtime. Yes, theoretically the database stays online while it’s shipping data to the new nodes & redistributing things, but the latency can start to feel like an outage. What’s more, the resize can easily push into the hours.

Also: Is AWS enabling startups which enable AngelList Syndicates to boil the VC business?

Replicate MySQL to Amazon Redshift with Tungsten: The good, the bad & the ugly

Heterogeneous replication involves moving data from one database platform to another. This is a complicated endeavor because datatypes, date & time formats, and a whole lot more tend to differ across platforms. In fact it’s so complex many enterprises simply employ a commercial solution to take away the drudgery.

Enter Tungsten, which supports these types of deployments on platforms such as PostgreSQL, MongoDB, Oracle, Redshift & Vertica. With custom-built appliers the field is infinite!

With that I’ve set out to get things working with Amazon Redshift. If you’re still struggling with the basics check out Wrestling with bears or how I tamed Tungsten Replicator.

1. Connect to redshift

The first thing you’ll need to do is allow your Tungsten boxes to reach Redshift. Seems obvious, but when you’re juggling all these apples & oranges for the first time, it may slip your mind.

Configure your AWS security group to allow tungsten boxes

Get the external IP address of your tungsten box. If it’s in DNS, ping will print the address it resolves even if the ping itself gets no reply.


$ ping tungsten01.mydomain.net

Add 10.20.30.40/32 to your Redshift security config. I created a special group called Tungsten and added the two tungsten boxes by IP address. That’s because these machines were on a different AWS account. If they’re on the same account, you could allow the entire EC2 group, and be done.

Install psql client

The best way I found to test the connection was psql. Install that:


$ apt-get install postgresql-client

Verify your connection:


$ psql -p 5439 -h 10.20.10.20 --username=root -d dwh

Also: Are SQL Databases dead?

2. Configure S3 access

Tungsten uses S3 heavily to move data into Redshift.

(I outlined this previously in 5 Reasons to move data to Amazon Redshift.)

Install s3tools package

Tungsten uses the s3cmd to interface with the Amazon S3 API. Let’s install that:


$ apt-get install s3cmd

Now edit the .s3cfg file of the tungsten user. Change these two settings:


[default]
access_key = AAAAAAA
secret_key = BBBBBBB

Lastly edit the tungsten /opt/continuent/share/s3-config-redshift.json. There are four parameters.


{
"awsS3Path" : "s3://tungstenbucket",
"awsAccessKey" : "AAAAAAA",
"awsSecretKey" : "BBBBBBB",
"cleanUpS3Files" : "false"
}
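
Before restarting the replicator it can be worth checking that the file parses and carries all four keys. This is a small Python sketch of such a check, not part of Tungsten. Note that Python’s strict parser rejects things like a trailing comma after the last entry; I have not verified how lenient Tungsten’s own parser is.

```python
import json

REQUIRED_KEYS = {"awsS3Path", "awsAccessKey", "awsSecretKey", "cleanUpS3Files"}


def check_s3_config(text: str) -> list:
    """Return a list of problems found in the config text (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top level should be a JSON object"]
    problems = [f"missing key: {key}"
                for key in sorted(REQUIRED_KEYS - set(config))]
    path = config.get("awsS3Path")
    if path is not None and not str(path).startswith("s3://"):
        problems.append("awsS3Path should start with s3://")
    return problems


if __name__ == "__main__":
    # In practice, read the text from s3-config-redshift.json instead.
    sample = """{
    "awsS3Path" : "s3://tungstenbucket",
    "awsAccessKey" : "AAAAAAA",
    "awsSecretKey" : "BBBBBBB",
    "cleanUpS3Files" : "false"
    }"""
    print(check_s3_config(sample) or "config looks fine")
```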

Related: Is Oracle killing MySQL?

3. Create tables on Redshift

In a heterogeneous environment, that is, where source and destination databases are different platforms, Tungsten cannot create tables for you.

It will however, give you a helping hand in the process. Enter the ddlscan tool, which scans the CREATE TABLE statements on your source database, and generates them for your target platform.

For each table in source database, there will be a stage table in Redshift:


$ ddlscan jdbc:mysql://localhost:3306/test -user sync -db test -template ddl-mysql-redshift-staging.vm > test_stage.sql

$ cat test_stage.sql
/*
SQL generated on Thu Jun 04 20:06:45 UTC 2015 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://tungsten01.mydomain.net:3306/test?jdbcCompliantTruncation=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDateType=false
user = sync
dbName = test
*/

CREATE SCHEMA test;

DROP TABLE test.stage_xxx_sean;
CREATE TABLE test.stage_xxx_sean
(
tungsten_opcode CHAR(2),
tungsten_seqno INT,
tungsten_row_id INT,
tungsten_commit_timestamp TIMESTAMP,
c1 VARCHAR(256) /* VARCHAR(64) */,
id INT,
PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
);

And also a base table in redshift:


$ ddlscan jdbc:mysql://localhost:3306/test -user sync -db test -template ddl-mysql-redshift.vm > test.sql

$ cat test.sql
/*
SQL generated on Thu Jun 04 20:06:51 UTC 2015 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://tungsten01.mydomain.net:3306/test?jdbcCompliantTruncation=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDateType=false
user = sync
dbName = test
*/

CREATE SCHEMA test;

DROP TABLE test.sean;
CREATE TABLE test.sean
(
c1 VARCHAR(256) /* VARCHAR(64) */,
id INT,
PRIMARY KEY (id)
);

Lastly apply those scripts to your redshift database:


$ psql
dwh# \i test_stage.sql
dwh# \i test.sql

Read: Are we fast approaching cloud-mageddon?

4. Troubleshoot applier

***
Encountered “Delimiter Not Found” issue

This issue was mysterious, and still is a bit. Here’s what I did to fix it.

First I had an issue with the path, and fixed that:


  "awsS3Path" : "s3://tungstenbucket",

It was causing an interim bucket to be created. But that did not solve things.

OK, so I hacked at this a bit.

Can anyone help me troubleshoot what happened & why?

A. I skipped transactions

I brought the applier back online with this command.


trepctl -service redshift online -skip-seqno 1,1-100

B. I did lots of inserts & deletes on MySQL

I then did about 200 of these:


mysql> insert into test.sean values ('hi there', 20);
mysql> delete from test.sean where id = 20;

C. Now seeing data


dwh=# select * from test.sean;
                 c1                  | id 
-------------------------------------+----
 working......                       | 25
 hello sean i have an exclamation !! | 27
 hello sean i came from mysql        | 26
(3 rows)

I also set cleanUpS3Files to false. Now I’m seeing files like this:
test-sean-417.csv
test-sean-418.csv
test-sean-419.csv
test-sean-420.csv

So that indicates all those INSERTs followed by DELETEs cleaned things up.

Also: How do I find entrepreneurial focus?

5. Test data & table changes

B. Tested INSERT

At first the csv files were getting cleaned up by Tungsten. I added this option to the s3-config-redshift.json file:


"cleanUpS3Files" : "false",

Then the files are kept around so we can review them. An insert record shows up in S3 like this:


"I","417","1","2015-06-05 17:44:35.000","tungsten new csv file? ","33",null

C. Tested DELETE

A DELETE record shows up in S3 like this:


"D","419","1","2015-06-05 17:45:48.000",null,"26",null

D. Tested UPDATE

An UPDATE record shows up in S3 like this:


"D","420","1","2015-06-05 17:48:55.000",null,"31",null
"I","420","2","2015-06-05 17:48:55.000","changed message text for redshift+tungsten update","31",null
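
Since every staged row starts with the same four Tungsten columns (opcode, seqno, row id and commit timestamp), the files are easy to inspect with a script. Here is a minimal Python sketch that parses the sample lines above. The ChangeRow name and the helper functions are my own, made up for illustration.

```python
import csv
from typing import List, NamedTuple, Optional


class ChangeRow(NamedTuple):
    opcode: str                   # "I" for insert, "D" for delete
    seqno: int                    # Tungsten sequence number
    row_id: int                   # position of the row within that seqno
    committed: str                # commit timestamp, left as text
    columns: List[Optional[str]]  # the table's own columns


def parse_staged_csv(lines) -> List[ChangeRow]:
    """Parse staged CSV lines into ChangeRow records."""
    rows = []
    for fields in csv.reader(lines):
        if not fields:
            continue
        opcode, seqno, row_id, committed, *rest = fields
        # Simplification: csv.reader cannot tell an unquoted null from the
        # quoted string "null", so both become None here.
        columns = [None if f == "null" else f for f in rest]
        rows.append(ChangeRow(opcode, int(seqno), int(row_id),
                              committed, columns))
    return rows


def delete_insert_seqnos(rows) -> List[int]:
    """Seqnos holding both a D and an I row, the shape an UPDATE takes above."""
    deletes = {r.seqno for r in rows if r.opcode == "D"}
    inserts = {r.seqno for r in rows if r.opcode == "I"}
    return sorted(deletes & inserts)


SAMPLE = [
    '"I","417","1","2015-06-05 17:44:35.000","tungsten new csv file? ","33",null',
    '"D","419","1","2015-06-05 17:45:48.000",null,"26",null',
    '"D","420","1","2015-06-05 17:48:55.000",null,"31",null',
    '"I","420","2","2015-06-05 17:48:55.000","changed message text for redshift+tungsten update","31",null',
]

if __name__ == "__main__":
    parsed = parse_staged_csv(SAMPLE)
    for row in parsed:
        print(row.opcode, row.seqno, row.columns)
    print("likely updates:", delete_insert_seqnos(parsed))
```

A seqno with both a delete and an insert is only a hint of an UPDATE; a transaction that really did delete one row and insert another would look the same.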

A. Tested ALTER TABLE

As mentioned previously, this is *NOT* supported. However after doing the ALTER, the applier does *NOT* go offline. Also there are no errors. That’s because Tungsten does not support these and will filter them in a heterogeneous environment.

The applier *DOES* go offline, after you try a new INSERT. That’s because it gets a new record for INSERT that doesn’t match.

“trepctl status” shows the following:

pendingExceptionMessage: CSV loading failed: schema=test table=sean CSV file=/tmp/staging/redshift/staging0/test-sean-413.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Load into table 'stage_xxx_sean' failed. Check 'stl_load_errors' system table for details. (../../tungsten-replicator//samples/scripts/batch/redshift.js#145)

To get going again, I added the new column to both the base table & the stage table on Redshift:

redshift# alter table test.sean add column c3 integer default null;

redshift# alter table test.stage_xxx_sean add column c3 integer default null;

Then I brought the applier back online:

$ trepctl -service redshift online

Then check the status. It should say ONLINE for state.


$ trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysqld-bin.000022:0000000000000566;-1
appliedLastSeqno : 424
appliedLatency : 300585.739
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : redshift
currentEventId : NONE
currentTimeMillis : 1433878195573
dataServerHost : my-dw.aaaa.us-east-1.redshift.amazonaws.com
extensions :
host : my-dw.aaaa.us-east-1.redshift.amazonaws.com
latestEpochNumber : 0
masterConnectUri : thl://tungsten01.mydomain.net:2112/
masterListenUri : null
maximumStoredSeqNo : 424
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://tungsten01.mydomain.net:2112/
relativeLatency : 304511.573
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : redshift
serviceType : local
simpleServiceName : redshift
siteName : default
sourceId : my-dw.aaaa.us-east-1.redshift.amazonaws.com
state : ONLINE
timeInStateSeconds : 351940.007
timezone : GMT
transitioningTo :
uptimeSeconds : 600921.759
useSSLConnection : false
version : Tungsten Replicator 4.0.0 build 18
Finished status command...
$
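
If you want to watch for the applier dropping offline without eyeballing this output, the name/value layout is simple to parse. The following Python sketch is my own assumption about how one might do that, using a trimmed copy of the output above; in practice you would feed it the text captured from trepctl status.

```python
def parse_status(text: str) -> dict:
    """Turn 'name : value' lines from trepctl status into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # banner and separator lines carry no colon
        status[name.strip()] = value.strip()
    return status


def is_healthy(status: dict) -> bool:
    """True when the applier is ONLINE and reports no pending error."""
    return (status.get("state") == "ONLINE"
            and status.get("pendingError") == "NONE")


SAMPLE_STATUS = """Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysqld-bin.000022:0000000000000566;-1
appliedLastSeqno : 424
masterConnectUri : thl://tungsten01.mydomain.net:2112/
pendingError : NONE
pendingExceptionMessage: NONE
state : ONLINE
Finished status command...
"""

if __name__ == "__main__":
    parsed = parse_status(SAMPLE_STATUS)
    print(parsed["state"], parsed["appliedLastSeqno"], is_healthy(parsed))
```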

Lastly, let’s see what’s in the table. Fire up the psql shell and take a look:


dwh=# select * from test.sean;
                        c1                         | id | c3
---------------------------------------------------+----+----
 working......                                     | 25 |
 hello sean i have an exclamation !!               | 27 |
 hello will i break?                               | 30 |
 some more records                                 | 32 |
 tungsten new csv file?                            | 33 |
 another tungsten csv file?                        | 34 |
 changed message text for redshift+tungsten update | 31 |
(7 rows)

Also: Was Fred Wilson wrong about Apple?

Business Agility at AWS re:Invent

Also find Sean Hull’s ramblings on twitter @hullsean.

Although I couldn’t be in Vegas to attend re:Invent, there is so much online it’s almost better than being at the conference: an ongoing live stream of keynotes and sessions, plus an archived collection on YouTube.

The big wins

You may have heard of all the great things that Amazon or cloud computing can do, but I thought Andy Jassy summarized these nicely in these six points.

1. Replace capex with opex
2. lower total costs of ownership
3. no guessing about capacity
4. encourage agility & innovation
5. differentiation
6. global from the start

Redshift

By far the biggest announcement at the show is Amazon’s new Redshift product. It is a fully managed data warehouse solution that scales to petabytes in its cloud. Currently two business intelligence tools are supported, namely Jaspersoft and MicroStrategy.

In 2003 Amazon was a 5 billion dollar company. Today AWS adds that much infrastructure capacity to its availability zones every day!

Reduced prices by 25% for S3

As a lot of folks know, Amazon has always been about cheaper prices. That model has been disruptive in the book selling industry, and in a huge way in the infrastructure and datacenter industry. As more customers sign up, economies of scale mean they can offer the same hardware & services for lower prices.

With that they’re announcing lower prices for S3 by a whopping 25%. To me this speaks to their continuing push to dominate the market by driving prices downward.

Amazon’s Channel on Youtube

If you weren’t able to attend the conference, or want to recap some highlights you might have missed, they have put up a great AWS Channel on Youtube.

Some of the speakers include Sharon Chiarella, VP, Mechanical Turk; Glenn Hazard, CEO, Xceedium; Todd Barr, CMO of Alfresco; Bright Fulton, Operations for Swipely; Colin Percival, FreeBSD Developer; Ted Dunning, Chief Application Architect of MapR Technologies; James Broberg, CTO & Founder of MetaCDN; Mitchell Garnaat, Sr. Engineer; David Etue, Vice President, SafeNet; and Mike Culver, Sr. Consultant, to name just a few.

Read this far? Grab our Scalable Startups for more tips and special content.