Categories
All Database Management Database Migrations Database Operations Redshift

How to build an operational datastore on AWS with S3 & Redshift

via GIPHY

You’re building your data warehouse, and getting data into Redshift. You’ve got your ETL pipeline running, and presentation layer talking to the warehouse. Great.

But how to get access to that source data? Wouldn’t it be nice if that was close by too?

Join 35,000 others and follow Sean Hull on twitter @hullsean.

It may be you have 10-zillion rows of source data and don’t want or need to get all of that into Redshift and keep it there. But it would be nice to have access to it when you do.

Enter EXTERNAL tables, aka Spectrum. Now you can keep all your raw data in S3, an in place operational datastore of data before it’s been reworked and transformed. Use SQL to access it right where it sits.

Get all the advantages of lifecycle management in S3, and don’t pay all the redshift costs for data you don’t need all the time. Cool!

Let’s see how it works.

What is an EXTERNAL table?

Spectrum is Amazon’s rebranding of an old database technology called EXTERNAL TABLES. Back in the 90’s Oracle pioneered this work, allowing you to essentially map a CSV file, that sits outside the database proper. This means you can query all that juicy data sitting in flat files. Cool!

Athena allows you to query this stuff as a service, native to AWS. Spectrum allows you to create those external tables inside of Redshift.

Also: Top serverless interview questions for hiring aws lambda experts

Give Redshift permissions

Go into IAM and create a new role called “SeanSpectrumRole”. Assign the policy AmazonS3ReadOnlyPolicy. It looks like this:


{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:Get*",
"s3:List*"
],
"Resource": "*"
}
]
}

If you’re using the dashboard you just pick the policy from the named list. However if you’re using CloudFormation, you’ll use the code above.

Now navigate your aws console to the Redshift dashboard, click clusters, and click the checkbox for your cluster. Probably there’s only one.

Now click the “Manage IAM Roles” button, and a dialog should popup.. Select the role you created earlier, SeanSpectrumRole. Then click “Apply Changes”.

The beauty of the AWS world is that servers themselves can have API permissions. In this case we gave the redshift cluster or server itself, access to S3 for our use below!

Related: Which engineering roles are in greatest demand?

Create your spectrum schema

First you must create a spectrum schema. Here’s the syntax:


create external schema spectrum
from data catalog
database 'sean'
region 'us-east-1'
iam_role 'arn:aws:iam::9999999999999:role/SeanSpectrumRole';

Read: Can on-demand consulting save startups time & money?

Upload your data to S3 bucket

Here we create an s3 bucket called sean_spectrum, then upload one csv file named sean_numbers.txt.


$ aws s3api create-bucket --bucket sean_spectrum --region us-east-1
{
"Location": "/sean_spectrum"
}
$ cd spectrum/
$ cat sean_numbers.txt
21,Dr.,Who,44-22-55-77-88
35,Bat,Man,317-222-4777
15,Wonder,Woman,999-324-7878
99,Storm,Cloud,367-399-6767
75,Marvel,Girl,222-333-9595
32,Quick,Silver,22-33-77-99
12,Scarlet,Witch,23-35-47-555
$ aws s3 cp sean_numbers.txt s3://sean_spectrum/
upload: ./sean_numbers.txt to s3://sean_spectrum/sean_numbers.txt
$ aws s3 ls s3://sean_spectrum/
2017-05-18 20:28:41 193 sean_numbers.txt
$

Note the names. The table name won’t turn out to be sean_numbers. It will be called sean_spectrum, and all files inside that directory will be queried. So make sure they have consistent formats!

Also: 30 questions to ask a serverless fanboy

Create & query your external table

Here’s how you create your external table. Note this is just a map to data. The data is still stored in S3. it is not brought into Redshift except to slice, dice & present.


mydb=# create external table spectrum_schema.sean_numbers(
id int,
fname string,
lname string,
phone string)
row format delimited
fields terminated by ','
stored as textfile
location 's3://sean_spectrum/';

Here’s how you query it:


mydb=# select * from spectrum_schema.sean_numbers order by id;
id | fname | lname | phone
----------------+---------------
12 | Scarlet | Witch | 23-35-47-555
15 | Wonder | Woman | 999-324-7878
21 | Dr. | Who | 44-22-55-77-88
32 | Quick | Silver | 22-33-77-99
35 | Bat | Man | 317-222-4777
75 | Marvel | Girl | 222-333-9595
99 | Storm | Cloud | 367-399-6767

Cool. We reordered data read from an S3 file!!!

Although you can’t create a view over a redshift table *AND* an S3 external table, you can query them together.

So for example if I have a table in redshift with addresses, I can join them together:

mydb=# select a.id, a.fname, a.lname, b.address from spectrum_schema.sean_numbers a, sean_addresses b
where a.id = b.id order by id;

id | fname | lname | phone | address
----------------+----------------------------
12 | Scarlet | Witch | 23-35-47-555 | 10 main st
15 | Wonder | Woman | 999-324-7878 | 25 center st
21 | Dr. | Who | 44-22-55-77-88 | 32 broadway
32 | Quick | Silver | 22-33-77-99 | 1 first st
35 | Bat | Man | 317-222-4777 | 99 west st
75 | Marvel | Girl | 222-333-9595 | 66 East Ave
99 | Storm | Cloud | 367-399-6767 | 50 North st

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Categories
Data Database Management Devops NoSQL Redshift

A roughneck walk down database alley

via GIPHY

I was just responding to some Disqus comments on a recent blog post. Admittedly it had a provocative title Will SQL databases just die already. What do you think?

Join 34,000 others and follow Sean Hull on twitter @hullsean.

A reader pointed out that some No-SQL databases do support joins. Huh? My face contorts in total confusion. How? Why?

For years SQL was misunderstood & unloved

Relational databases have been around for decades. 43 years according to the StackExchange article. That’s a lot of years. I’ve spent a few years as a dba, aka database administrator. The role can be distilled down as a herder of sorts. Keep all the data bits in the right boxes with the right labels. A digital librarian, that makes sure the books don’t get lost.

Of course patrons don’t always put books back where they should, and strict rules get put in place to avoid losing that one volume of shakespear in miles of shelves.

In the fast moving world of web + mobile, product is king, and agile rules the day. And anything that can make us more agile also wins. SQL, much maligned & misunderstood, was not one of those things.

Also: Top serverless interview questions for hiring aws lambda experts

NoSQL burst on the scene with much fanfare

With all that pressure, it was no wonder engineers thought, there must be a better way. Then along comes the No SQL database. I mean just the name speaks volumes about the design goal.

We’ll sacrifice anything, just please don’t make me write SQL!

The promises…

1. Never have to deal with pesky SQL that we don’t understand!
2. Interact with the database like any other data structure in our code!
3. Be schemaless! Crotchety Database Administrators be damned!
4. Be distributed. Be everywhere consistent! Be indistinguishable from magic!
5. Always be fast.

In that rush into the abyss, we lost track of durability. And down the rabbithole we went!

Related: Which engineering roles are in greatest demand?

Relational databases tried to be key-value

Then I started hearing about crazy things, like MySQL providing a memcache plugin, so you could use it albeit lightening fast, as a key-value store. You could sidestep that pesky SQL engine, and get right down to the bare metal. But why? Memcache & Redis were already doing that & purpose built. Why indeed?

I started to argue maybe we shouldn’t be muddying the waters. I mean stick with what you know!

Read: Can on-demand consulting save startups time & money?

War was won, success declared

Around this I think was when Mongodb was declaring the war won. We had finally left SQL databases in the dustheap of history. It may or may not have inspired this popular youtube skit…

Also: 30 questions to ask a serverless fanboy

Meanwhile hadoop is losing ground. Bigquery & Redshift both speak SQL

But then something funny started to happen. It seemed there was a backlash against Mongodb. A lot of customers were losing data. (Yep that’s what durability means guys…) And the hype started reversing. Even the mighty hadoop has been losing popularity of late. How long does it take to write an EMR job versus an SQL query. Let’s be honest?

I asked myself, Is Hadoop losing ground to SQL warehouses like Redshift & Bigquery?. I wonder.

Also: What can startups learn from the DYN DNS outage?

NoSQL databases are looking for JOINs?

Recently I bumped into some interesting blog comments & discussions about how Orientdb was trying to add joins to their product.

As certain relational databases try to become No SQL databases, other No SQL databases are trying to add more complex SQL, because well somehow their product is missing something.

Also: What can startups learn from the DYN DNS outage?

Engineering truth versus fashion

43 years is a lot of years. And when we drop all the fashion trends in tech, and the new database du jour, what do we find?

There is room for No SQL databases. Yep. And the do certain things, and solve certain types of problems well. But their not general workhorses, nor can they slice and dice your data however you like. And when you get to that point in your project, you’re going to want to ask interesting questions of your data.

And surely that’s where SQL excels. It ain’t going anywhere, folks!

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Categories
All Data Database Management Database Operations

Will SQL just die already?

With tons of new No-SQL database offerings everyday, developers & architects have a lot of options. Cassandra, Mongodb, Couchdb, Dynamodb & Firebase to name a few.

Join 33,000 others and follow Sean Hull on twitter @hullsean.

What’s more in the data warehouse space, you have Hadoop, which can churn through terabytes of data and get you results back before lunchtime!

So when I stumbled on this article SQL is 43 years old, I was intrigued.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But if later on you want to ask questions like, which users watched this video, which users are active, which users spent $100 in January? These questions may not be possible because NoSQL can’t join those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?

Also: Top serverless interview questions for hiring aws lambda experts

Big Query, Redshift & even Hive speak SQL

I wrote that despite recent popularity in Hadoop, Redshift seems to be eating their lunch. And what would you know, surprise surprise, Amazon’s newish data warehousing solution, speaks SQL! What’s more there’s Apache Hive, which allows you to query Hadoop with, drumroll please… SQL!

Bigquery is the other major bigdata offering from none other than Google. And it too uses SQL!

Related: Which engineering roles are in greatest demand?

Still dominant

If you look at Stackoverflow’s developer survey, you’ll see that SQL is the second most popular language. Why might that be? For one thing it’s simple to learn. Enough that even business users can write simple requests, join & aggregate data.

Read: Can on-demand consulting save startups time & money?

Rugged, Proven & Open

SQL having been around so long is a fairly open standard. Sure there are extensions of it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That’s a win for everybody.

Also: 30 questions to ask a serverless fanboy

Business users can write it

Another under appreciated feature though is that basic queries are easy to write. They don’t require complex syntax like a hadoop job, or your favorite imperative programming language. The queries are readable, almost english-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters