Category Archives: Database Management

How do I migrate my skills to the cloud?



Hi, I’m currently an IT professional and I’m training for AWS Solutions Architect – Associate exam. My question is how to gain some valuable hands-on experience without quitting my well-paying consulting gig I currently have which is not cloud based. I was thinking, perhaps I could do some cloud work part time after I get certified.

Join 38,000 others and follow Sean Hull on twitter @hullsean.


I work in the public sector and the IT contract prohibits the agency from engaging any cloud solutions until the current contract expires in 2019. But I can’t just sit there without using these new skills – I’ll lose them. And if I jump ship I’ll lose $$$ because I don’t have the cloud experience.


Hi George,

Here’s what I’d suggest:

1. Setup your AWS account

A. open aws account, secure with 2FA & create IAM roles

First things first, if you don’t already have one, go sign up. It takes 5 minutes & a credit card.

From there, be sure to enable two-factor authentication. Then stop using your root account! Create a new IAM user with command line & API permissions, and use that to authenticate from now on. You’ll be using the awscli python package.
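If you want to do that last step from the command line, it’s only a handful of commands. This is just a minimal sketch; the user name is made up, and the broad AdministratorAccess policy is only for getting started, so scope it down once you know which services you actually need.

# create an IAM user for day-to-day CLI & API work (name is just an example)
aws iam create-user --user-name deploy-user
aws iam attach-user-policy --user-name deploy-user \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam create-access-key --user-name deploy-user

# install the awscli python package & plug in the new access key
pip install awscli
aws configure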

Also: Is Amazon too big to fail?

2. Automatic deployments

B. plugin a github project
C. setup CI & deployment
D. get comfy with Ansible

Got a pet project on github? If not it’s time to start one. 🙂

You can alternatively use Amazon’s own CodeCommit, which is a drop-in replacement for github and works fine too. Get your code in there.

Next, set up CodeDeploy so that you can deploy that application to your EC2 instance with one command.

But you’re not done yet. Now automate the spinup of the EC2 instance itself with Ansible. If you’re comfortable with shell scripts, or other operational tools, the learning curve should be pretty easy for you.
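If it helps to see what Ansible is doing for you, here’s roughly the raw call an Ansible ec2 task (or a shell script) wraps. Only a sketch; the AMI, key pair & security group ids are placeholders you’d swap for your own.

# launch a single small instance (ids & names below are placeholders)
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t2.micro \
    --key-name my-keypair \
    --security-group-ids sg-0123456789abcdef0 \
    --count 1

# confirm it's up, then hand it off to codedeploy or your playbook
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running"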

Read: Is AWS too complex for small dev teams? The growing demand for Cloud SRE

3. Clusters

E. play around with kubernetes or docker swarm

Both of these technologies allow you to spin up & control a fleet of containers running on a fixed set of EC2 instances. You may also use Amazon ECS, which is a similar type of offering.
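To get a feel for the ECS flavor, the cluster itself is one command; EC2 instances running the ECS agent then register themselves into it. The cluster name here is just an example.

# create an empty cluster, then see which instances have joined it
aws ecs create-cluster --cluster-name demo-cluster
aws ecs list-container-instances --cluster demo-cluster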

Related: How to deploy on EC2 with Vagrant

4. Version your infrastructure

F. use terraform or cloudformation to manage your aws objects
G. put your terraform code into version control
H. test rollback & roll foward infrastructure changes

Amazon provides CloudFormation as its foundational templating system. You can use JSON or YAML. Basically you can describe every object in your account, from IAM users to VPCs, RDS instances, EC2 & lambda code, all inside of a single template file.

Terraform is a sort of cloud-agnostic version of the same thing. It’s also more feature rich & has got a huge following. All reasons to consider it.

Once you’ve got all your objects in templates, you can check these files into your git or CodeCommit repository. Then updating infrastructure is like updating any other piece of code. Now you’re self-documenting, and you can roll forward & backward if you make a mistake!
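In practice the day-to-day loop looks something like this. A rough sketch with terraform; the repo name is made up, and the same idea applies to CloudFormation templates.

# infrastructure lives in git, just like application code
git clone git@github.com:me/infra.git && cd infra

terraform init      # pull down the aws provider
terraform plan      # preview exactly what will change
terraform apply     # create or update the objects in your *.tf files

# made a mistake? revert the commit & apply again to roll back
git revert HEAD
terraform apply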

Related: How I use terraform & composer to automate wordpress on AWS

5. Learn serverless

I. get familiar with lambda & use serverless framework

Building applications & deploying only code is the newest paradigm shift happening in cloud computing. On Amazon you have Lambda, on Google you have Cloud Functions.
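Here’s a minimal taste of Lambda from the command line. The function name, runtime & IAM role ARN are placeholders; handler.py would just contain a lambda_handler function.

# package a tiny python handler & create the function
zip function.zip handler.py

aws lambda create-function \
    --function-name hello-world \
    --runtime python3.9 \
    --handler handler.lambda_handler \
    --zip-file fileb://function.zip \
    --role arn:aws:iam::123456789012:role/lambda-exec-role

# invoke it & read the response
aws lambda invoke --function-name hello-world out.json && cat out.json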

Related: 30 questions to ask a serverless fanboy

6. Bonus: database skills

J. Learn RDS – MySQL, Postgres, Aurora, Oracle, SQLServer etc

For a bonus page on your resume, dig into Amazon Relational Database Service or RDS. The platform supports various databases, so try out the ones you know already first. You’ll find that there are a few surprises. I wrote Is upgrading RDS like a sh*t storm that will not end?. That was after a very frustrating weekend upgrading a customer’s production RDS instance. 🙂
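Spinning up a small instance to experiment with is quick. A sketch only; the identifier, instance class & password are placeholders, and remember to delete it when you’re done to avoid charges.

# a small managed MySQL instance to play with
aws rds create-db-instance \
    --db-instance-identifier mydb \
    --db-instance-class db.t2.micro \
    --engine mysql \
    --master-username admin \
    --master-user-password 'change-me-now' \
    --allocated-storage 20

# watch it come up, then grab the endpoint to connect to
aws rds describe-db-instances --db-instance-identifier mydb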

Related: Is Amazon about to disrupt your data warehouse?

7. Bonus: Data warehousing

K. Redshift, Spectrum, Glue, Quicksight etc

If you’re interested in the data side of the house, there is a *LOT* happening at AWS. From their spectrum technology which allows you to keep most of your data in S3 and still query it, to Glue which provides an ETL as a service offering.

You can also use a world-class columnar storage database called Redshift. This is purpose built for reporting & batch jobs. It’s not going to meet your transactional web-backend needs, but it will bring up those Tableau reports blazingly fast!
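And spinning up a small Redshift cluster to poke at is just as simple. Again a sketch; the identifier, node type & password below are placeholders.

# a two node cluster for experimenting
aws redshift create-cluster \
    --cluster-identifier sean-dw \
    --node-type dc2.large \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'Change-me-1'

# wait for it to become available, then note the endpoint
aws redshift describe-clusters --cluster-identifier sean-dw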

Related: Is Amazon about to disrupt your data warehouse?

8. Now go find that cloud deployment job!


With the above under your belt there’s plenty of work for you. There is tons of demand right now for this stuff.

Did you learn all that? You’ve now got very, very in-demand skills. The recruiters will be chomping at the bit. Update those buzzwords (I mean keywords). This will help match you with folks looking for someone just like you!

Related: Why I don’t work with recruiters

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to interview an amazon database expert


Amazon releases a new database offering every other day. It sure isn’t easy to keep up.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

Let’s say you’re hiring a devops engineer & you want to suss out their database knowledge. Or you’re hiring a professional services firm or freelance consultant. Whatever the case, you’ll need to sift through for the best people. Here’s how.

Also: How to interview an AWS expert

What database does Amazon support for caching?

Caching is a popular way to speed up access to your backend database. Put Amazon’s elasticache behind your webserver, and you can reduce load on your database by 90%. Nice!

The two types that Amazon supports are Memcache & Redis. Memcache is historically more popular. These days Redis seems a clear winner. It’s faster, and can maintain your cached data between restarts. That will save you, I promise!
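If a candidate can sketch how they’d stand one up, even better. Something like this, where the cluster id & node type are placeholders:

# a single node redis cache
aws elasticache create-cache-cluster \
    --cache-cluster-id my-redis \
    --engine redis \
    --cache-node-type cache.t2.micro \
    --num-cache-nodes 1

# fetch the endpoint your webservers will talk to
aws elasticache describe-cache-clusters \
    --cache-cluster-id my-redis --show-cache-node-info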

Also: Is AWS too complex for small dev teams?

How can I store big data in AWS?

Amazon’s data warehouse offering is called Redshift. I wrote Why is everyone suddenly talking about Redshift?. Why indeed!

When you’re doing large reports for your business intelligence team, you don’t want to bog down your backend relational database. Redshift is purpose built for this use case.

I’ve seen a report that took over 8 hours in MySQL return in under 60 seconds in Redshift!

A new offering is Amazon Spectrum. This tech is super cool. Load up all your data into S3, in standard CSV format. Then without even loading it into Redshift, you can query the S3 data directly. This is super useful. Firstly because S3 is 1/10th the price. But also because it allows you to stage your data before loading into Redshift itself. Goodbye Google Big Query! I talked about spectrum here.

Related: Which engineering roles are in greatest demand?

What relational database options are there on Amazon?

Amazon supports a number of options through its Relational Database Service or RDS. These are managed databases, which means less work on your DBA’s shoulders. It may also make upgrades slower and harder, with more downtime, but you get what you pay for.

There are a lot of platforms available. As you might guess MySQL & Postgres are there. Great! Even better you can use MariaDB if that’s your favorite. You can also go with Aurora which is Amazon’s own home-brew drop in replacement for MySQL that promises greater durability and some speedups.

If you’re a glutton for punishment, you can even get Oracle & SQL Server working on RDS. Very nice!

Read: Can on-demand consulting save startups time & money?

Does AWS have a NoSQL database solution?

If NoSQL is to your taste, Amazon has DynamoDB. I haven’t seen a lot of large production applications using it, but according to one detailed critique I’ve read, the way Amazon scales nodes & data I/O is bound to run into real performance problems. What he describes makes a lot of sense.

That said it can be a great way to get you up and running quickly.
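Getting up and running quickly looks roughly like this. Table & attribute names are made up; the pay-per-request billing mode saves you from guessing capacity up front.

# a users table keyed on id
aws dynamodb create-table \
    --table-name users \
    --attribute-definitions AttributeName=id,AttributeType=S \
    --key-schema AttributeName=id,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

# write & read a single item
aws dynamodb put-item --table-name users \
    --item '{"id": {"S": "42"}, "fname": {"S": "Wonder"}, "lname": {"S": "Woman"}}'
aws dynamodb get-item --table-name users --key '{"id": {"S": "42"}}'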

Read: Can on-demand consulting save startups time & money?

How do I do ETL & migrate data to AWS?

Let’s be honest, Amazon wants to make this really easy. The quicker & simpler it is to get your data there, the more you’ll buy!

Amazon’s Database Migration Service or DMS allows you to configure your old database as a data source, then choose an Amazon db solution as the destination, then just turn on the spigot and pump your data in!
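Under the hood that spigot is a few pieces: a replication instance, a source endpoint & a target endpoint, plus a task to tie them together. A rough sketch; every identifier, hostname & password here is a placeholder.

aws dms create-replication-instance \
    --replication-instance-identifier my-repl \
    --replication-instance-class dms.t2.medium

aws dms create-endpoint --endpoint-identifier src-mysql \
    --endpoint-type source --engine-name mysql \
    --server-name old-db.example.com --port 3306 \
    --username repl --password 'change-me'

aws dms create-endpoint --endpoint-identifier tgt-aurora \
    --endpoint-type target --engine-name aurora \
    --server-name mydb.cluster-xxxx.us-east-1.rds.amazonaws.com --port 3306 \
    --username admin --password 'change-me'

# then create-replication-task wires the source to the target & starts pumping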

ETL is extract, transform and load, data warehouse terminology for slicing and dicing data before you load it into your warehouse. Many of today’s warehouses are being built with the data lake model, because databases like Redshift have gotten so damn fast. That model means you stage all your source data as-is in your warehouse, then build views & summary tables as needed to speed up queries & reports. Even better, you might look at a tool like xplenty.

Amazon’s new offering is called Glue (see also Five ways to get data into Amazon Redshift). This solution is purpose built for creating a powerful data pipeline, complete with python code to do transformations.

Read: Is data your dirty little secret?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to build an operational datastore on AWS with S3 & Redshift


You’re building your data warehouse, and getting data into Redshift. You’ve got your ETL pipeline running, and presentation layer talking to the warehouse. Great.

But how to get access to that source data? Wouldn’t it be nice if that was close by too?

Join 35,000 others and follow Sean Hull on twitter @hullsean.

It may be you have 10-zillion rows of source data and don’t want or need to get all of that into Redshift and keep it there. But it would be nice to have access to it when you do.

Enter EXTERNAL tables, aka Spectrum. Now you can keep all your raw data in S3, an in place operational datastore of data before it’s been reworked and transformed. Use SQL to access it right where it sits.

Get all the advantages of lifecycle management in S3, and don’t pay all the redshift costs for data you don’t need all the time. Cool!

Let’s see how it works.

What is an EXTERNAL table?

Spectrum is Amazon’s rebranding of an old database technology called EXTERNAL TABLES. Back in the 90’s, Oracle pioneered this work, allowing you to essentially map a CSV file that sits outside the database proper. This means you can query all that juicy data sitting in flat files. Cool!

Athena allows you to query this stuff as a service, native to AWS. Spectrum allows you to create those external tables inside of Redshift.

Also: Top serverless interview questions for hiring aws lambda experts

Give Redshift permissions

Go into IAM and create a new role called “SeanSpectrumRole”. Assign the managed policy AmazonS3ReadOnlyAccess. It looks like this:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    }
  ]
}

If you’re using the dashboard you just pick the policy from the named list. However if you’re using CloudFormation, you’ll use the code above.

Now navigate your aws console to the Redshift dashboard, click clusters, and click the checkbox for your cluster. Probably there’s only one.

Now click the “Manage IAM Roles” button, and a dialog should pop up. Select the role you created earlier, SeanSpectrumRole. Then click “Apply Changes”.

The beauty of the AWS world is that servers themselves can have API permissions. In this case we gave the Redshift cluster, the server itself, access to S3 for our use below!

Related: Which engineering roles are in greatest demand?

Create your spectrum schema

First you must create a spectrum schema. Here’s the syntax:


create external schema spectrum
from data catalog
database 'sean'
region 'us-east-1'
iam_role 'arn:aws:iam::9999999999999:role/SeanSpectrumRole';

Read: Can on-demand consulting save startups time & money?

Upload your data to S3 bucket

Here we create an s3 bucket called sean_spectrum, then upload one csv file named sean_numbers.txt.


$ aws s3api create-bucket --bucket sean_spectrum --region us-east-1
{
"Location": "/sean_spectrum"
}
$ cd spectrum/
$ cat sean_numbers.txt
21,Dr.,Who,44-22-55-77-88
35,Bat,Man,317-222-4777
15,Wonder,Woman,999-324-7878
99,Storm,Cloud,367-399-6767
75,Marvel,Girl,222-333-9595
32,Quick,Silver,22-33-77-99
12,Scarlet,Witch,23-35-47-555
$ aws s3 cp sean_numbers.txt s3://sean_spectrum/
upload: ./sean_numbers.txt to s3://sean_spectrum/sean_numbers.txt
$ aws s3 ls s3://sean_spectrum/
2017-05-18 20:28:41 193 sean_numbers.txt
$

Note the names. The external table doesn’t map to the single sean_numbers.txt file. Its location is the whole sean_spectrum bucket, and every file sitting under that prefix will be queried. So make sure they all have consistent formats!

Also: 30 questions to ask a serverless fanboy

Create & query your external table

Here’s how you create your external table. Note this is just a map to data. The data is still stored in S3. It is not brought into Redshift except to slice, dice & present.


mydb=# create external table spectrum_schema.sean_numbers(
id int,
fname string,
lname string,
phone string)
row format delimited
fields terminated by ','
stored as textfile
location 's3://sean_spectrum/';

Here’s how you query it:


mydb=# select * from spectrum_schema.sean_numbers order by id;
 id | fname   | lname  | phone
----+---------+--------+----------------
 12 | Scarlet | Witch  | 23-35-47-555
 15 | Wonder  | Woman  | 999-324-7878
 21 | Dr.     | Who    | 44-22-55-77-88
 32 | Quick   | Silver | 22-33-77-99
 35 | Bat     | Man    | 317-222-4777
 75 | Marvel  | Girl   | 222-333-9595
 99 | Storm   | Cloud  | 367-399-6767

Cool. We reordered data read from an S3 file!!!

Although you can’t create a view over a redshift table *AND* an S3 external table, you can query them together.

So for example if I have a table in redshift with addresses, I can join them together:

mydb=# select a.id, a.fname, a.lname, a.phone, b.address
       from spectrum_schema.sean_numbers a, sean_addresses b
       where a.id = b.id order by id;

 id | fname   | lname  | phone          | address
----+---------+--------+----------------+--------------
 12 | Scarlet | Witch  | 23-35-47-555   | 10 main st
 15 | Wonder  | Woman  | 999-324-7878   | 25 center st
 21 | Dr.     | Who    | 44-22-55-77-88 | 32 broadway
 32 | Quick   | Silver | 22-33-77-99    | 1 first st
 35 | Bat     | Man    | 317-222-4777   | 99 west st
 75 | Marvel  | Girl   | 222-333-9595   | 66 East Ave
 99 | Storm   | Cloud  | 367-399-6767   | 50 North st

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

A roughneck walk down database alley


I was just responding to some Disqus comments on a recent blog post. Admittedly it had a provocative title Will SQL databases just die already. What do you think?

Join 34,000 others and follow Sean Hull on twitter @hullsean.

A reader pointed out that some No-SQL databases do support joins. Huh? My face contorts in total confusion. How? Why?

For years SQL was misunderstood & unloved

Relational databases have been around for decades. 43 years according to the StackExchange article. That’s a lot of years. I’ve spent a few years as a dba, aka database administrator. The role can be distilled down as a herder of sorts. Keep all the data bits in the right boxes with the right labels. A digital librarian, that makes sure the books don’t get lost.

Of course patrons don’t always put books back where they should, and strict rules get put in place to avoid losing that one volume of Shakespeare in miles of shelves.

In the fast moving world of web + mobile, product is king, and agile rules the day. And anything that can make us more agile also wins. SQL, much maligned & misunderstood, was not one of those things.

Also: Top serverless interview questions for hiring aws lambda experts

NoSQL burst on the scene with much fanfare

With all that pressure, it was no wonder engineers thought, there must be a better way. Then along comes the No SQL database. I mean just the name speaks volumes about the design goal.

We’ll sacrifice anything, just please don’t make me write SQL!

The promises…

1. Never have to deal with pesky SQL that we don’t understand!
2. Interact with the database like any other data structure in our code!
3. Be schemaless! Crotchety Database Administrators be damned!
4. Be distributed. Be everywhere consistent! Be indistinguishable from magic!
5. Always be fast.

In that rush into the abyss, we lost track of durability. And down the rabbithole we went!

Related: Which engineering roles are in greatest demand?

Relational databases tried to be key-value

Then I started hearing about crazy things, like MySQL providing a memcache plugin, so you could use it as a lightning fast key-value store. You could sidestep that pesky SQL engine, and get right down to the bare metal. But why? Memcache & Redis were already doing that & purpose built. Why indeed?

I started to argue maybe we shouldn’t be muddying the waters. I mean stick with what you know!

Read: Can on-demand consulting save startups time & money?

War was won, success declared

Around this time, I think, was when Mongodb declared the war won. We had finally left SQL databases in the dust heap of history. It may or may not have inspired this popular youtube skit…

Also: 30 questions to ask a serverless fanboy

Meanwhile hadoop is losing ground. Bigquery & Redshift both speak SQL

But then something funny started to happen. It seemed there was a backlash against Mongodb. A lot of customers were losing data. (Yep that’s what durability means guys…) And the hype started reversing. Even the mighty hadoop has been losing popularity of late. How long does it take to write an EMR job versus a SQL query? Let’s be honest.

I asked myself, Is Hadoop losing ground to SQL warehouses like Redshift & Bigquery?. I wonder.

Also: What can startups learn from the DYN DNS outage?

NoSQL databases are looking for JOINs?

Recently I bumped into some interesting blog comments & discussions about how Orientdb was trying to add joins to their product.

As certain relational databases try to become No SQL databases, other No SQL databases are trying to add more complex SQL, because well somehow their product is missing something.

Also: What can startups learn from the DYN DNS outage?

Engineering truth versus fashion

43 years is a lot of years. And when we drop all the fashion trends in tech, and the new database du jour, what do we find?

There is room for No SQL databases. Yep. And they do certain things, and solve certain types of problems, well. But they’re not general workhorses, nor can they slice and dice your data however you like. And when you get to that point in your project, you’re going to want to ask interesting questions of your data.

And surely that’s where SQL excels. It ain’t going anywhere, folks!

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Will SQL just die already?

With tons of new No-SQL database offerings every day, developers & architects have a lot of options. Cassandra, Mongodb, Couchdb, Dynamodb & Firebase to name a few.

Join 33,000 others and follow Sean Hull on twitter @hullsean.

What’s more in the data warehouse space, you have Hadoop, which can churn through terabytes of data and get you results back before lunchtime!

So when I stumbled on this article SQL is 43 years old, I was intrigued.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But what if later on you want to ask questions like: which users watched this video, which users are active, which users spent $100 in January? These questions may not be possible to answer, because NoSQL can’t join those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?
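For example, that “who spent $100 in January” question is one join & one aggregate in SQL. The table & column names here are made up purely for illustration:

psql -h mydb.example.com -d app -c "
  select u.id, u.email, sum(p.amount) as january_spend
  from users u
  join payments p on p.user_id = u.id
  where p.paid_at >= '2017-01-01' and p.paid_at < '2017-02-01'
  group by u.id, u.email
  having sum(p.amount) >= 100;"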

Also: Top serverless interview questions for hiring aws lambda experts

Big Query, Redshift & even Hive speak SQL

I wrote that despite Hadoop’s recent popularity, Redshift seems to be eating its lunch. And what do you know, surprise surprise, Amazon’s newish data warehousing solution speaks SQL! What’s more there’s Apache Hive, which allows you to query Hadoop with, drumroll please… SQL!

Bigquery is the other major bigdata offering from none other than Google. And it too uses SQL!

Related: Which engineering roles are in greatest demand?

Still dominant

If you look at Stackoverflow’s developer survey, you’ll see that SQL is the second most popular language. Why might that be? For one thing it’s simple to learn. Enough that even business users can write simple requests, join & aggregate data.

Read: Can on-demand consulting save startups time & money?

Rugged, Proven & Open

SQL having been around so long is a fairly open standard. Sure there are extensions of it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That’s a win for everybody.

Also: 30 questions to ask a serverless fanboy

Business users can write it

Another underappreciated feature, though, is that basic queries are easy to write. They don’t require complex syntax like a hadoop job, or your favorite imperative programming language. The queries are readable, almost English-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!

Also: What can startups learn from the DYN DNS outage?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

30 questions to ask a serverless fanboy

Everyone is hot under the collar again. So-called serverless or no-ops services are popping up everywhere, allowing you to deploy “just code” into the cloud. Not only won’t you have to log in to a server, you won’t even have to know it’s there.

As cloud events occur, such as a file upload or a hit on an http endpoint, your code runs. Behind the scenes, through the magic of containers & autoscaling, Amazon & others are able to provision in milliseconds.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Pretty cool. Yes even as it outsources the operations role to invisible teams behind Amazon Lambda, Google Cloud Functions or Webtask it’s also making companies more agile, and allowing startup innovation to happen even faster.

Believe it or not I’m a fan too.

That said I thought it would be fun to poke a hole in the bubble, and throw some criticisms at the technology. I mean going serverless today is still bleeding edge, and not everyone is cut out to be a pioneer!

With that, here’s 30 questions to throw at the serverless fanboys (and ladies!)…

1. Security

o Are you comfortable removing the barrier around your database?
o With more services, there is more surface area. How do you prevent malicious code?
o How do you know your vendor is doing security right?
o How transparent is your vendor about vulnerabilities?

Also: Myth of five nines – Why high availability is overrated

2. Testing

o How do you do integration testing with multiple vendor service components?
o How do you test your API Gateway configurations?
o Is there a way to version control changes to API Gateway configs?
o Can Terraform or CloudFormation help with this?
o How do you do load testing with a third party db backend?
o Are your QA tests hitting the prod backend db?
o Can you easily create & destroy test dbs?

Related: 5 ways to move data to amazon redshift

3. Management

o How do you do zero downtime deployments with Lambda?
o Is there a way to deploy functions in groups, all at once?
o How do you manage vendor lock-in at the monitoring & tools level but also code & services?
o How do you mitigate your vendors maintenance? Downtime? Upgrades?
o How do you plan for move to alternate vendor? Database import & export may not be ideal, plus code & infrastructure would need to be duplicated.
o How do you manage a third party service for authentication? What are the pros & cons there?
o What are the pros & cons of using a service-based backend database?
o How do you manage redundancy of code when every client needs to talk to backend db?

Read: Why were dev & ops siloed job roles?

4. Monitoring & debugging

o How do you build a third-party monitoring tool? Where are the APIs?
o When you’re down, is it your app or a system-wide problem?
o Where is the New Relic for Lambda?
o How do you degrade gracefully when using multiple vendors?
o How do you monitor execution duration so your function doesn’t fail unexpectedly?
o How do you monitor your account wide limits so dev deploy doesn’t take down production?

Also: Are SQL databases dead?

5. Performance

o How do you handle startup latency?
o How do you optimize code for mobile?
o Does battery life preclude a large codebase on client?
o How do you do caching on server when each invocation resets everything?
o How do you do database connection pooling?

Also: Is Amazon too big to fail?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

What products & improvements are new on AWS?

Amazon is releasing new products & services to its global cloud compute network at a rate that has all of our heads spinning.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Here’s new stuff worth mentioning around databases & data.

1. For ETL – AWS GLUE

Moving data from your transactional MySQL or Aurora database to your reporting database isn’t always easy.

In the past you could use a service like xplenty or Alooma.

Now Amazon themselves are getting into the ETL game, providing a new service called Glue.

Also: RDS or Mysql? 10 use cases

2. Query S3 with Athena

Chances are if you’re using AWS for anything, you’ve got data in S3. And wouldn’t it be nice to pick that apart and dig through it, where it sits?

Oracle had a feature called “external tables” and MySQL had something similar. Now Amazon is offering that natively within its own cloud universe. Thanks to some tricky lambda code, you can now do that too. Don’t worry how they did it, because it’s been packaged into a nice easy service for your use!
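Kicking the tires is just a couple of calls. A sketch only; the bucket, database & table names are placeholders, and the first command hands back a QueryExecutionId you use to fetch results.

# run a query against csv files sitting in S3
aws athena start-query-execution \
    --query-string "select count(*) from mydb.weblogs where status = '500'" \
    --result-configuration OutputLocation=s3://my-athena-results/

# pull back the rows using the id returned above
aws athena get-query-results \
    --query-execution-id 11111111-2222-3333-4444-555555555555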

Related: When you have to take the fall – consulting war stories

3. Business Intelligence with QuickSight

If you’re a data driven startup, and who isn’t these days, you’re going to have a business unit building reports. Tableau or Looker may be in your wheelhouse.

Amazon is obviously seeing the opportunity here, and competing with their own partners. Check out Amazon Quicksight for details.

Read: Is upgrading RDS like a sh*t storm that will not end?

4. Expanded RDS

RDS is obviously a very popular offering. And even though zero downtime is very hard to achieve with RDS, you’ll save plenty on DBAs and admins you don’t have to hire!

If you hadn’t heard, there is now MariaDB support. And with it, there’s a migration path from MySQL to MariaDB as well.

Using MariaDB may bring you performance advantages & improvements. But RDS may mitigate this by productizing & standardizing things.

You can also now move encrypted snapshots across regions. In my view this isn’t really a new feature, but rather fixing something that was broken before. The previous limitation was really more a symptom of their global network of data centers, than any built feature per se.

Also: Is the difference between dev & ops a four-letter word?

5. Expanded Redshift

As I’ve blogged before, everybody is excited about Redshift these days.

Amazon has introduced some new features.

o better loading of sorted data

This is done behind the scenes to load data quickly, and keep it stored efficiently. No more vacuuming after a big load!

o user & database rate limiting

Limit connections on a per user or per database level. Useful!

o storage estimates on analyze

When you perform the analyze command, you can get storage information so it’s easier to decide datatypes & compression type. Nifty!

Also: Is Redshift outpacing Hadoop as the big data warehouse for startups?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

5 essential tools for Redshift DBA’s

redshift dba tools

Like every other startup, you’ve built everything on AWS or are moving there quickly.

So it makes sense & is an easy win to build your analytics & bigdata engine on AWS too. Enter Redshift to the rescue.

Unlike the old days of Oracle where you had career DBAs to handle your data & hold the keys to the data kingdom, these days nobody is managing the data. What makes matters worse is the documentation is weak, and it hasn’t been out long enough to warrant good books on the topic.

But there’s hope!

Here are five great tools to help you in your day-to-day management of Redshift

Join 28,000 others and follow Sean Hull on twitter @hullsean.

The Github Redshift Utils Page has the whole package, while I link to individual scripts below.

1. Slow queries

Just like MySQL, Postgres & Oracle, your database performance & responsiveness lives & dies by the SQL that you write. And just like all those databases, you want to look at the slowest queries, to know where to tune. Spend your time on the worst offenders first.

Redshift Slow Queries Report.

Also: 5 Ways to get data into Redshift

2. Fragmented Tables

You add data, you delete data. And just like all the other relational databases we know & love, this process leaves gaps. New data is still added at the high water mark, and full table scans still read those empty blocks.

The solution is the vacuum command, but which tables should we run it on?

Run this script & find out which tables need maintenance.

Redshift Display tables needing vacuum
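Once you know which tables are fragmented, the fix is a vacuum & analyze from any SQL session. The cluster endpoint, database & table name below are placeholders:

psql -h sean-dw.xxxx.us-east-1.redshift.amazonaws.com -p 5439 -d analytics \
     -c "vacuum full orders; analyze orders;"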

Related: Is Redshift outpacing Hadoop for Startups & bigdata

3. Analyze Schema Compression

Getting your schema compression right on your data in Redshift turns out to be essential to performance too. Each datatype has a recommended compression type that works best for it. You can let Redshift decide when it first loads data, but it doesn’t always guess correctly.

This tool will help you analyze your data, and set compression types properly.

Redshift analyze schema compression tool

Read: 5 Reasons to move data to Amazon Redshift

4. Audit Users

Once you have your Redshift cluster up & running, you’ll set up users for applications to connect as. Each will have different privileges for objects & schemas in your database. You’re auditing those, right? 🙂

This script will output who can read & write what. Useful to ensure your application isn’t overly loose with permissions. Shoot for least privileges where possible.

Redshift Audit User Privileges.

Redshift Audit Table Privileges

Also: Is data your dirty little secret

5. Lambda Data Loader

Want to automatically load data into Redshift? Download this handy Lambda based tool to respond to S3 events. Whenever a new file appears, its data will get loaded into Redshift. Complete with metadata configuration control in Dynamodb.

Redshift Lambda Data Loader

Also: Why is everybody suddenly talking about Amazon Redshift?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Five ways to get your data into Redshift

redshift data pipeline

Everybody is hot under the collar these days over Redshift. I heard one customer say a query that took 10 hours before now finishes in under a minute. Without modification. When businesses see a 600 times speedup, that can change the way they do business.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options is RJ Metrics Pipeline. Set up a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.

rjmetrics table config screen

RJM uses a primary or unique key to figure out which rows have changed. Well that’s not completely true. Only if you’re using incremental refresh. If you’re using complete refresh, then it just selects all the data & replaces it each time.

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. xplenty

xplenty is really a lot more than just a sync tool. It’s a full featured ETL system. Want to avoid writing tons of python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well xplenty is made to allow building ETL systems without code.

xplenty main dashboard

It’s a bit complex to set up at first, but very full featured. It is the DIY developer’s or DBA’s tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

alooma dashboard

Although it’s a bit more complex to setup than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen, means your data team can be very proactive. This is great for the business.

What’s more there is a python based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL, this is a way to do that. Also you can send notifications, or requeue events. All this means you get state of the art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat it’s own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It’s a very viable option. I wrote at length about that solution, so I’ll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose Tech Blog posted about their project which can move data from postgres to redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you’re using MySQL as source db you’ll need to change code. One thing I found in there was Perl & Sed commands to transform your source schema CREATE & ALTER statements into Redshift compatible ones. That in itself is worth a look.

Lambda to the rescue

The awslabs github team has put together a lambda-based redshift loader. This might be just what you need. Remember though that you’ll need to deliver your source data in CSV files to S3 on the regular. So you’ll need some method to dump it. But if you have that half of the equation, this is ideal.
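That other half of the equation can be as crude as a cron job that dumps a table to CSV & copies it up. A naive sketch; hosts, credentials & bucket are placeholders, and the tab-to-comma translation won’t survive fields that contain commas.

# dump a table as csv & drop it in S3 for the lambda loader to pick up
mysql -h old-db.example.com -u repl -p'change-me' app \
      -e "select * from orders" --batch --skip-column-names \
  | tr '\t' ',' > orders.csv

aws s3 cp orders.csv s3://my-redshift-input/orders/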

Data Migration Service or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Why is everyone suddenly talking about Amazon Redshift?

par accel redshift

It seems like all I hear these days is Redshift, Redshift, Redshift!

I met up with a recruiter today. We talked about this & that. The usual. Then when he came to the topic of technology he said,

“yeah it seems as though suddenly everybody is looking for Redshift & Snowflake”

Although, as I blogged about before, I don’t work with recruiters, I do learn a lot from them.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Luckily I got to cut my teeth on Redshift about a year ago. I was the senior database engineer managing MySQL on Amazon RDS, and they wanted to build a data warehouse. Bingo!

Here’s the big takeaway from my discussion today. Recruiters have their fingers on the pulse!

1. We need an Amazon expert

Here’s what else I’m hearing everywhere. “We’re migrating to AWS, can you help?” Complexity & confusion around the new virtual networking, moving into the cloud, and tuning applications & components to get the same performance as before. All of these are real & present needs for firms.

Related: Is data your dirty little secret?

2. We need a Redshift expert

Amazon built Redshift on ParAccel technology, a blazingly fast warehouse. It uses SQL. It looks like Postgres, and handles petabytes. You read that right, petabytes! It’s so good in fact that it seems a lot of folks are now dumping Hadoop.

Incredible as that sounds, Redshift is delivering *that* kind of speed on that kind of big data. Wow! What’s more you skip the whole Hadoop cycle of write, test, debug, schedule job, fix bugs, and stir. With SQL you bring back the iterative agile process!

Read: 5 cloud challenges I’m thinking about today

3. We need a Hadoop expert

Ok, for those enterprises who aren’t sold on Redshift yet, there is still a ton of Hadoop out there. And for good reason.

Apache Spark is also getting really big now too. It’s an easier to manage successor to Hadoop, based around many of the same concepts.

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

4. We need strong Python skills

Python is everywhere. Amazon’s command line interface is python based. You see it everywhere. If it’s not in your wheelhouse get it there!

Also: Why Dropbox didn’t have to fail

5. We need communicators

Another interesting thing the recruiter said

“I was surprised & a little shocked that you suggested we meet for coffee. Most developers are hard to get out to have a conversation with.”

Good communicators are as in-demand as ever! Being able, and happy, to talk with people who aren’t deeply technical, and to distill complex technical jargon into plain English. And do that with a smile & enjoy it too?

That’s special!

Also: Should we be muddying the waters? Use cases for MySQL & Mongodb

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters