Join 28,000 others and follow Sean Hull on twitter @hullsean.
Amazon is rolling out new database offerings at a rapid clip. Last month I wondered: did MySQL and MongoDB just have a beautiful baby called Aurora? Another offering that's been out for a while is the data warehouse called Redshift.
1. Old-fashioned SQL interface
OK, yes, Redshift can support petabyte-scale databases, and that in itself is staggering to consider. But just after you digest that little fact, you'll probably discover that it's also SQL compatible.
This is a godsend. It means the platform can leverage all of the analytical tools already in the marketplace, ones your organization is already familiar with. Many, such as Looker and Chartio, are already certified on Redshift.
Also: Are SQL Databases Dead?
2. Lots of ways to load data
After you build your first cluster, the first question on your mind will be, “How do I get my data into RedShift?” Fortunately there are lots of ways.
Stage in S3 & use COPY
Everyone using AWS is already familiar with S3, and Redshift uses it as a staging ground. Create a bucket for your CSV or other data files, then parallel load them with the special COPY command.
For those coming from the Oracle world, this is like SQL*Loader's direct path load, which bypasses the SQL engine and writes data blocks straight into the datafiles. Very fast, very parallel.
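A minimal sketch of what that looks like. The table name, bucket path, and credentials below are placeholders, not real values:

```sql
-- Parallel-load gzipped CSV files staged in S3 into a Redshift table.
-- Redshift splits the work across the files and across your compute nodes.
COPY events
FROM 's3://my-staging-bucket/events/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV
GZIP;
```

Because COPY loads in parallel, splitting your data into multiple files in the bucket generally loads faster than one giant file.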
AWS Data Pipeline
Some folks are leveraging the AWS Data Pipeline to copy MySQL tables straight into RedShift.
FlyData for Amazon MySQL RDS
I’m in the process of evaluating FlyData Sync. This is a service-based solution which connects to your Amazon RDS for MySQL instance, captures binlog data much as Oracle’s GoldenGate does, and ships it across to Redshift for you.
If you have constantly changing data, this may be ideal: the basic COPY command is essentially a one-shot data load, while a sync service keeps capturing changes as they happen.
Read: What is ETL and why is it important?
3. Very fast or very big nodes
There are essentially two types of compute nodes for Redshift. DW2 nodes are dense compute, running on SSD. As we all know, solid state drives are very fast and bring huge disk I/O benefits. Perfect for a data warehouse. They cost about $1.50/TB per hour.
The second type is DW1, the so-called dense storage nodes. These can scale up to a petabyte of storage. They run on traditional spinning disks, so they aren’t SSD fast. They’re also around $0.50/TB per year, so a lot cheaper.
Amazon recommends that if you have less than 1TB of data, you go with Dense Compute (DW2). That makes sense, as you get SSD speed right out of the gate.
Related: What is a data warehouse?
4. distkeys, sortkeys & compression
The nice thing about NoSQL databases is that you don’t have to jump through all the hoops of sharding your data, as you do with a traditional database like MySQL. Redshift is the same: distribution is supported right out of the box.
When you create tables you’ll choose a distkey. You can only have one per table, so be sure it’s the column you join on most often; a user_id or timestamp field, perhaps, would make sense. You’ll choose a diststyle as well: ALL means keep an entire copy of the table on each node, KEY means distribute rows based on the distkey, and EVEN, the default, spreads rows evenly across the nodes in round-robin fashion.
Redshift also has sortkeys. You can have more than one of these on a table, and they are something like b-tree indexes: they determine the physical sort order of the data on disk, which speeds up sorts and range-restricted scans.
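Putting the two together, a table definition might look like this. The table and column names here are hypothetical, chosen so the join column carries the distkey and the timestamp carries the sortkey:

```sql
-- Distribute rows by user_id (the most common join column)
-- and sort by the timestamp to speed up date-range queries.
CREATE TABLE page_views (
    user_id    INTEGER NOT NULL,
    viewed_at  TIMESTAMP NOT NULL,
    url        VARCHAR(2048)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (viewed_at);
```

Choosing the distkey carefully matters, since rows with the same key land on the same node, letting joins on that column happen locally instead of shuffling data across the cluster.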
Check: 8 Questions to ask an AWS expert
5. Compression, defragmentation & constraints
Being a columnar database, Redshift also supports column encodings, or compression. LZO is often used for varchar columns; bytedict and runlength are also common. One way to determine these is to load a sample of data, say 100,000 rows. From there you can run ANALYZE COMPRESSION on the table, and Redshift will make recommendations.
A much easier way, however, is to use the COPY command with COMPUPDATE ON. During the initial load, this tells Redshift to analyze the data as it is loaded and set the column compression types itself. This is by far the most streamlined approach.
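Both approaches, sketched against the same hypothetical table and bucket as before:

```sql
-- Option 1: load a sample first, then ask Redshift for encoding advice.
ANALYZE COMPRESSION page_views;

-- Option 2: let the initial COPY pick column encodings automatically.
COPY page_views
FROM 's3://my-staging-bucket/page_views/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV
COMPUPDATE ON;
```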
Redshift also supports table constraints, although it doesn’t use them to restrict data. Sounds useless, right? Except they do inform the optimizer. What does that mean? If you know you have a primary key on an id column, tell Redshift about it. No, it won’t enforce the constraint, but since your source database is enforcing it, you can pass that information along to Redshift to help it optimize queries.
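For example, with a hypothetical users table you'd declare the key just as you would on the source database, on the understanding that Redshift treats it as a planner hint rather than a rule:

```sql
-- Redshift won't reject duplicate ids here; the constraint is
-- informational only, but the query optimizer takes it into account.
CREATE TABLE users (
    id    INTEGER NOT NULL,
    email VARCHAR(255),
    PRIMARY KEY (id)
);
```

The flip side: since nothing is enforced, it's on you to make sure the loads really do keep the column unique, or query results can be wrong.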
You’ll also find some of the defragmentation options from Oracle and MySQL present in Redshift. There is VACUUM, which reorganizes the table and resets the high water mark while the table remains online for updates. And then there is the deep copy, which is more thorough and faster, but locks the table while it runs.
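Here's a sketch of both, again using the hypothetical page_views table. The deep copy pattern rebuilds the table by copying rows into a fresh one and swapping names:

```sql
-- Online: reclaim space and re-sort rows; writes can continue.
VACUUM page_views;

-- Offline "deep copy": faster for heavily fragmented tables,
-- but the table is unavailable to writers during the rebuild.
CREATE TABLE page_views_new (LIKE page_views);
INSERT INTO page_views_new (SELECT * FROM page_views);
DROP TABLE page_views;
ALTER TABLE page_views_new RENAME TO page_views;
```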
Related: Is Oracle killing MySQL?