Category Archives: Database Management

5 essential tools for Redshift DBAs

redshift dba tools

Like every other startup, you’ve built everything on AWS or are moving there quickly.

So it makes sense & is an easy win to build your analytics & bigdata engine on AWS too. Enter Redshift to the rescue.

Unlike the old days of Oracle where you had career DBAs to handle your data & hold the keys to the data kingdom, these days nobody is managing the data. What makes matters worse is the documentation is weak, and it hasn’t been out long enough to warrant good books on the topic.

But there’s hope!

Here are five great tools to help you in your day-to-day management of Redshift

Join 28,000 others and follow Sean Hull on twitter @hullsean.

The Github Redshift Utils Page has the whole package, while I link to individual scripts below.

1. Slow queries

Just like MySQL, Postgres & Oracle, your database performance & responsiveness lives & dies by the SQL that you write. And just like all those databases, you want to look at the slowest queries, to know where to tune. Spend your time on the worst offenders first.

Redshift Slow Queries Report.
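If you just want a quick look before grabbing the full script, a bare-bones query against the stl_query system table gets you most of the way. This is only a sketch of the idea, not the utility itself; the seven-day window and the LIMIT are arbitrary choices:

SELECT q.query,
       TRIM(q.database) AS db,
       TRIM(q.querytxt) AS sql_text,
       q.starttime,
       DATEDIFF(seconds, q.starttime, q.endtime) AS duration_secs
FROM stl_query q
WHERE q.starttime > DATEADD(day, -7, GETDATE())   -- only look at the last week
ORDER BY duration_secs DESC
LIMIT 20;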

Also: 5 Ways to get data into Redshift

2. Fragmented Tables

You add data, you delete data. And just like all the other relational databases we know & love, this process leaves gaps. New data is still added at the high water mark, and full table scans still read those empty blocks.

The solution is the vacuum command, but which tables should we run it on?

Run this script & find out which tables need maintenance.

Redshift Display tables needing vacuum
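For a rough first pass before running the full script, svv_table_info shows how unsorted each table is and how stale its stats are. A minimal sketch, where the 10 percent thresholds are just an illustrative cutoff and myschema.mytable stands in for whatever it flags:

SELECT "schema", "table", tbl_rows, unsorted, stats_off
FROM svv_table_info
WHERE unsorted > 10 OR stats_off > 10
ORDER BY unsorted DESC;

-- then, for each offender:
VACUUM myschema.mytable;
ANALYZE myschema.mytable;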

Related: Is Redshift outpacing Hadoop for Startups & bigdata

3. Analyze Schema Compression

Getting your schema compression right on your data in Redshift turns out to be essential to performance too. Each datatype has a recommended compression type that works best for it. You can let Redshift decide when it first loads data, but it doesn’t always guess correctly.

This tool will help you analyze your data, and set compression types properly.

Redshift analyze schema compression tool
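Redshift can also do this analysis natively with ANALYZE COMPRESSION, which reports a recommended encoding per column. A quick sketch with a hypothetical table and columns, since encodings can only be applied when you (re)create the table:

-- sample 10,000 rows and report suggested encodings per column
ANALYZE COMPRESSION myschema.events COMPROWS 10000;

-- apply the suggestions in a fresh table definition, for example:
CREATE TABLE myschema.events_new (
  event_id   BIGINT       ENCODE delta,
  event_type VARCHAR(64)  ENCODE lzo,
  created_at TIMESTAMP    ENCODE delta32k
);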

Read: 5 Reasons to move data to Amazon Redshift

4. Audit Users

Once you have your Redshift cluster up & running, you’ll set up users for applications to connect as. Each will have different privileges for objects & schemas in your database. You’re auditing those, right? :)

This script will output who can read & write what. Useful to ensure your application isn’t overly loose with permissions. Shoot for least privileges where possible.

Redshift Audit User Privileges.

Redshift Audit Table Privileges
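For a quick ad-hoc check between runs of those scripts, Redshift inherits Postgres’ HAS_TABLE_PRIVILEGE function, so a rough sketch like this works too (here limited to the public schema):

SELECT u.usename,
       t.tablename,
       HAS_TABLE_PRIVILEGE(u.usename, 'public.' || t.tablename, 'select') AS can_select,
       HAS_TABLE_PRIVILEGE(u.usename, 'public.' || t.tablename, 'insert') AS can_insert,
       HAS_TABLE_PRIVILEGE(u.usename, 'public.' || t.tablename, 'delete') AS can_delete
FROM pg_user u
CROSS JOIN pg_tables t
WHERE t.schemaname = 'public'
ORDER BY u.usename, t.tablename;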

Also: Is data your dirty little secret

5. Lambda Data Loader

Want to automatically load data into Redshift? Download this handy Lambda-based tool to respond to S3 events. Whenever a new file appears, its data will get loaded into Redshift. Complete with metadata configuration control in DynamoDB.

Redshift Lambda Data Loader
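Under the hood a loader like this boils down to issuing Redshift COPY commands against the new S3 objects. For comparison, a hand-rolled COPY might look like the following, where the bucket, table and IAM role are all made-up names:

COPY analytics.events
FROM 's3://my-bucket/incoming/events.csv.gz'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV
GZIP
TIMEFORMAT 'auto'
MAXERROR 10;   -- tolerate a handful of bad rows rather than failing the whole load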

Also: Why is everybody suddenly talking about Amazon Redshift?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Five ways to get your data into Redshift

redshift data pipeline

Everybody is hot under the collar these days over Redshift. I heard one customer say a query that took 10 hours before now finishes in under a minute. Without modification. When businesses see a 600-times speedup, that can change the way they do business.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options, RJ Metrics Pipeline. Setup a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.
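The grant itself is nothing exotic. RJM hands you the exact statement, but it amounts to something like this, with a placeholder user, host and password rather than their actual grant code:

CREATE USER 'rjmetrics'@'%' IDENTIFIED BY 'use-a-strong-password';
GRANT SELECT ON *.* TO 'rjmetrics'@'%';
FLUSH PRIVILEGES;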

rjmetrics table config screen

RJM uses a primary or unique key to figure out which rows have changed. Well that’s not completely true. Only if you’re using incremental refresh. If you’re using complete refresh, then it just selects all the data & replaces it each time.

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. xplenty

xplenty is really a lot more than just a sync tool. It’s a full featured ETL system. Want to avoid writing tons of python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well xplenty is made to allow building ETL systems without code.

xplenty main dashboard

It’s a bit complex to set up at first, but very full featured. It is the DIY developer’s or DBA’s tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport?

Is Data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

alooma dashboard

Although it’s a bit more complex to setup than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen, means your data team can be very proactive. This is great for the business.

What’s more there is a python based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL, this is a way to do that. Also you can send notifications, or requeue events. All this means you get state of the art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat its own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It’s a very viable option. I wrote at length about that solution, so I’ll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose Tech Blog posted about their project which can move data from Postgres to Redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you’re using MySQL as the source db you’ll need to change code. One thing I found in there was Perl & sed commands to transform your source schema CREATE & ALTER statements into Redshift-compatible ones. That in itself is worth a look.
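To give a flavor of what that transformation has to do, here’s an illustrative before & after on a made-up table (not taken from the Donors Choose project itself):

-- MySQL source DDL:
CREATE TABLE orders (
  id INT AUTO_INCREMENT PRIMARY KEY,
  note TEXT,
  created_at DATETIME
) ENGINE=InnoDB;

-- Roughly what the Redshift-compatible version needs to become:
CREATE TABLE orders (
  id INTEGER IDENTITY(1,1),
  note VARCHAR(65535),
  created_at TIMESTAMP,
  PRIMARY KEY (id)
);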

Lambda to the rescue

The awslabs github team has put together a lambda-based redshift loader. This might be just what you need. Remember though that you’ll need to deliver your source data in CSV files to S3 on the regular. So you’ll need some method to dump it. But if you have that half of the equation, this is ideal.
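If MySQL is your source, one blunt way to produce those CSVs is SELECT … INTO OUTFILE on a nightly cron, then push the file up to S3. This only works on a self-managed instance, since RDS won’t let you write to its filesystem, and the table, columns and path here are all hypothetical:

SELECT id, note, created_at
INTO OUTFILE '/tmp/orders_daily.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM orders
WHERE created_at >= CURDATE() - INTERVAL 1 DAY;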

Data Migration Service or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Why is everyone suddenly talking about Amazon Redshift?

par accel redshift

It seems like all I hear these days is Redshift, Redshift, Redshift!

I met up with a recruiter today. We talked about this & that. The usual. Then when he came to the topic of technology he said,

“yeah it seems as though suddenly everybody is looking for Redshift & Snowflake”

As I blogged about before, although I don’t work with recruiters, I learn a lot from them.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Luckily I got to cut my teeth on Redshift about a year ago. I was the senior database engineer managing Amazon & MySQL RDS, and they wanted to build a data warehouse. Bingo!

Here’s the big takeaway from my discussion today. Recruiters have their fingers on the pulse!

1. We need an Amazon expert

Here’s what else I’m hearing everywhere. “We’re migrating to AWS, can you help?” Complexity & confusion around the new virtual networking, moving into the cloud, and tuning applications & components to get the same performance as before. All of these are real & present needs for firms.

Related: Is data your dirty little secret?

2. We need a Redshift expert

Amazon bought Par Accel, a bleedingly fast warehouse. It uses SQL. It looks like Postgres, and handles petabytes. You read that right, petabytes! It’s so good in fact that it seems a lot of folks are now dumping Hadoop.

Incredible as that sounds, Redshift is delivering *that* kind of speed on that kind of big data. Wow! What’s more you skip the whole Hadoop cycle of write, test, debug, schedule job, fix bugs, and stir. With SQL you bring back the iterative agile process!

Read: 5 cloud challenges I’m thinking about today

3. We need a Hadoop expert

Ok, for those enterprises who aren’t sold on Redshift yet, there is still a ton of Hadoop out there. And for good reason.

Apache Spark is also getting really big now too. It’s an easier-to-manage successor to Hadoop, based around many of the same concepts.

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

4. We need strong Python skills

Python is everywhere. Amazon’s command line interface is python based. You see it everywhere. If it’s not in your wheelhouse get it there!

Also: Why Dropbox didn’t have to fail

5. We need communicators

Another interesting thing the recruiter said:

“I was surprised & a little shocked that you suggested we meet for coffee. Most developers are hard to get out to have a conversation with.”

Good communicators are as in-demand as ever! Being able and happy to talk with people who aren’t deeply technical, and to distill complex technical jargon into plain English. And do that with a smile & enjoy it too?

That’s special!

Also: Should we be muddying the waters? Use cases for MySQL & Mongodb

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is data your dirty little secret?

data comparison cloud

While I was fumbling for the dictionary to figure out what polyglot persistence was, the CTO had decided to build a warehouse on Redshift.

“Everybody’s moving data there.” He declared. I looked on quizzically.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

“That’s a very new database engine”, I chimed in. “Let’s do some testing”.

And there began an adventurous ride into the bleeding edge of Amazon’s new data service offerings!

1. The data scientist comes crying

For our transactional database we were using Amazon RDS for MySQL. It’s a great managed service that eliminates some of the headaches of doing it yourself. I wrote about this in RDS or MySQL Use Cases.

We needed some way to get data over to Redshift. We evaluated AWS Data Pipeline, but it wasn’t realtime enough. In a pinch we decided on a service called Flydata. After weeks of effort to get it setup & administered we had it running smoothly.

I have since discovered some pipelining solutions dedicated to Redshift such as Alooma – modern data plumbing, RJMetrics pipeline and Domo. I *did not* manage to get Tungsten working. It supports Redshift on paper, but has a lot of growing up to do.

Until one day the data scientist shows up at my desk. “We have problems in our data on Redshift.” I look back confused. “Are you sure? Can you tell me where you’re seeing that?” I respond.

Also: When hosting data on Amazon turns bloodsport

2. Deleted data reappears in Redshift!

He sends me over some queries, that I rerun myself. I see extra data in Redshift too, data that had been deleted in MySQL. Strange. We dig deeper together trying to figure out what’s happening.

We find that the tables with extra data are child tables of a parent where data was deleted. Imagine Citibank deletes a customer, they also want to delete the records for monthly bills. Otherwise those will just be hanging around, and won’t match up anymore with a parent record for a customer. In real life Citibank probably doesn’t delete like this but it’s a helpful example.

The first thing I do is open a ticket with Flydata. After all we hadn’t gotten any errors logged. Things *must* be running correctly.

After highlighting the severity of the issue, we setup a conference call with Flydata. Digging further they discover the problem. Child table data can’t get deleted on Redshift, because it doesn’t support ON DELETE CASCADE. Wait what?

Turns out Flydata makes use of the MySQL transaction log to move data. In MySQL-to-MySQL replication this works fine because downstream you also have MySQL. It also implements on delete cascade so those child records will get cleaned up correctly. Since Redshift doesn’t have this, there’s no way for Flydata to instruct Redshift what to do. Again I said, wait what?

My surprise wasn’t that a new unproven technology like Redshift had a lot of holes & missing features. My surprise was that Flydata was just silently ignoring the problem. No logged messages to tell the customer about inconsistencies. At least?
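To make the failure mode concrete, here’s a minimal illustration with made-up table names:

-- In MySQL, the child rows disappear automatically:
CREATE TABLE customer (id INT PRIMARY KEY) ENGINE=InnoDB;
CREATE TABLE monthly_bill (
  id INT PRIMARY KEY,
  customer_id INT,
  FOREIGN KEY (customer_id) REFERENCES customer(id) ON DELETE CASCADE
) ENGINE=InnoDB;

DELETE FROM customer WHERE id = 42;   -- also deletes the matching monthly_bill rows

-- On Redshift only the parent-table DELETE arrives from the binlog; foreign keys there
-- are informational and nothing cascades, so monthly_bill quietly keeps the orphans.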

Related: Is Amazon too big to fail?

3. The problem – comparing data

As you might imagine, this is a terrible way to find out about data problems. As the person tasked with moving data between these systems, eyes were on me. My thought was, we chose a service-based solution, so they’ll manage data movement. If there’s a problem, they’ll surely alert us.

From there the conversation became, ok, how do we figure out where all these data differences are? Is it widespread or isolated to a few tables? Can we mitigate it with changed queries? Cleanup on a daily basis? These are some questions that’ll immediately come to mind.

To answer them we needed a way to compare table data across different databases. This is hard to do even within a homogeneous environment where server versions & datatypes are likely to be the same. It is much more complicated when you’re working across heterogeneous systems.

Read: 5 Reasons to move data to amazon redshift

4. Build some way to spot check data

Although this still doesn’t seem a solved problem, there are some tools. One way is to perform checksums on tables & rows. These can then be compared to find differences.

This drew me to Jason Friedman’s table hash script on Github. It can work across MySQL, Postgres & Redshift. Pretty cool stuff if you ask me.
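The basic idea, stripped way down, looks like this. It is not Jason’s script, just the shape of a spot check, with made-up table and column names; matching hash functions, NULL handling and in-flight changes all need care:

-- On MySQL:
SELECT COUNT(*) AS row_count,
       SUM(CAST(CONV(LEFT(MD5(CONCAT_WS('|', id, status)), 8), 16, 10) AS UNSIGNED)) AS chk
FROM billing.invoices;

-- The comparable query on Redshift:
SELECT COUNT(*) AS row_count,
       SUM(STRTOL(LEFT(MD5(CAST(id AS VARCHAR) || '|' || status), 8), 16)) AS chk
FROM billing.invoices;

-- If row_count or chk differ, that table deserves a closer look.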

One problem remains. Databases are always in flux. As such you may find discrepancies based on data that hasn’t been moved yet. Data that’s just changed in the last few minutes.

If you refresh data nightly, you may for example be able to stop a slave to compare data at an instant in time.

Also: Is Redshift outpacing hadoop as the warehouse for startups?

5. The mentality: treat data as a product & monitor

Solving tough problems like these is a work in progress. What it taught me is that:

You should own your data pipeline

This allows you to be vigilant about monitoring, throw errors if data is different, and ultimately treat data as a product. Owning the pipeline will mean you can build monitoring around stages, and automate spot checks on your data.

You won’t get it perfect, but you want to know when it isn’t.

Also: 5 core pieces of the Amazon puzzle to get your project off the ground

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Should we be muddying the relational waters? Use cases for MySQL & Mongodb

muddy sewer tunnels

Many of you know I publish a newsletter monthly. One thing I love about it is that after almost a decade of writing it regularly, the list has grown considerably. And I’m always surprised at how many former colleagues are actually reading it.

So that is a really gratifying thing. Thanks to those who are, and if you’re not already on there, sign up here.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

Recently a CTO & former customer of mine reached out. He asked:

“I’m interested to hear your thoughts on the pros and cons of using a json column to embed data (almost like a poor-man’s Mongo) vs having a separate table for the bill of materials.”

Interesting question. Here are my thoughts.

1. Be clean or muddy?

In my view, these types of design decisions are always about tradeoffs.

The old advice was normalize everything to start off with. Then as you’re performance tuning, denormalize in special cases where it’ll eliminate messy joins. The special cases would then also need to be handled at insert & update time, as you’d have duplication of data.

NoSQL & Mongo introduce all sorts of new choices. So too Postgres with JSON column data.

We know that putting everything in one table will be blazingly fast, as you don’t need to join. So reads will be cached cleanly, and hopefully based on a single ID or a small set of ID lookups.
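To make the CTO’s question concrete, here are the two shapes side by side in Postgres, using a made-up bill-of-materials example:

-- Normalized: a separate bill_of_materials table
CREATE TABLE product (
  id   SERIAL PRIMARY KEY,
  name TEXT
);
CREATE TABLE bill_of_materials (
  product_id INT REFERENCES product(id),
  part_name  TEXT,
  quantity   INT
);

-- "Poor-man's Mongo": embed the same data as a JSON column
CREATE TABLE product_embedded (
  id   SERIAL PRIMARY KEY,
  name TEXT,
  bom  JSONB
);

-- Whole-document reads by id are now one fast lookup, but arbitrary reporting
-- on parts means digging into the blob:
SELECT name FROM product_embedded
WHERE bom @> '[{"part_name": "bolt"}]';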

Also: Is the difference between dev & ops a four-letter word?

2. Go relational

For example you might choose MySQL or Postgres as your datastore, use it for what it’s good at. Keep your data in rows & columns, so you can later query it in arbitrary ways. That’s the discipline up front, and the benefit & beauty down the line.

I would shy away from the NoSQL add-ons that some relational vendors have added, to compete with their newer database cousins. This starts to feel like a fashion contest after a while.

Related: Is automation killing old-school operations?

3. Go distributed

If you’d like to go the NoSQL route, for example you could choose Mongodb. You’ll gain advantages like distributed-out-of-the-box, eventually consistent, and easy coding & integration with applications.

Downside being you’ll have to rearrange and/or pipeline to a relational store or warehouse (Redshift?) if & when you need arbitrary reports from that data. For example there may be new reports & ways of slicing & dicing the data that you can’t foresee right now.

Read: Do managers underestimate operational cost?

4. Hazards of muddy models

Given those two options, I’m erring against the model of muddying the waters. My feeling is that features like JSON blobs in Postgres and the memcache plugin in MySQL are things the db designers are adding to compete in the fashion show with the NoSQL offerings, and still keep you in their ecosystem. But those interfaces within the relational (legacy?) databases are often cumbersome and clunky compared to their NoSQL cousins like Mongo.

Also: Is the difference between dev & ops a four-letter word?

5. Tradeoffs of isolation

Daniel Abadi and Jose Faleiro published an interesting article on a very related topic Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item.

The upshot is that in databases you can choose *TWO* of these three characteristics. Fairness, Isolation & Throughput.

Relational databases sacrifice throughput for fairness & isolation. Distributed databases sacrifice isolation to bring you gains in throughput & horizontal scalability of writes.

That’s a lot of big words to say one simple thing.

You can’t have it both ways.

Also: Is the difference between dev & ops a four-letter word?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Are software benchmarks to blame for Volkswagen’s woes?

volkswagen emissions

With the recent media attention Volkswagen has gotten, a lot of folks are wondering, how could that happen? Aren’t there checks & balances?

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Then I ran across this observation on Todd Hoff’s brilliant blog High Scalability


Is what Volkswagen did really any different than what happens on benchmarks all the time? Cheating and benchmarks go together like a clear conscience and rationalization. Clever subterfuge is part of the software ethos. There are many many (search google) examples. Cars are now software is a slick meme, but that transformation has deep implications. The software culture and the manufacturing culture are radically different.

What exactly does all of this mean?

1. MySQL & Aurora

I was recently chatting with a colleague of mine, Bret Miller, who runs DeepSQL, an adaptive database platform compatible with MySQL. He said:

“We’re actually doing testing against Aurora, but we recently had a couple customers do it independently with more challenging loads. Didn’t see the performance stated in the marketing stuff.”

My response was… “Yeah. Aurora looks to be a win on the HIGH AVAILABILITY front.

On the scalability front, MySQL has certain limitations in its core. So I’m not surprised that the marketing material was grandiose in its promises.

The best way to improve MySQL performance is to tune queries. As you’re writing your application, and when you want to boost performance.”

And so it goes.

Also: Can hosting data on Amazon turn bloodsport?

2. Redis & Memcached

Then I stumbled upon Salvatore Sanfilippo. He is the author of the brilliant & phenomenally successful NoSQL database called Redis. Turns out that another famous blogger was making some sweeping statements about Memcached & Redis and Salvatore ended up defending Redis in a blog post titled Clarifications about Redis.

The topic turned to benchmarks. Which led me to another post titled Why we don’t have benchmarks.

Heard this before?

Related: Did Airbnb have to fail?

3. Is Mongo webscale?

When Mongo was first releasing its benchmarks, the media went wild. And DBAs were scratching their heads. This fabulous video captures the sentiment of the time. :)

Read: Is Amazon too big to fail?

4. Oracle meets David DeWitt

In the ’80s Oracle began to forbid publishing benchmarks. After seeing a research paper by David DeWitt, Larry Ellison amended the end-user license agreement to include the DeWitt Clause. Later other database vendors followed.

It’s easy to see why. Benchmarks by their very nature depend on so many factors. It’s inevitable that those factors will be carefully picked by each platform to highlight its strengths.

Also: Are SQL databases dead?

5. Product versus disks

It is inevitable that all of this continues. When we reside at the level of the business, we perceive the product & its performance through that lens.

When we dive down to the level of disks, buses, cpus, network latency, multi-tenant clouds and a myriad of other factors, the waters are never so clear.

So remember, “your mileage may vary” and “buyer beware” are as true today as they ever were.

Also: 5 Things are toxic to scalability

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

When hosting data on Amazon turns bloodsport

reddit aws outage

There’s a strong trend to automation across the cloud. That’s a great thing for startups because it reduces operational headaches & lets them focus on building products.

Join 31,000 others and follow Sean Hull on twitter @hullsean.

But as that trend begins to touch the database tier, all sorts of complications emerge. Let’s take a look at some of the tradeoffs.

1. Database as a service trend

I was recently reading Baron Schwartz’s article on the trend to database as a service.

I work with a lot of venture backed startups & pay close attention to what’s happening in New York & SF. From where I’m standing I see a similar trend. As automation simplifies management across the application stack, from load balancers to web & search servers, the same advantages are moving to database management.

Also: How to automate MySQL analysis on Amazon RDS

2. How Amazon RDS helps

Amazon’s RDS offers firms a data solution for Oracle & SQL Server as well as MySQL. For those just starting, it offers a long list of advantages.

o quick push-button deployment in minutes
o standardized parameters settings that just work
o ability to scale up or down from the dashboard
o automated backups
o multi-az so you can sleep at night

This brings a huge advantage to startups. Many have a team of developers but aren’t large enough to need an operations team and can’t afford a dedicated database administrator.

Amazon is obviously helping these firms raise the bar. And that’s a good thing.

Related: RDS or MySQL 10 use cases

3. How Amazon RDS hurts

As you get bigger, your needs will grow too. You’ll have tens of millions of customers, and with more customers comes an even higher bar. Zero downtime becomes critical. It’s then that Amazon’s solution starts to become frustrating.

Unpredictable upgrades

MySQL upgrades on RDS are a messy activity. Amazon will restart the instance, backup the instance, perform the upgrade then restart again. Each of these restarts takes a few minutes. The whole operation may have you down for ten minutes. This becomes more frustrating when your hands are completely tied. You don’t know when or what will happen!

When you roll-your-own instance, an upgrade can be performed in a matter of seconds. No instance restarts are necessary and you can monitor the process to know exactly where you are. This is the kind of control you’re going to want if you have millions of customers relying on your site & uptime.

Unnecessary slow restarts

When you apply parameter changes on RDS, some require a MySQL restart. Amazon forces the whole server to restart, increasing this downtime from a few seconds (when you roll your own) to many minutes. And while some parameters can be changed online, Amazon can provoke some strange behavior that is not always predictable.

With the frequency of these types of changes, you’ll quickly grow tired and frustrated with RDS.

EBS Snapshots are not portable

As mentioned above, Amazon uses its standard filesystem snapshot technology to perform backups. While this works well, it can be slow & unpredictable in a multi-tenant environment.

When you roll your own, you can take advantage of xtrabackup, and perform hot backups against your database with zero downtime. This is a real godsend. What’s more they are portable, and can be moved to any other server even ones not hosted in Amazon’s cloud!

Promoting a read-replica is slow too!

One feature that Amazon touts is creating copies or “read replicas” of your data. These are great and can facilitate easy copying of data. However promoting these again brings unnecessary restarts which are slow.

When you roll your own, you can promote a read-replica or read-only slave in seconds. A few seconds can seem invisible to end users, while minutes will be perceived as a real outage or downtime.

Read: Is zero downtime even possible with RDS?

4. Is migration an option?

So what to do? As I mentioned above, there are real advantages to startups deploying their first database. It really does help. I would argue for many it can be a good place to start.

If you’re starting to outgrow RDS and frustrated with the limitations, performance tuning headaches & unneeded downtime, luckily you have options.

Migrating off of RDS onto a physical server can be done in a number of ways.

o slave off of the master

Here you build a MySQL slave on a standard EC2 instance, with your RDS instance as the master. When you’re caught up, bring your site down temporarily. Reset the slave & set it to read-write mode. Then point your webservers at your new EC2 instance and bring the site back up. If done carefully 10 to 20 seconds of downtime should be plenty. (There’s a rough sketch of the cutover below.)

Don’t forget to run through the process with a firedrill first!

o dump & import

Another way to move your data may be MySQLdump. This option would be slower & bring a lot more downtime, but possibly necessary in some cases.
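Going back to the first option, the moment of cutover on the new EC2 instance looks roughly like this. A hedged sketch only: it assumes the slave is already built, fully caught up, and the site is in its brief maintenance window:

SHOW SLAVE STATUS\G           -- confirm Seconds_Behind_Master is 0 before proceeding
STOP SLAVE;
RESET SLAVE ALL;              -- forget the RDS master entirely
SET GLOBAL read_only = OFF;   -- open the new instance for writes
-- now repoint the webservers at this instance and bring the site back up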

Also: 5 Reasons to move data to Amazon Redshift

5. Speed: It’s the database

Fred Wilson says speed is the number one feature of a web application. If customers are frustrated & waiting, they may leave & not come back. On the web it can be everything.

Many firms are rushing to database as a service to simplify administration. While that’s wonderful at the beginning, as you grow performance will become more of a day-to-day concern. And when it does, the database is going to be big on your list of headaches.

Web application performance inevitably involves the database and while it does, your decision to choose database as a service may come into question. Don’t be afraid to bite the bullet and manage things yourself when that time comes.

Also: Is upgrading RDS like a shit-storm that will not end?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Replicate MySQL to Amazon Redshift with Tungsten: The good, the bad & the ugly

tungsten replicator

Heterogeneous replication involves moving data from one database platform to another. This is a complicated endeavor because datatypes, date & time formats, and a whole lot more tend to differ across platforms. In fact it’s so complex many enterprises simply employ a commercial solution to take away the drudgery.

Join 31,000 others and follow Sean Hull on twitter @hullsean.

Enter Tungsten, which supports these types of deployments on platforms such as PostgreSQL, MongoDB, Oracle, Redshift and Vertica. With custom-built appliers the field is infinite!

With that I’ve set out to get things working with Amazon Redshift. If you’re still struggling with the basics check out Wrestling with bears or how I tamed Tungsten Replicator.

1. Connect to redshift

The first thing you’ll need to do is allow your Tungsten boxes to reach Redshift. Seems obvious, but when you’re juggling all these apples & oranges for the first time, it may slip your mind.

Configure your AWS security group to allow tungsten boxes

Get the external IP address of your tungsten box. If it’s in DNS this will work even if ping doesn’t.


$ ping tungsten01.mydomain.net

Add 10.20.30.40/32 to your Redshift security config. I created a special group called Tungsten and added the two tungsten boxes by IP address. That’s because these machines were on a different AWS account. If they’re on the same account, you could allow the entire EC2 group, and be done.

Install psql client

The best way I found to test the connection was psql. Install that:


$ apt-get install postgresql-client

Verify your connection:


$ psql -p 5439 -h 10.20.10.20 --username=root -d dwh

Also: Are SQL Databases dead?

2. Configure S3 access

Tungsten uses S3 heavily to move data into Redshift.

(I outlined this previously in 5 Reasons to move data to Amazon Redshift.

Install s3tools package

Tungsten uses the s3cmd to interface with the Amazon S3 API. Let’s install that:


$ apt-get install s3cmd

Now edit the .s3cfg file of the tungsten user. Change:


[default]
access_key = AAAAAAA
secret_key = BBBBBBB

Lastly edit the Tungsten config file /opt/continuent/share/s3-config-redshift.json. There are four parameters.


{
"awsS3Path" : "s3://tungstenbucket",
"awsAccessKey" : "AAAAAAA",
"awsSecretKey" : "BBBBBBB",
"cleanUpS3Files" : "false"
}

Related: Is Oracle killing MySQL?

3. Create tables on Redshift

In a heterogeneous environment, that is, where source and destination databases are different platforms, Tungsten cannot create tables for you.

It will however, give you a helping hand in the process. Enter the ddlscan tool, which scans the CREATE TABLE statements on your source database, and generates them for your target platform.

For each table in source database, there will be a stage table in Redshift:


$ ddlscan jdbc:mysql://localhost:3306/test -user sync -db test -template ddl-mysql-redshift-staging.vm > test_stage.sql

$ cat test_stage.sql
/*
SQL generated on Thu Jun 04 20:06:45 UTC 2015 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://tungsten01.mydomain.net:3306/test?jdbcCompliantTruncation=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDateType=false
user = sync
dbName = test
*/

CREATE SCHEMA test;

DROP TABLE test.stage_xxx_sean;
CREATE TABLE test.stage_xxx_sean
(
tungsten_opcode CHAR(2),
tungsten_seqno INT,
tungsten_row_id INT,
tungsten_commit_timestamp TIMESTAMP,
c1 VARCHAR(256) /* VARCHAR(64) */,
id INT,
PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)
);

And also a base table in redshift:


$ ddlscan jdbc:mysql://localhost:3306/test -user sync -db test -template ddl-mysql-redshift.vm > test.sql

$ cat test.sql
/*
SQL generated on Thu Jun 04 20:06:51 UTC 2015 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://tungsten01.mydomain.net:3306/test?jdbcCompliantTruncation=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&allowMultiQueries=true&yearIsDateType=false
user = sync
dbName = test
*/

CREATE SCHEMA test;

DROP TABLE test.sean;
CREATE TABLE test.sean
(
c1 VARCHAR(256) /* VARCHAR(64) */,
id INT,
PRIMARY KEY (id)
);

Lastly apply those scripts to your redshift database:


$ psql
dwh# \i file_stage.sql
dwh# \i file_table.sql

Read: Are we fast approaching cloud-mageddon?

4. Troubleshoot applier

Encountered “Delimiter Not Found” issue

This issue was mysterious and remains a bit of a mystery. Here’s what I did to fix it:

I had an issue with the path, but fixed that:


  "awsS3Path" : "s3://tungstenbucket",

It was causing an interim bucket to be created. But that did not solve things.

Ok. So I hacked this a bit.

Can anyone help me troubleshoot what happened & why?

A. I skipped transactions

I brought the applier back online with this command.


trepctl -service redshift online -skip-seqno 1,1-100

B. I did lots of inserts & deletes on MySQL

I then did about 200 of these:


mysql> insert into test.sean values ('hi there', 20);
mysql> delete from test.sean where id = 20;

C. Now seeing data


dwh=# select * from test.sean;
                 c1                  | id 
-------------------------------------+----
 working......                       | 25
 hello sean i have an exclamation !! | 27
 hello sean i came from mysql        | 26
(3 rows)

I also set cleanUpS3Files to false. Now I’m seeing files like this:
test-sean-417.csv
test-sean-418.csv
test-sean-419.csv
test-sean-420.csv

So that indicates all those INSERTs followed by DELETEs cleaned things up.

Also: How do I find entrepreneurial focus?

5. Test data & table changes

A. Tested INSERT

At first the csv files were getting cleanedup by Tungsten. I added this option to s3-config-redshift.json file:


"cleanUpS3Files" : "false",

Then the files are kept around so we can review them. An insert record shows up in S3 like this:


"I","417","1","2015-06-05 17:44:35.000","tungsten new csv file? ","33",null

B. Tested DELETE

A DELETE record shows up in S3 like this:


"D","419","1","2015-06-05 17:45:48.000",null,"26",null

C. Tested UPDATE

An UPDATE record shows up in S3 like this:


"D","420","1","2015-06-05 17:48:55.000",null,"31",null
"I","420","2","2015-06-05 17:48:55.000","changed message text for redshift+tungsten update","31",null

D. Tested ALTER TABLE

As mentioned previously, this is *NOT* supported. However after doing the ALTER, the applier does *NOT* go offline. Also there are no errors. That’s because Tungsten does not support these and will filter them in a heterogeneous environment.

The applier *DOES* go offline, after you try a new INSERT. That’s because it gets a new record for INSERT that doesn’t match.

“trepctl status” shows the following:

pendingExceptionMessage: CSV loading failed: schema=test table=sean CSV file=/tmp/staging/redshift/staging0/test-sean-413.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Load into table ‘stage_xxx_sean’ failed. Check ‘stl_load_errors’ system table for details. (../../tungsten-replicator//samples/scripts/batch/redshift.js#145)

redshift# alter table test.sean add column c3 integer default null;

redshift# alter table test.stage_xxx_sean add column c3 integer default null;

Then I brought the applier back online:

$ trepctl -service redshift online

Then check the status. It should say ONLINE for state.


$ trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysqld-bin.000022:0000000000000566;-1
appliedLastSeqno : 424
appliedLatency : 300585.739
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : redshift
currentEventId : NONE
currentTimeMillis : 1433878195573
dataServerHost : my-dw.aaaa.us-east-1.redshift.amazonaws.com
extensions :
host : my-dw.aaaa.us-east-1.redshift.amazonaws.com
latestEpochNumber : 0
masterConnectUri : thl://tungsten01.mydomain.net:2112/
masterListenUri : null
maximumStoredSeqNo : 424
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://tungsten01.mydomain.net:2112/
relativeLatency : 304511.573
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : redshift
serviceType : local
simpleServiceName : redshift
siteName : default
sourceId : my-dw.aaaa.us-east-1.redshift.amazonaws.com
state : ONLINE
timeInStateSeconds : 351940.007
timezone : GMT
transitioningTo :
uptimeSeconds : 600921.759
useSSLConnection : false
version : Tungsten Replicator 4.0.0 build 18
Finished status command...
$

Lastly, let’s see what’s in the table. Fire up the psql shell and take a look:


dwh=# select * from test.sean;
c1 | id | c3
---------------------------------------------------+----+----
working...... | 25 |
hello sean i have an exclamation !! | 27 |
hello will i break? | 30 |
some more records | 32 |
tungsten new csv file? | 33 |
another tungsten csv file? | 34 |
changed message text for redshift+tungsten update | 31 |
(7 rows)

Also: Was Fred Wilson wrong about Apple?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Wrestling with bears or how I tamed Tungsten replicator

tungsten replicator

I just dove into Tungsten replicator very recently as I need to replicate from Amazon RDS to Redshift. I’d heard a lot of great things about Tungsten, but had yet to really dig in.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

I fetched the binary and began to dig through the docs. Within a day I felt like I was sinking in quicksand. Why was this thing so darn complicated? To my mind unix software is a config file simple.cfg, a logfile simple.log, a daemon simpled. Open some ports & voila you’re cooking.

Unfortunately for beginners, Tungsten took a very different approach. Although they support .ini files for config, they seem to encourage these huge commands: “configure”, which in Tungsten-speak means generate a config file, and “install”, which literally means install the software for you into /opt/continuent.

After various posts to the forums, and a lot of head scratching, I discovered the cookbook. At first I thought this referenced puppet or chef type cookbooks. But it’s something different. It allowed me to setup a very basic tungsten just to see things working.

Tungsten supports all sorts of “topologies”. For this example I am just doing master-slave. There are two nodes and each has mysql running on it. Node1 serves as the master, and runs the tungsten replicator (master node service). And node2 serves as the slave and runs the tungsten applier (slave node service).

Good luck and hope this helps others speedup the learning curve!

1. Download tarball (on master)

Note the forums indicated they may be moving off of google code. So this url may change.


$ cd /tmp
$ wget http://downloads.tungsten-replicator.org/download.php?file=tungsten-replicator-oss-4.0.0-18.tar.gz

Also: Why Airbnb didn’t have to fail

2. Expand the tarball (on master)

Use your vast unix skills to expand the tarball!


$ mv download.php\?file\=tungsten-replicator-oss-4.0.0-18.tar.gz tungsten.tgz
$ tar xvzf tungsten.tgz
$ mv tungsten-replicator-4.0.0-18 stage

Also: Is the difference between dev & ops a four-letter word?

3. Install MySQL (each box)

Hopefully you’ve done this before. Pretty straightforward:


$ apt-get install mysql-client mysql-server

Also: Are SQL Databases Dead?

4. Edit cookbook files (on master)

Inside /tmp/stage/cookbook you’ll need to make a few simple edits:

edit COMMON_NODES.sh

Edit NODE1 and NODE2 and comment out other lines.


export NODE1=ip-172-31-0-117
export NODE2=ip-172-31-1-188

edit NODES_MASTER_SLAVE.sh


export MASTERS=($NODE1)
export SLAVES=($NODE2)

USER_VALUES.sh


export TUNGSTEN_BASE=/opt/continuent
export MY_CNF=/etc/mysql/my.cnf
export DATABASE_USER=sync
export DATABASE_SUPER_USER=root
export DATABASE_PASSWORD=

Also: 5 Things toxic to scalability

5. Create install directory (each box)

The /tmp/stage directory you created above is just a holding ground for the tarball. Continuent in its infinite wisdom wants to install itself. So, create a directory for it:

As root:


$ mkdir /opt/continuent
$ chown tungsten /opt/continuent

Then as tungsten:


$ touch /opt/continuent/testfile.txt

Also: How to deploy on Amazon EC2 with Vagrant?

6. Configure aws security groups

AWS security group permissions are key to getting any of this tungsten stuff working. And there are a lot of moving parts here. Ping is required for various tests, as is MySQL’s port. But you’ll also need to enable Tungsten History Log port 2112 and the replicator ports.

o ping ICMP from your group
o enable inbound 3306 from your group
o enable THL – inbound 2112 from your group
o enable RMI – inbound 10000, 10001 from your group

Also: Howto interview an AWS expert for managers, recruiters & engineers alike

7. Test mysql client from the other (each box)


$ mysql -h ip-172-31-1-188 -u root

$ mysql -h ip-172-31-0-117 -u root

If these hang, verify 3306 in your aws security groups. If you get a mysql error, be sure the “host” is appropriate. Check with:


mysql> select user, host, password from mysql.user where user = 'root';

For more info see Connect to MySQL in the Amazon public cloud.

Also: 5 Reasons to move to amazon redshift

8. Setup ssh auto-login (each box)

You need to be able to login to each box as a user “tungsten” without a password.


$ adduser tungsten
$ ssh-keygen
$ scp .ssh/id_rsa.pub tungsten@ip-172-31-1-188:/home/tungsten/.ssh/authorized_keys

Be sure the authorized_keys file is 600:


$ chmod 600 .ssh/authorized_keys

Also: Did MySQL & Mongo have a beautiful baby called Aurora

9. Install ruby (each box)

You need to have ruby on both boxes.


$ apt-get install ruby

Also: Top interview questions for a MySQL expert – recruiters, managers & candidates

10. Create sync user (each box)

Both of your mysql instances will need a user that tungsten connects as. I called it “sync”.


mysql> create user 'sync'@'%' identified by 'secret';
mysql> grant all privileges on *.* to 'sync'@'%';
mysql> flush privileges;

Note, you may want to use a blank password at this stage, to eliminate that as a potential problem to debug. Also consider the security implications of ‘%’ and consider a subnet wildcard such as ‘10.0.%’.

Also: Myth of five nines why high availability is overrated

11. Enable binary log (each box)

Enable the MySQL binary log with the log_bin parameter. Also ensure that /var/lib/mysql is readable by the tungsten user, either by changing /var/lib to 655 or adding tungsten to the mysql group. Test this as well using less or cat.

Also: What is high availability and why is it important?

12. Update MySQL startup settings (each box)

Fire up your favorite editor and update /etc/mysql/my.cnf settings (both servers). The following are the main ones:


server_id = 1 (server_id=2 for slave)
sync_binlog = 1
max_allowed_packet = 52m
open_files_limit = 65535
innodb_log_file_size=50m
#bind_address localhost (comment it to disable)

Note that changing innodb_log_file_size is tricky. You’ll need to stop mysql, rename old ib_logfile0 to ib_logfile0.old and ib_logfile1 to ib_logfile1.old. Then change the param in my.cnf and start mysql. Otherwise you’ll get errors on startup.

Also: Is the difference between dev & ops a four-letter word?

13. Run the installer (on master)

This step is fairly straightforward. If there are problems in this step, you’ll see ERROR lines in the output. Sift through them and resolve one by one.


$ ./cookbook/install_master_slave

Also: RDS or MySQL – Ten use cases

14. Check tungsten status (each box)

The “trepctl” utility allows you to check the current status. It does a lot more, but for now that’s enough. If you want to make it easier add “/opt/continuent/tungsten/tungsten-replicator/bin” to your path.

Also notice the line “state”. It should be ONLINE. If it says “GOING-ONLINE:SYNCHRONIZING” that likely means you didn’t open up the Tungsten ports. You’ll need both RMI ports 10000 and 10001 as well as THL port 2112. We all know how finicky AWS security groups can be. I’ll leave it to your own exercise to confirm those are open.

ON MASTER


root@ip-172-31-0-117:/opt/continuent/tungsten/tungsten-replicator/bin# ./trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysqld-bin.000005:0000000000000321;235
appliedLastSeqno : 5
appliedLatency : 34824.086
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : cookbook
currentEventId : mysqld-bin.000005:0000000000000321
currentTimeMillis : 1432840323683
dataServerHost : ip-172-31-0-117
extensions :
host : ip-172-31-0-117
latestEpochNumber : 0
masterConnectUri : thl://localhost:/
masterListenUri : thl://ip-172-31-0-117:2112/
maximumStoredSeqNo : 5
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : /var/lib/mysql
relativeLatency : 44450.683
resourcePrecedence : 99
rmiPort : 10000
role : master
seqnoType : java.lang.Long
serviceName : cookbook
serviceType : local
simpleServiceName : cookbook
siteName : default
sourceId : ip-172-31-0-117
state : ONLINE
timeInStateSeconds : 82860.917
timezone : GMT
transitioningTo :
uptimeSeconds : 82861.492
useSSLConnection : false
version : Tungsten Replicator 4.0.0 build 18
Finished status command...
root@ip-172-31-0-117:/opt/continuent/tungsten/tungsten-replicator/bin#

ON SLAVE


tungsten@ip-172-31-1-188:/opt/continuent/tungsten/tungsten-replicator/bin$ ./trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : mysqld-bin.000005:0000000000000321;235
appliedLastSeqno : 5
appliedLatency : 42569.62
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : 1
clusterName : cookbook
currentEventId : NONE
currentTimeMillis : 1432840348936
dataServerHost : ip-172-31-1-188
extensions :
host : ip-172-31-1-188
latestEpochNumber : 0
masterConnectUri : thl://ip-172-31-0-117:2112/
masterListenUri : thl://ip-172-31-1-188:2112/
maximumStoredSeqNo : 5
minimumStoredSeqNo : 0
offlineRequests : NONE
pendingError : NONE
pendingErrorCode : NONE
pendingErrorEventId : NONE
pendingErrorSeqno : -1
pendingExceptionMessage: NONE
pipelineSource : thl://ip-172-31-0-117:2112/
relativeLatency : 44475.936
resourcePrecedence : 99
rmiPort : 10000
role : slave
seqnoType : java.lang.Long
serviceName : cookbook
serviceType : local
simpleServiceName : cookbook
siteName : default
sourceId : ip-172-31-1-188
state : ONLINE
timeInStateSeconds : 1906.6
timezone : GMT
transitioningTo :
uptimeSeconds : 82882.334
useSSLConnection : false
version : Tungsten Replicator 4.0.0 build 18
Finished status command...
tungsten@ip-172-31-1-188:/opt/continuent/tungsten/tungsten-replicator/bin$

Also: Is zero downtime even possible with Amazon RDS?

15. Perform simple test

Create a table on the master mysql node.


mysql master> create database test;
mysql master> create table test.sean (c1 varchar(64));
mysql master> insert into test.sean values ('hi there');
mysql master> insert into test.sean values ('this should show up in tungsten thl file');
mysql master> insert into test.sean values ('new data on tungsten02 THL??');

Verify that the table is on the slave mysql node.


mysql slave> show databases;
mysql slave> select * from test.sean;
+------------------------------------------+
| c1 |
+------------------------------------------+
| hi there |
| this should show up in tungsten thl file |
| new data on tungsten02 THL?? |
+------------------------------------------+
3 rows in set (0.00 sec)

Success!

Also: Why are MySQL & other database experts so hard to find?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is upgrading RDS like a shit-storm that will not end?

aws logo

Join 29,000 others and follow Sean Hull on twitter @hullsean.

Can RDS worsen an outage? That’s another way to think about this question. In my experience, it very clearly increases outages, by tying one or both hands behind your back. Believe me when I say, that is terribly frustrating when you’re putting out fires!

1. Changing Parameters

An everyday occurrence is the need to change database parameters. Want to enable a log? Great, no problem. Except in RDS it becomes a problem! Ok, you’re thinking, why is that?

In regular MySQL, you log in with the shell & issue SET GLOBAL parameter = value; Nice, easy, straightforward. No servers restarting, no nonsense. If the parameter requires a reboot, MySQL will tell you.
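For instance, flipping on the general query log on a self-managed server is one statement, and you can verify it on the spot:

SET GLOBAL general_log = ON;
SHOW GLOBAL VARIABLES LIKE 'general_log%';   -- confirms the change stuck, no restart needed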

In RDS, the process is waay more complex. First you edit a parameter group. You can copy an existing one, or change the one you’re using. If that parameter group applies to many servers, be careful!

Ok, what next? Now you APPLY that new parameter group. You can do so immediately, or during the next maintenance window. Here’s the tricky part. Is Amazon going to restart my instance? That’s something your boss or manager will surely ask you. Well you might think it would only do so if the parameter in question required it. But I tried to enable the general log recently and Amazon tells me the status of “pending-reboot”. This change shouldn’t require that! I’m sitting there scared Amazon might suddenly decide to reboot a production server for no reason!

This is where you feel you’ve lost control. You can dig through docs all you want, but you can’t ever say for sure if a managed service will behave predictably. There’s already more layers of software between you and your relational database. Not what you want.

Also: Did MySQL & Mongo have a beautiful baby called Aurora?

2. How much longer?

Another question you’ll ask yourself is, how long will this maintenance take? With MySQL at the command line, you can run through test after test & time the process. When you go to perform tasks offhours, you already have a clear picture.

With RDS, things can’t be predicted. Servers are restarted when they needn’t be. Rebuilds take forever, and you have no progress bar. EBS performance has a hiccup and your snapshot time doubles. The troubles go on and on.

Related: Is automation killing old-school operations?

3. Why did Amazon just force an OS upgrade?

Here’s another surprise I ran into. Again we have a managed solution, so Amazon must take opportunities when they can. But you pay for it in unpredictability.

Going to perform a MySQL 5.1 to 5.5 upgrade, and I’d run through test after test in advance. Timed the process to about 45 minutes. Then went to do it in production. Amazon decided to throw in the OS upgrade too, adding 40 minutes of surprise time. What’s worse? No progress bar on that either.

Upgrades are nerve wracking enough, without this kind of stuff scaring the daylights out of you.

Read: Do managers underestimate operational cost?

4. What’s happening on my server?

All of the questions about progress are opaque on RDS because you lack the command line. You can’t watch processes, disk I/O or any of the granular stuff. In my surgery analogy below, it’s as though you can’t touch the patient, find their pulse or gauge if their skin is cold, clammy or pale.

Also: Is the difference between dev & ops a four-letter word?

5. Surgery with blunt instruments

At the end of the day, RDS feels like surgery with blunt instruments. If command line were your scalpel, windows & GUI tools may be your remote video surgery. And worse still, RDS would be like doing surgery on the Opportunity Mars rover, after it’s landed & stuck in a valley. Everything is delayed, it’s hard to tell what’s going on, and the worst environment to work in when you have an emergency with your database.

If you have any operations experience, deploy your own MySQL on an EC2 instance. You’ll thank yourself later.

Also: Is zero downtime even possible on RDS?

Upside to RDS

Is there any upside? Why do people use it? Push-button replication. Check. Push-button multi-az, check. Those are great if you have no DBA. Automated backups so you don’t shoot yourself in the foot, check.

I guess there is *something* to love.

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters