Tag Archives: analytics

Five ways to get your data into Redshift

redshift data pipeline

Everybody is hot under the collar these days over Redshift. I heard one customer say a query that took 10 hours before now finishes in under a minute, without modification. When businesses see a 600-times speedup, that can change the way they do business.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular, and keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options is RJ Metrics Pipeline. Set up a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL, specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.
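
If you’re curious what that grant code amounts to, it’s along these lines. This is just a sketch; the user name, password & exact privileges RJM asks for are my assumptions, so use the grant code they give you:

    -- create a replication-capable user for the pipeline (names are hypothetical)
    CREATE USER 'rjmetrics'@'%' IDENTIFIED BY 'use-a-strong-password';
    GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'rjmetrics'@'%';
    FLUSH PRIVILEGES;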

rjmetrics table config screen

RJM uses a primary or unique key to figure out which rows have changed. Well, that’s not completely true. That’s only if you’re using incremental refresh. If you’re using complete refresh, it just selects all the data & replaces it each time.
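
In other words, the two refresh modes boil down to queries roughly like these. Table & key names here are hypothetical:

    -- incremental refresh: only pull rows past the last key already loaded
    SELECT * FROM orders WHERE id > 1234567 ORDER BY id;

    -- complete refresh: pull everything & replace the target table wholesale
    SELECT * FROM orders;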

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

Still RJM works and it’s pretty point-n-click.

Also: Is Amazon too big to fail?

2. Xplenty

Xplenty is really a lot more than just a sync tool. It’s a full-featured ETL system. Want to avoid writing tons of Python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well, Xplenty is made to allow building ETL systems without code.
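
That 0-to-paid, 1-to-free example is exactly the kind of transform you’d otherwise hand-code. In SQL it’s a simple CASE expression; the table & column names here are made up for illustration:

    -- map numeric plan codes to readable labels
    SELECT user_id,
           CASE plan_code
                WHEN 0 THEN 'paid'
                WHEN 1 THEN 'free'
           END AS plan_name
    FROM users;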

xplenty main dashboard

It’s a bit complex to set up at first, but very full featured. It’s the DIY developer’s or DBA’s tool of the bunch. If you need hardcore functionality, Xplenty seems to have it.

Also: When hosting data on Amazon turns bloodsport

Is Data your dirty little secret?

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again, as with Flydata & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near real-time sync.

alooma dashboard

Although it’s a bit more complex to set up than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen means your data team can be very proactive. This is great for the business.

What’s more, there is a Python-based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL? This is a way to do that. You can also send notifications, or requeue events. All this means you get a state-of-the-art pipeline, with good configurability & logging.

Read: Is aws a patient that needs constant medication?

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near real-time sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

Related: Does Amazon eat its own dogfood?

5. Other options

There are lots of other ways to get data into Redshift.

Flydata

I did manage to get Flydata working at a customer last year. It’s a very viable option. I wrote at length about that solution, so I’ll leave you to read all about it there.

AWS Data Pipeline

I’ve started to kick the tires of AWS Data Pipeline but haven’t decided if it’s the best option for customers.

Nightly rebuild

The Donors Choose tech blog posted about their project which can move data from Postgres to Redshift. You can find the project here.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you’re using MySQL as the source db, you’ll need to change code. One thing I found in there was Perl & sed commands to transform your source schema CREATE & ALTER statements into Redshift-compatible ones. That in itself is worth a look.
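
To give a flavor of the rewriting involved, here’s a hypothetical before & after. MySQL-isms like AUTO_INCREMENT, TEXT, DATETIME & the ENGINE clause all have to go:

    -- MySQL source DDL
    CREATE TABLE donations (
      id INT AUTO_INCREMENT PRIMARY KEY,
      amount DECIMAL(10,2),
      note TEXT,
      created DATETIME
    ) ENGINE=InnoDB;

    -- Redshift-compatible equivalent
    CREATE TABLE donations (
      id INTEGER NOT NULL,
      amount DECIMAL(10,2),
      note VARCHAR(65535),
      created TIMESTAMP
    );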

Lambda to the rescue

The awslabs github team has put together a lambda-based Redshift loader. This might be just what you need. Remember though that you’ll need to deliver your source data in CSV files to S3 on the regular. So you’ll need some method to dump it. But if you have that half of the equation, this is ideal.
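
For the dump half, something like SELECT INTO OUTFILE works on a self-managed MySQL; then push the file to S3 with the aws cli. Table & path are hypothetical here, and note that on RDS you don’t get filesystem access, so there you’d pipe the output of the mysql client to a file instead:

    -- dump a table to CSV for the S3/lambda loader
    SELECT id, amount, created
    FROM donations
    INTO OUTFILE '/tmp/donations.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';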

Data Migration Service or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?

Tungsten

I tried & tried & tried to get Tungsten to work. I did have some success but was still blocked by data problems which remained unresolved. To my mind the project is still broken or at least very buggy.

Also: Is AWS too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

5 data points I track for reputation & career building

When I tell people I’ve been independent for two decades, they often look at me surprised. How do you do that? How do you keep business coming in?

recent linkedin views

Join 32,000 others and follow Sean Hull on twitter @hullsean.

As a freelancer you surely have to be on top of changing trends, and where the wind is blowing. But whether you’re a CEO or CTO of a larger firm, or a developer, HR or marketing director, you can also benefit by actively tracking yourself. Career building never ends…

1. Real Leads

This is probably the hardest metric to track, but the most important. A lead is anyone who may potentially hire my services. These can come from Linkedin, newsletter subscribers, or via a Google search. I track how they reached me, and how warm the lead is.

I do also track when recruiters reach out, as I think this can serve as a useful barometer as well. Also, as my blog has grown, I get a lot of SEO bloggers fishing for sites they can post backlinks on. Although I rarely entertain them, it’s a useful reflection of how popular the site is getting.

Also: Are we fast approaching cloud-mageddon?

2. Newsletter signups

I think of the newsletter as an extension of my blog. I invite everyone I’ve ever touched in business, from coworkers to colleagues at meetups & conferences. I invite recruiters & headhunters as well, because name recognition & reputation building is also important.

The newsletter is a way to show up in the inbox of everybody you’ve ever worked with. Month after month, year in and year out, you’re plodding away & doing your thing. It’s a reminder that you’re out there. Colleagues, CEOs & CTOs refer me all the time. It’s been very valuable over ten years.

newsletter signups

I also track email opens & email clicks. Those range around 25% and 10% respectively. I know when I’ve hit a topic that resonates & try to have that inform future content direction.

Related: The Myth of Five Nines

3. Linkedin Views

Linkedin is super valuable too. They provide a nice graph of how many times your profile was viewed each week, going back 90 days. This is super useful to find out if your resume & profile are keyword rich.

I like to actively tweak my profile for the latest trending terminology. For example, in the ’90s Unix Administrator or Systems Administrator was common, but nowadays everyone likes to say SRE. What’s that? Site Reliability Engineer. Yes, it’s a buzzword, and as it turns out people use trending terms & buzzwords to search for people with your skills.

So get on it, and edit those terms!

Read: Is Amazon too big to fail?

4. Website Visitors

In a services business you don’t usually sell widgets on your website. However, I like to think of a web presence as my business card. So in that light, more visitors means more renown. That projects your personal brand, and builds it long term.

website visitors

Also: When hosting data on Amazon turns bloodsport

5. Klout Score

Klout score is a rough measure of how active you are across social media. Twitter is a big one, but it also finds you on Linkedin & other platforms as well. Although the score is far from perfect, it does give you a sense of reputation & notoriety, which do ultimately translate to business.

Also: 5 Things Toxic To Scalability

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

Sean Hull interviewed on the Doppler Cloud podcast

I recently got a chance to talk with Mike Kavis over at Cloud Technology Partners. It was fun to get away from the keyboard, and in front of the microphone for a change.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

1. Docker

Docker is making deployments easier & easier. But as the pace accelerates, are we introducing vulnerabilities & scalability problems faster than we can fix them?

Also: Are generalists better at scaling the web?

2. Redshift

I’ve blogged that I don’t work with recruiters but I do chat with them regularly.

In a recent conversation a recruiter asked me:

“Why is it that suddenly everyone is looking for Redshift?”

I’m seeing the same trend. And if you look at Hadoop you might see why. Writing SQL queries against Redshift data is wildly simpler than writing EMR jobs for Hadoop.

Related: Why Dropbox didn’t have to fail?

3. Devops automation

These days I hear a lot of talk that all operations is software development. Are you still SSHing into boxes? You’re doing it wrong!

Read: When hosting data on Amazon turns bloodsport

4. Hardware solves all speed problems

Having performance problems? Scale out! Database slow? Scale up! These days it seems the old short-sighted way of thinking is back with a vengeance. Throw hardware at the problem and kick the can down the road.

Also: How to hire a developer that doesn’t suck?

5. Amazon disrupting VC

During dot-com version one-point-oh, you’d need hundreds of thousands of dollars to buy hardware & software licenses to get an idea off the ground. That necessarily meant raising real VC money.

Amazon web services & on-demand computing has brought world class infrastructure to even the smallest startups. For just dollars, they can get started.

Now we’re seeing startups get going with micro investments from the likes of Angel List syndicates. Cutting traditional VCs right out of the equation.

Also: Is Amazon too big to fail?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

Is Redshift outpacing Hadoop as the big data warehouse for startups?

redshift hadoop killer

More and more startups are looking at Redshift as a cheaper & faster solution for big data & analytics.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Saggi Neumann posted a pretty good side-by-side comparison of Redshift & Hadoop, and concluded it was a tie, depending on your individual use case.

Meanwhile Bitly engineering concluded Redshift was much easier.

1. More agile

One thing pointed out by the bitly blog post, which I’ve seen countless times, is the slow iteration cycle. Write your map-reduce job, run, test, debug, then run on your cluster. Wait for it to return and you might feel like you’re submitting a stack of punched cards. LOL. Resolve the errors that come back and rerun on your cluster. Over & over & over again.

With Redshift you’re writing SQL, so you’re iterating through syntax errors quickly. What’s more, since Redshift is a column-compressed database, you can do full table scans on columns without indexes.

What that means for you and me is that queries just run. And they run blazingly fast!
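
For example, a rollup like this scans one or two columns end-to-end and still comes back fast, no indexes required. The orders table here is hypothetical:

    -- daily revenue rollup; column storage & compression keep the scan cheap
    SELECT date_trunc('day', created) AS day,
           count(*) AS orders,
           sum(amount) AS revenue
    FROM orders
    GROUP BY 1
    ORDER BY 1;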

Also: When hosting data on Amazon turns bloodsport

2. Cheap

Redshift is pretty darn cheap.

Saggi’s article above quotes Redshift at $1000/TB/yr for reserved, and $3700/TB/yr for on-demand. This compared with a hadoop cluster at $5000/TB/yr.

But neither will come within spitting distance of the old world of Oracle, where customers host big iron servers in their own datacenter, paying north of a million dollars between hardware & license costs. Amazon cloud FTW!

Related: Did dropbox have to fail?

3. Even faster

Airbnb’s nerds blog has a post showing Redshift costing 25% as much as a Hadoop cluster, while getting a 5x performance boost. That’s pretty darn impressive!

Flydata has done benchmarks showing 10x speedup.

Read: Are SQL Databases dead?

4. SQL Toolchains

***

Also: 5 core pieces of the Amazon cloud puzzle to get your project off the ground

5. Limitations

o data loading

You load data into Redshift using the COPY command. This command reads flat files from S3 and dumps them into tables. It can be extremely fast if you do things in parallel. However, getting your data into those flat files is up to you.
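
A minimal COPY looks like this. Bucket, paths & credentials are placeholders; split your data into multiple files sharing a common prefix and Redshift loads them in parallel across slices:

    COPY orders
    FROM 's3://my-bucket/orders/part'
    CREDENTIALS 'aws_access_key_id=YOUR-KEY;aws_secret_access_key=YOUR-SECRET'
    -- newer clusters can use IAM_ROLE '...' in place of keys
    CSV
    GZIP;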

There are a few solutions to this.

– Amazon Data Pipeline

This is Amazon’s own toolchain, which allows you to move data from RDS & other Amazon-hosted data sources. Data Pipeline does not move data in real-time, but in batches. Also it doesn’t take care of schema changes, so you have to do that manually.

I mentioned it in my 5 reasons to move data to Amazon Redshift

– Flydata service

Flydata is a service with a monthly subscription which will connect to your RDS database, and move the data into Redshift. This seems like a no-brainer, and given the hefty price tag of thousands per month, you’d expect it to cover your bases.

In my experience there are a lot of problems & it still required a lot of administration. When schema changes happen, those have to be carefully applied on Redshift. What’s more there’s no silver bullet around the datatype differences.

Also: Some thoughts on 12 factor apps

Flydata also makes use of the binary logs to replicate your data. Anything that doesn’t show up in the binary logs is going to cause you trouble. That includes when you set sql_log_bin=0 in the session, or when an SQL statement includes a no-logging hint. Also watch out for replicate-ignore-db options in your my.cnf. It will also fail if you use ON DELETE CASCADE, because those downstream changes happen via constraint in MySQL. But… drumroll please, Redshift doesn’t support ON DELETE CASCADE. In our case the child tables ended up with extra rows, and some queries broke.
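
Here’s the classic way changes sneak past the binlogs. Table names are hypothetical:

    -- nothing in this session gets binlogged, so CDC tools never see it
    SET sql_log_bin = 0;
    UPDATE accounts SET status = 'closed' WHERE id = 42;
    SET sql_log_bin = 1;

ON DELETE CASCADE is sneakier still: delete a parent row and MySQL quietly removes the children via the constraint, but Redshift has no such constraint to fire, so the child rows linger.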

– scripts such as the Donors Choose loader

Donors Choose has open sourced their nightly Redshift loader script. It appears to reload all data each night. This will nicely sidestep the ON DELETE CASCADE problem. As you grow though you may quickly hit a point where you can’t load the entire data set each night.

Their script sources from Postgres, though I’m interested to see if it can be modified for MySQL RDS.

– Tried & failed with Tungsten replicator

Theoretically Tungsten replicator can do the above. What’s more, it seems like a tool custom-made for such a use case. I tried for over a month to troubleshoot. I worked closely with the team to iron out bugs. I wrote Wrestling with bears or how I tamed Tungsten replicator for MySQL, and then a second article, Tungsten replicator the good the bad & the ugly. Ultimately I did get some data moving between MySQL RDS & Redshift, however certain data clogged the system & it wouldn’t work for any length of time.

Also: Secrets of a happy Amazon hacker or how to lock down your account with IAM and multi-factor authentication

o data types & character sets

There are a few things here to keep in mind. Redshift counts bytes, so if in MySQL or some other database you had a varchar(5), it may be varchar(20) in Redshift. Even then I had cases where it still didn’t fit & I had to make the field bigger by 4.
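
A concrete sketch of the sizing difference, with a hypothetical table:

    -- MySQL: the length is in characters
    CREATE TABLE users (name VARCHAR(5));

    -- Redshift: the length is in bytes, so worst-case 4-byte UTF-8
    -- means 5 characters can need 20 bytes
    CREATE TABLE users (name VARCHAR(20));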

I also ran into problems around string character encodings. According to the docs Redshift handles 4-byte UTF-8.

Redshift doesn’t support ARRAYs, BIT, BYTEA, DATE/TIME, ENUM, JSON and a bunch of others. So don’t go into it expecting full Postgres support.

What you will get are multibyte characters, numeric, character, datetime, boolean and some type conversion.

Also: Is the difference between dev & ops a four-letter word?

o rebalancing

If and when you want to add nodes, expect some downtime. Yes, theoretically the database is online while it’s shipping data to the new nodes & redistributing things, but the latency can start to feel like an outage. What’s more, it can easily push into the hours to do.

Also: Is AWS enabling startups which enable AngelList Syndicates to boil the VC business?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

Why I like Etsy’s site performance report

etsy code as craft

Etsy publishes a great tech blog titled Code As Craft.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

I was recently sifting through some of their newer posts & stumbled upon their Q2 2015 Site Performance Report. It’s really in-depth, though not impossibly technical. Here’s what I liked.

1. Transparency to business & public

Show real performance to customers

The first thing I thought while reading is the strong show of transparency. The blog is public, so it’s not just an internally facing document shared within the company, but something shared with the wider world. True, presented as a technical post it may only appeal to a segment of readers, but it’s great nonetheless.

Show real performance to non-technical business units

I think this kind of analysis & summary also provides transparency to the business itself. Product teams, business operations & sales teams can all view what’s happening. Where are there problems? What is being done to address them?

Also: When hosting data on Amazon turns bloodsport

2. Highlighting change

Added pagination to the cart

One thing that popped out was the discussion of pagination changes that impacted page load times in the shopping cart. Page load times in the shopping cart are particularly crucial, because that’s where customers can “abandon” an order out of frustration.

Illustrating performance impact to product decisions

When product is evaluating a new feature, and they can see how changes affect performance, it better *sells* what all those engineering resources are being used for.

Related: 5 reasons to move data to amazon redshift

3. Where we don’t have data

We can’t analyze what data we haven’t captured

The report highlights that data around the shopping cart is new. That’s great because it highlights the value that collecting data offers, by providing new insights that were not available previously. This also pushes for more metrics collection & analysis as the business begins to see the value of all of this gymnastics.

Read: Is Amazon too big to fail?

4. Product tradeoffs

The discussion around the shopping cart performance also illustrates how the business makes product decisions. The engineering team can only build & write so much code. Deciding to spend time on pagination, means time not spent on some other new feature. Which is more valuable? Selling new feature A in one corner of the product, that customers may spend real money on? Or speeding up page load times on page B?

Also: Is Apple betting against big data?

5. Cleaner data

At a Look & Tell event, I heard Lincoln Ritter talk about Data as a product to the business.

When you expose a performance report like this to the business, an iterative process begins to happen. The company gains insight from the report, makes better decisions, and thus can spend more energy, time & resources on clean data. Cleaner data in turn means better reports, which produce better decisions & so on.

Also: What is venue analytics & why is it important?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters

My blog traffic is growing – using 5 killer tactics

With a million websites & blogs offering endless advice on growing your blog, it took me a while to get it right. But after months of experimentation I’m getting steady growth in traffic. Where once 10,000 visitors in a month seemed like a dream, now it’s regular.

Here’s what I do.

Also find Sean Hull’s ramblings on twitter @hullsean.

1. Think up a killer title or something controversial 

Trending topics are good. Even better is to tie in big events, like Hurricane Sandy, that affected everyone. Follow very clickable sites like businessinsider.com and use their model for writing titles. They are hard to resist, aren’t they? Beware actual link baiting though; this won’t win any friends at Google.

Also take a look at Why generalists are better at scaling the web.

2. Write content around that title

It may be counterintuitive to write the title first. But it really informs the content, and the frame of your writing. Also pull out highlights and QUOTE them so they stand out. Also use nice solid bullet sections for easy scan reading. The internet is all about scanning through material, so make this easy for your audience.

A popular piece: MySQL DBA Interview Guide.

Also, and this is big, provide lots of INBOUND links to your popular content. This will give you a pile-on effect, driving more pageviews and lowering bounce rate! That’s killing two birds with one stone.

[quote]Start with a great title, something that is irresistible, write content that is easy to scan & highlights your points, then let the world know about it by sharing creatively.[/quote]

3. Share the shit out of the content

Excuse my french, but if you don’t share it, people won’t find it. And there’s a trick to this too.

Another popular one we wrote: Why the Android Ecosystem is Broken.

First, search twitter for related posts, then reply or comment with your link. Say something relevant; don’t just spam your links. Of course you’ll want to post to StumbleUpon, Linkedin & news.ycombinator.com/submit as well.

You might also look into blog carnivals, services like ping.fm, as well as tools that automate bookmarking across many sites.

4. Search news.google.com for blogs talking about the topic

Since you already chose a trending or in-the-news topic, you’ll be able to search and find other bloggers talking about it. Pick blogs with higher page ranks; 5, 6 or 7 are nice. Read, then comment & share your link. Again, don’t just spam your links here, but provide some reasoned commentary. This shows off your personality, and provides incentive for people to want to read more of your content. I find this very easy for disqus blogs, and really focus my efforts on those.

Why is it so hard to locate & hire The Mythical MySQL DBA?

5. Watch your analytics to see where you got traction.

This is the fun part, at least for me. Click on the content, then select secondary sort “traffic sources” to figure out where people are coming from. Tune your efforts to the sources and techniques that work for you. This will depend on your particular audience.

I also like to view Content > All Pages in Google Analytics. Then find the horizontal graph bars button (rollover for Performance). Click that and you can see % pageviews by content title. Very handy.

6. Rinse & repeat

This is my favorite part. Once you see what you are doing right, you can do more of it!

Read this far? Grab our Scalable Startups for more tips and special content. Here’s a sample.

Performance Metrics – What are they and why are they important?

In order to understand how fast your website is, we need some numbers. We call these fancy numbers performance metrics: objective measures that we can track over time. We can track them for seasonality as well as website traffic and growth. But we can also track them across feature and application changes, based on deployments, to see if new code has caused performance to improve or degrade.

Some useful business or application performance metrics include:

  • user registrations
  • accounts sold
  • widgets sold
  • user interactions & social metrics
  • so-called gamification, ratings & related

We also want to capture lower-level system metrics with a tool like Cacti, Ganglia, Munin, OpenNMS, Zabbix or similar:

  • cpu
  • memory
  • disk
  • network
  • load average

Along with the basic system-level metrics you’ll want to collect some at the database level, such as the following (a quick way to sample a few of these by hand appears after the list):

  • InnoDB Buffer Pool activity
  • Files & Tables
  • Binary log activity
  • Locking
  • Sorting
  • Temporary objects
  • Queries/second
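
If you just want to eyeball a few of these by hand on MySQL, the global status counters are a decent start. Sample a counter twice and divide by the interval to get a rate:

    -- total queries since startup; sample twice, N seconds apart,
    -- then (second - first) / N gives queries/second
    SHOW GLOBAL STATUS LIKE 'Questions';

    -- buffer pool misses that had to go to disk
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';

    -- temporary objects spilling to disk
    SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';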

Sean Hull asks on Quora – What are web performance metrics and why are they important?

Big Data – What is it and why is it important?

There’s lots of debate about exactly what constitutes “big” when talking about big data.  Technical folks may be inclined to want a specific number.

But when most CTOs and operations managers are talking about big data, they mean data warehouse and analytics databases.  Data warehouses are unique in that they are tuned to run large reporting queries and churn through large multi-million row tables.  Here you load up on indexes to support those reports, because the data is not constantly changing as in a web-facing transaction oriented database.

More and more databases such as MySQL, which were originally built as web-facing databases, are being used to support big data analytics.  MySQL does have some advanced features to support large databases, such as partitioned tables, but many operations still cannot be done *online*, such as table alters and index creation.  In these cases configuring MySQL in a master-master active/passive cluster provides higher availability.  Perform blocking operations on the inactive side of the cluster, and then switch the active node.
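
A sketch of that dance, with a hypothetical table. Keeping the DDL out of the binlog stops it replicating to the active node prematurely:

    -- on the inactive node only
    SET sql_log_bin = 0;
    ALTER TABLE orders ADD INDEX idx_created (created);
    SET sql_log_bin = 1;
    -- let replication catch up, switch the application to this node,
    -- then repeat the same change on the other side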

We’ve worked with MySQL databases as large as 750G in size, and single user tables as large as 40 million records, without problems.  Table size, however, has to be taken into consideration for many operations and queries.  But as long as your tables are indexed to fit the query, and you minimize table scans especially on joins, your MySQL database server will happily support these huge datasets.
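
To check that a query fits its indexes, EXPLAIN is your friend. The tables here are hypothetical; what you’re looking for is index lookups rather than full scans on the join:

    EXPLAIN SELECT o.id, u.name
    FROM orders o
    JOIN users u ON u.id = o.user_id
    WHERE o.created >= '2016-01-01';

    -- if the WHERE clause shows a full scan, add an index to fit it
    ALTER TABLE orders ADD INDEX idx_created (created);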

Sean Hull discusses on Quora – What is Big Data and why is it important?

Venue Analytics – What is it and why is it important?

Analytics provide insight into what your web traffic represents.  It helps you answer questions like:

  • Who visits my website and what do they read?
  • What do those users click on?
  • How can I turn those users into customers?

Venue analytics is a growing area of tracking that provides this type of insight to venues, restaurants, and other brick-and-mortar businesses.  If users are clicking around on Google, Yelp, Menupages, or New York Mag, or finding a restaurant some other way, they are typically using their mobile phones to do so.  So venue analytics provides tools to businesses to answer questions like:

  • Who is searching for an Italian restaurant like mine?
  • What other restaurants did they browse before coming to my restaurant?
  • They browsed my restaurant, but went elsewhere, why?
  • What can I do to entice customers when they are browsing by mobile phone?

Sean Hull asks on Quora: What is venue analytics and why is it important?