Big data scientist interview questions

Screen-shot-2012-08-02-at-1.28.35-PM

Everybody wants to hire a data scientist these days. According to read write the role is overhyped and overpaid. Hype aside, what’s a good approach to interviewing these hard to find people?

Here’s Read Write’s guide. Also Hilary Mason has an interview guide. Also take a look at Chris Pearson’s data scientist hiring guide.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

While you’ll surely have technical questions to ask, we figure you may already have a handle on those. They’ll vary from business to business.

What we’ve put together is a series of questions that we hope will tease out some good stories, and underscore a candidates real-world experience. These are are also great for the cross section of folks involved in the hiring process, higher level managers, HR & recruiters, plus technical folks that may have data near & dear to them.

1. What’s common?

What key metrics do you see firms repeatedly missing? Why are they important?

You’ve worked as data scientist before, and run into a lot of problems at different firms. Inevitably, some of those repeat themselves. Give an example of a metric you see over-and-over, that’s essential, but often missing attention.

Also: Is the difference between dev & ops a four-letter word?

2. What’s your favorite?

What is your favorite KPI and why?

As a data scientist, you’ve probably approached different companies, and found a couple of indicators that you particularly like. Maybe they highlight potential for growth? Or lead to other interesting discoveries about the business?

Related: Is automation killing old-school operations?

3. Let’s talk dollars

Give an example of a financial benefit you brought to a firm. How & how much?

Give an example where a measurement you made, and a business change it informed had real ROI for the business. What was that discovery? How did the business make the change? What was the financial benefit to the bottom line?

Read: Do managers underestimate operational cost?

4. Business data discovery

Give an example where you discovered data the business didn’t know it had. What & How?

Sometimes businesses have assets stored, that have been forgotten. Perhaps they’ve been archived, or a collection job has been forgotten. Perhaps it’s a corner of salesforce that hasn’t been evaluated. How did you bring the new data to light, and make use of it?

Also: Is the difference between dev & ops a four-letter word?

5. Why do you love data?

Why is data scientist your chosen career path?

This is an open ended question, but should spark some stories. Perhaps the candidate enjoys working with tech, product & biz-ops equally? Why are your skills uniquely suited to the role over other technical careers?

Also: Is the difference between dev & ops a four-letter word?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Why you need a performance dashboard like StackExchange

stackexchange

Most startups talk about performance crucial. But often with all the other pressing business demands, it can be forgotten until it becomes a real problem.

Flipping through High Scalability today, I found a post about Stack Exchange’s performance dashboard.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

The dashboard for Stack Exchange performance is truly a tectonic shift. They have done a tremendous job with the design, to make this all visually appealing.

But to focus just on the visual aesthetics would be to miss many of the other impacts to the business.

1. Highlight reliability to the business

Many dashboards, from Cacti to New Relic present performance data. But they’re also quite technical and complicated to understand. This inhibits their usefulness across the business.

The dashboard at Stack Exchange boils performance down to the essentials. What customers are viewing, how quickly the site is serving them, and where bottlenecks are if any.

Also: Is the difference between dev & ops a four-letter word?

2. What’s our architecture?

Another thing their dashboard does is illustrate their infrastructure clearly.

I can’t count the number of startups I’ve worked at where there are extra services running, odd side utility boxes performing tasks, and general disorganization. In some cases engineering can’t tall you what one service or server does.

By outlining the architecture here, they create a living network diagram that everyone benefits from.

Related: Is automation killing old-school operations?

3. Because Fred Wilson says so

If you’re not convinced by what google says, consider Fred Wilson who surely should know. He says speed is an essential feature. In fact *the* essential feature.

The 10 Golden Principles of Successful Web Apps from Carsonified on Vimeo.

Read: Do managers underestimate operational cost?

4. Focus on page loading times!

If you scroll to the very bottom of the dashboard, you have two metrics. Homepage load time, and their “questions” page. The homepage is a metric everyone can look at, as many customers will arrive at your site though this portal. The questions page will be different for everyone. But there will be some essential page or business process that it highlights.

By sifting down to just these two metrics, we focus on what’s most important. All of this computing power, all these servers & networks are all working together to bring the fastest page load times possible!

Also: Is the difference between dev & ops a four-letter word?

5. Expose reliability to the customer

This performance page doesn’t just face the business. It also faces the customers. It lets them know how important speed is, and can underscore how serious the business takes it’s customers. Having an outage or a spike that’s slowing you down. Customers have some transparency into what’s happening.

Also: Is the difference between dev & ops a four-letter word?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

If you’re building a startup tech blog you need to ask yourself this question

Editor & writer in friendly dialog

Join 28,000 others and follow Sean Hull on twitter @hullsean.

I work at a lot of startups, and these days more and more are building tech blogs. With titles like labs or engineering at acme inc, these can be great ways to build your brand, and bring in strong talent.

So how do we make them succeed? It turns out many of the techniques that work for other blogs apply here, and regular attention can yield big gains.

1. Am I using snappy headlines?

Like it or not we live in a news world dominated by sites like Upworthy, Business Insider, Gawker & Huffpo. Ryan Holiday gained fame using a gonzo style as director of marketing at American Apparel. Ryan argues that old-style yellow journalism is back with a vengence.

Click bait asside, you *do* still need to write headlines that will click. What works often is for your title to be a little sound bite, encapsulating the gist of your post, but leaving enough hook that people need to click. Don’t be afraid to push the envelope a bit.

Also: Which tech do startups use most?

2. Line up those share buttons & feedburner

Of course you want to make the posts easy as hell to share. Cross posting on twitter, linkedin, facebook and whereever else your audience hangs out is a must. Use tools like hootsuite & buffer to line up a pipeline of content, and try different titles to see which are working.

You’ll also want to enable feedburner. Some folks will add your blog to feedly. Subscriber counts there can be a good indication of how it is growing in popularity too.

Related: Do today’s startups assemble software at their own risk?

3. Watch & listen to google analytics

You’re going to keep an eye on traffic by installing a beacon into your page header. There are lots of solutions, GA being the obvious one because it’s free. But how to use it?

Ask yourself questions. Who are my readers? Where are they coming from? How long do they spend on average? Do some pages spur readers to read more? Is there copy that works better for readers? Are my readers converting?

It’ll take time if you’re new to the tool, but start with questions like those.

Read: Is automation killing old-school operations?

4. Optimize your SEO a little bit

Although you don’t want to go overboard here, you do want to pay some attention. Using keyword rich titles, and < h2 > tags, along with wordpress SEO plugins that support other meta html tags means you’ll be speaking the language search engines understand. Add tags & categories that are relevant to your content.

Don’t overdo it though. Stick to a handful of tags per post. If you add zillions with lots of word order combinations & so forth, this kind of stuff may tip of the search engines in ways that work against you.

Check out: How to hire a developer that doesn’t suck

5. Search for untapped keywords

When I first started getting serious about blogging, I had an intern helping me with SEO. She did some searching with the moz keyword research tools and found some gems. These are searches that internet users are doing, but for which there still is not great content for.

For example if results showed “cool tech startups in gowanus brooklyn” had no strong results, then writing an article that covered this topic would be a winner right away.

These are big opportunities, because it means if you write directly for that search, you’ll rank highly for all those readers, and quickly grow traffic.

Read also: 5 things toxic to scalability

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is Zero downtime even possible on RDS?

amazon rds mysql

Join 29,000 others and follow Sean Hull on twitter @hullsean.

Oh RDS, you offer such promise, but damn it if the devil isn’t always buried in the details.

Diving into a recent project, I’ve been looking at upgrading RDS MySQL. Major MySQL upgrades can be a bit messy. Since the entire engine is rebuilt, queries performance can change, syntax can break, and surely triggers & stored procedures can have problems.

That’s not even getting into it with storage engines. Still have some tables on MyISAM? Beware.

The conclusion I would make is if you want zero downtime, or even nearly zero, you’re going to want to roll your own MySQL on EC2 instances.

Read: Why high availability is so very hard to deliver

1. How long did that upgrade take?

First thing I set out to do was upgrade a test instance. One of the first questions my client asked, how long did that take? “Ummm… you know I can’t tell you clearly.” For an engineer this is the worst feeling. We live & die by finding answers. When your hands are tied, you really can’t say what’s going on behind the curtain.

While I’m sitting at the web dashboard, I feel like I’m trying to pickup a needle with thick leather gloves. Nothing to grasp here. At one point the dashboard was still spinning, and I was curious what was happening. I logged out and back in again, and found the entire upgrade step had already completed. I think that added five minutes to perceived downtime.

Sure I can look at the RDS instance log, and tell you when RDS logged various events. But when did the machine go offline, and when did it return for users? That’s a harder question to answer.

Without command line, I can’t monitor the process carefully, and minimize downtime. I can only give you a broad brush idea of what’s happening.

Also: RDS or MySQL 10 use cases

2. Did we need to restart the instance?

RDS insists on rebooting the instance itself, everytime it performs a “Modify” operations. Often restarting the MySQL process would have been enough! This is like hunting squirrels with a bazooka. Definitely overkill.

As a DBA, it’s frustrating to watch the minutes spin by while your hands are tied. At some point I’m starting to wonder… Why am I even here?

Related: Howto automate MySQL slow query analysis with Amazon RDS

3. EBS Snapshots are blunt instruments

RDS provides some protection against a failed upgrade. The process will automatically snapshot your volume before it begins. That’s great. If I spend

See also: Is Amazon RDS hard to manage

4. Even promoting a read-replica sucks

I also evaluated using a read-replica. Here you spinup a slave first. You then upgrade *THAT* box to 5.6 ahead of your master. While your master is still sending data to the slave, your downtime would in theory be very minimal. Put master in read-only mode, wait few seconds for slave to catchup and switch application to point to slave, then promote it!

All that would work well with command line, as your instances don’t restart. But with RDS, it takes over seven long minutes!

Read this: 5 Reasons to move data to Amazon Redshift

5. RDS can upgrade to MySQL 5.6!

MySQL 5.6 introduced a new timestamp datatype which allows for fractional seconds. Great feature, but it means the on-disk datastructures are different. Uh oh!

If you’re doing replication with MySQL 5.5 to 5.6 it will break because the rows will flow out in one size, and break the 5.6 formatted datafiles! Not good.

The solution requires running ALTER commands run on the master beforehand. That in turn locks up tables. So it turns out promoting a read-replica is a non-starter for 5.5 to 5.6. Doesn’t really save much.

All of this devil in the details stuff is terrible when you don’t have command line access.

Read: Are SQL databases dead?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters