Category Archives: Database Operations

How to interview an amazon database expert


Amazon releases a new database offering every other day. It sure isn’t easy to keep up.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

Let’s say you’re hiring a devops engineer and want to suss out their database knowledge. Or you’re hiring a professional services firm or freelance consultant. Whatever the case, you’ll need to sift through candidates to find the best people. Here’s how.

Also: How to interview an AWS expert

What database does Amazon support for caching?

Caching is a popular way to speed up access to your backend database. Put Amazon’s ElastiCache behind your webserver, and for read-heavy workloads you can reduce load on your database by as much as 90%. Nice!

The two engines that Amazon supports are Memcached & Redis. Memcached is historically more popular. These days Redis seems a clear winner. It’s fast, and it can persist your cached data between restarts. That will save you, I promise!
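To make the caching idea concrete, here is a minimal cache-aside sketch. It is only an illustration: a plain dict stands in for Redis or Memcached, and `CacheAside` and `fake_db_lookup` are hypothetical names, not part of any AWS library.

```python
import time


class CacheAside:
    """Tiny cache-aside wrapper; a dict stands in for Redis or Memcached."""

    def __init__(self, load_from_db, ttl_seconds=60.0):
        self.load_from_db = load_from_db   # hypothetical slow database lookup
        self.ttl_seconds = ttl_seconds
        self.store = {}                    # key -> (expires_at, value)
        self.db_hits = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]                # cache hit: the database is never touched
        self.db_hits += 1                  # cache miss: fall through to the database
        value = self.load_from_db(key)
        self.store[key] = (time.monotonic() + self.ttl_seconds, value)
        return value


def fake_db_lookup(user_id):
    # Stand-in for a real query such as SELECT name FROM users WHERE id = ...
    return f"user-{user_id}"


cache = CacheAside(fake_db_lookup)
results = [cache.get(42) for _ in range(10)]
print(cache.db_hits)  # only the first of the ten reads reaches the database
```

With a real cache the shape is the same: look in the cache first, fall back to the database on a miss, then write the value back with an expiry.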

Also: Is AWS too complex for small dev teams?

How can I store big data in AWS?

Amazon’s data warehouse offering is called Redshift. I wrote about it in Why is everyone suddenly talking about Redshift? Why indeed!

When you’re doing large reports for your business intelligence team, you don’t want to bog down your backend relational database. Redshift is purpose built for this use case.

I’ve seen a report that took over 8 hours in MySQL return in under 60 seconds in Redshift!

A new offering is Redshift Spectrum. This tech is super cool. Load up all your data into S3, in standard CSV format. Then, without even loading it into Redshift, you can query the S3 data directly. This is super useful. Firstly because S3 storage is far cheaper, roughly 1/10th the price. But also because it allows you to stage your data before loading it into Redshift itself. Goodbye Google BigQuery! I talked about Spectrum here.

Related: Which engineering roles are in greatest demand?

What relational database options are there on Amazon?

Amazon supports a number of options through its Relational Database Service, or RDS. These are managed databases, which means less work on your DBA’s shoulders. It may also make upgrades slower and harder, with more downtime, but you get what you pay for.

There are a lot of platforms available. As you might guess, MySQL & Postgres are there. Great! Even better, you can use MariaDB if that’s your favorite. You can also go with Aurora, which is Amazon’s own home-brew drop-in replacement for MySQL that promises greater durability and some speedups.

If you’re a glutton for punishment, you can even get Oracle & SQL Server working on RDS. Very nice!

Read: Can on-demand consulting save startups time & money?

Does AWS have a NoSQL database solution?

If NoSQL is to your taste, Amazon has DynamoDB. According to . I haven’t seen a lot of large production applications using it, but what he describes makes a lot of sense. The way Amazon scales nodes & data I/O is bound to run into real performance problems.

That said it can be a great way to get you up and running quickly.


How do I do ETL & migrate data to AWS?

Let’s be honest, Amazon wants to make this really easy. The quicker & simpler it is to get your data there, the more you’ll buy!

Amazon’s Database Migration Service, or DMS, allows you to configure your old database as a data source, choose an Amazon database solution as the destination, then just turn on the spigot and pump your data in!

ETL is extract, transform and load: data warehouse terminology for slicing and dicing data before you load it into your warehouse. Many of today’s warehouses are being built with the data lake model, because databases like Redshift have gotten so damn fast. That model means you stage all your source data as-is in your warehouse, then build views & summary tables as needed to speed up queries & reports. Even better, you might look at a tool like xplenty.
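Here is a small sketch of that stage-first, summarize-later model. It uses Python’s built-in sqlite3 as a local stand-in for the warehouse, and the `raw_orders` and `daily_sales` tables and their columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: stage the source data as-is, with no transformation on the way in.
conn.execute(
    "CREATE TABLE raw_orders ("
    "order_id INTEGER, customer TEXT, amount_cents INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "ann", 1200, "2017-05-01"),
        (2, "bob", 800, "2017-05-01"),
        (3, "ann", 500, "2017-05-02"),
    ],
)

# Step 2: build a summary table inside the warehouse, only when a report needs it.
conn.execute(
    "CREATE TABLE daily_sales AS "
    "SELECT order_date, COUNT(*) AS orders, SUM(amount_cents) AS total_cents "
    "FROM raw_orders GROUP BY order_date"
)

daily = conn.execute(
    "SELECT order_date, orders, total_cents FROM daily_sales ORDER BY order_date"
).fetchall()
print(daily)
```

The raw table stays untouched, so a new report later on just means another view or summary table over the same staged data.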

Amazon’s new offering is called Glue. See Five ways to get data into Amazon Redshift. This solution is purpose-built for creating a powerful data pipeline, complete with Python code to do transformations.

Read: Is data your dirty little secret?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to build an operational datastore on AWS with S3 & Redshift


You’re building your data warehouse, and getting data into Redshift. You’ve got your ETL pipeline running, and presentation layer talking to the warehouse. Great.

But how to get access to that source data? Wouldn’t it be nice if that was close by too?


It may be you have 10-zillion rows of source data and don’t want or need to get all of that into Redshift and keep it there. But it would be nice to have access to it when you do.

Enter EXTERNAL tables, aka Spectrum. Now you can keep all your raw data in S3, an in place operational datastore of data before it’s been reworked and transformed. Use SQL to access it right where it sits.

Get all the advantages of lifecycle management in S3, and don’t pay all the redshift costs for data you don’t need all the time. Cool!

Let’s see how it works.

What is an EXTERNAL table?

Spectrum is Amazon’s rebranding of an old database technology called EXTERNAL TABLES. Oracle has offered this since its 9i release in the early 2000s, allowing you to essentially map a CSV file that sits outside the database proper. This means you can query all that juicy data sitting in flat files. Cool!

Athena allows you to query this stuff as a service, native to AWS. Spectrum allows you to create those external tables inside of Redshift.

Also: Top serverless interview questions for hiring aws lambda experts

Give Redshift permissions

Go into IAM and create a new role called “SeanSpectrumRole”. Attach the AWS managed policy AmazonS3ReadOnlyAccess. It looks like this:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "*"
        }
    ]
}

If you’re using the dashboard you just pick the policy from the named list. However if you’re using CloudFormation, you’ll use the code above.
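If you script the role instead of clicking through the dashboard, you need two documents: the permissions policy shown above, and a trust policy that lets Redshift assume the role. The sketch below builds both as plain Python dicts. The trust policy is my assumption of what the console sets up for you (the `redshift.amazonaws.com` service principal), and `spectrum_role_documents` is a hypothetical helper name.

```python
import json


def spectrum_role_documents():
    """Return (trust_policy, permissions_policy) for a Spectrum read-only role."""
    # Assumption: the role is assumed by the Redshift service itself.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    # Same read-only S3 permissions as the policy shown in the article.
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": "*",
            }
        ],
    }
    return trust_policy, permissions_policy


trust, permissions = spectrum_role_documents()
print(json.dumps(trust, indent=4))
print(json.dumps(permissions, indent=4))
```

In practice you would probably narrow `"Resource"` to the one bucket Spectrum needs, rather than every bucket in the account.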

Now navigate your aws console to the Redshift dashboard, click clusters, and click the checkbox for your cluster. Probably there’s only one.

Now click the “Manage IAM Roles” button, and a dialog should pop up. Select the role you created earlier, SeanSpectrumRole. Then click “Apply Changes”.

The beauty of the AWS world is that servers themselves can have API permissions. In this case we gave the redshift cluster or server itself, access to S3 for our use below!

Related: Which engineering roles are in greatest demand?

Create your spectrum schema

First you must create a spectrum schema. Here’s the syntax:


create external schema spectrum_schema
from data catalog
database 'sean'
region 'us-east-1'
iam_role 'arn:aws:iam::9999999999999:role/SeanSpectrumRole';

Read: Can on-demand consulting save startups time & money?

Upload your data to S3 bucket

Here we create an s3 bucket called sean_spectrum, then upload one csv file named sean_numbers.txt.


$ aws s3api create-bucket --bucket sean_spectrum --region us-east-1
{
"Location": "/sean_spectrum"
}
$ cd spectrum/
$ cat sean_numbers.txt
21,Dr.,Who,44-22-55-77-88
35,Bat,Man,317-222-4777
15,Wonder,Woman,999-324-7878
99,Storm,Cloud,367-399-6767
75,Marvel,Girl,222-333-9595
32,Quick,Silver,22-33-77-99
12,Scarlet,Witch,23-35-47-555
$ aws s3 cp sean_numbers.txt s3://sean_spectrum/
upload: ./sean_numbers.txt to s3://sean_spectrum/sean_numbers.txt
$ aws s3 ls s3://sean_spectrum/
2017-05-18 20:28:41 193 sean_numbers.txt
$

Note the names. The external table won’t point at the file sean_numbers.txt. It points at the location s3://sean_spectrum/, and every file under that location will be queried. So make sure they all have consistent formats!
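Since every file in the location gets read as part of the same table, it can be worth checking the layouts before you upload. This is a rough sketch of such a check; `column_counts` and `inconsistent_files` are hypothetical helpers, and the second file is an invented example of a stray export with the wrong shape.

```python
import csv
import io


def column_counts(csv_text):
    """Return the set of distinct column counts found in one CSV file's text."""
    reader = csv.reader(io.StringIO(csv_text))
    return {len(row) for row in reader if row}


def inconsistent_files(files, expected_columns):
    """files maps a file name to its CSV text; return names that do not match."""
    return [
        name
        for name, text in sorted(files.items())
        if column_counts(text) != {expected_columns}
    ]


files = {
    "sean_numbers.txt": "21,Dr.,Who,44-22-55-77-88\n35,Bat,Man,317-222-4777\n",
    "stray_export.txt": "id,fname\n1,Nobody\n",  # hypothetical file with the wrong layout
}
bad = inconsistent_files(files, expected_columns=4)
print(bad)
```

It only counts columns, so it won’t catch a file with the right number of fields in the wrong order, but it does catch the most common accident.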

Also: 30 questions to ask a serverless fanboy

Create & query your external table

Here’s how you create your external table. Note this is just a map to the data. The data is still stored in S3. It is not brought into Redshift except to slice, dice & present.


mydb=# create external table spectrum_schema.sean_numbers(
id int,
fname string,
lname string,
phone string)
row format delimited
fields terminated by ','
stored as textfile
location 's3://sean_spectrum/';

Here’s how you query it:


mydb=# select * from spectrum_schema.sean_numbers order by id;
 id | fname   | lname  | phone
----+---------+--------+----------------
 12 | Scarlet | Witch  | 23-35-47-555
 15 | Wonder  | Woman  | 999-324-7878
 21 | Dr.     | Who    | 44-22-55-77-88
 32 | Quick   | Silver | 22-33-77-99
 35 | Bat     | Man    | 317-222-4777
 75 | Marvel  | Girl   | 222-333-9595
 99 | Storm   | Cloud  | 367-399-6767

Cool. We reordered data read from an S3 file!!!

Although you can’t create a view over a redshift table *AND* an S3 external table, you can query them together.

So for example if I have a table in redshift with addresses, I can join them together:

mydb=# select a.id, a.fname, a.lname, a.phone, b.address
from spectrum_schema.sean_numbers a, sean_addresses b
where a.id = b.id order by a.id;

 id | fname   | lname  | phone          | address
----+---------+--------+----------------+--------------
 12 | Scarlet | Witch  | 23-35-47-555   | 10 main st
 15 | Wonder  | Woman  | 999-324-7878   | 25 center st
 21 | Dr.     | Who    | 44-22-55-77-88 | 32 broadway
 32 | Quick   | Silver | 22-33-77-99    | 1 first st
 35 | Bat     | Man    | 317-222-4777   | 99 west st
 75 | Marvel  | Girl   | 222-333-9595   | 66 East Ave
 99 | Storm   | Cloud  | 367-399-6767   | 50 North st

Also: What can startups learn from the DYN DNS outage?


Will SQL just die already?

With tons of new NoSQL database offerings every day, developers & architects have a lot of options. Cassandra, MongoDB, CouchDB, DynamoDB & Firebase to name a few.


What’s more in the data warehouse space, you have Hadoop, which can churn through terabytes of data and get you results back before lunchtime!

So when I stumbled on this article SQL is 43 years old, I was intrigued.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But what if later on you want to ask questions like: which users watched this video, which users are active, which users spent $100 in January? These questions may be hard or impossible to answer, because many NoSQL databases can’t join across those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?
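To show what an unanticipated question looks like in practice, here is the “who spent $100 in January” example as a join plus an aggregate. It is a sketch using Python’s built-in sqlite3, with invented `users` and `purchases` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE purchases (user_id INTEGER, amount_cents INTEGER, purchased_on TEXT)"
)
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cat")])
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, 6000, "2017-01-05"),
        (1, 5000, "2017-01-20"),
        (2, 9000, "2017-01-10"),
        (2, 5000, "2017-02-02"),   # February, so it does not count
        (3, 12000, "2017-01-31"),
    ],
)

# Which users spent at least $100 in January? Nobody planned for this question
# when the tables were designed, but a join and a GROUP BY answer it anyway.
big_spenders = conn.execute(
    """
    SELECT u.name, SUM(p.amount_cents) AS total_cents
    FROM users u
    JOIN purchases p ON p.user_id = u.id
    WHERE p.purchased_on >= '2017-01-01' AND p.purchased_on < '2017-02-01'
    GROUP BY u.id, u.name
    HAVING SUM(p.amount_cents) >= 10000
    ORDER BY u.name
    """
).fetchall()
print(big_spenders)
```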

Also: Top serverless interview questions for hiring aws lambda experts

Big Query, Redshift & even Hive speak SQL

I wrote that despite recent popularity in Hadoop, Redshift seems to be eating their lunch. And what would you know, surprise surprise, Amazon’s newish data warehousing solution, speaks SQL! What’s more there’s Apache Hive, which allows you to query Hadoop with, drumroll please… SQL!

BigQuery is the other major big data offering, from none other than Google. And it too uses SQL!

Related: Which engineering roles are in greatest demand?

Still dominant

If you look at Stack Overflow’s developer survey, you’ll see that SQL is the second most popular language. Why might that be? For one thing it’s simple to learn. Simple enough that even business users can write basic requests that join & aggregate data.

Read: Can on-demand consulting save startups time & money?

Rugged, Proven & Open

SQL, having been around so long, is a fairly open standard. Sure there are extensions to it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That’s a win for everybody.

Also: 30 questions to ask a serverless fanboy

Business users can write it

Another underappreciated feature is that basic queries are easy to write. They don’t require complex syntax like a Hadoop job, or your favorite imperative programming language. The queries are readable, almost English-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!

Also: What can startups learn from the DYN DNS outage?


Key lessons from the Devops Handbook

I picked up a copy of the DevOps Handbook.

This is not a book about how to set up Amazon servers, or how to use git, CodePipeline or Jenkins. It’s not about Chef or Ansible or other tools.


This is a book about processes & people. It’s about how & why automation & world-class infrastructure will make your business more agile, raise quality & increase productivity.

1. Infrastructure in version control

With technologies like Terraform and CloudFormation, the entire state of your infrastructure can be captured. That means you can manage it just like any other code.

Also: Myth of five nines – Why high availability is overrated

2. Pushbutton builds

You’ve heard it before. Automate your builds. That means putting everything in version control, from environment building scripts, to configs, artifacts & reference data. Once you can do that, you’re on your way to automating production deploys completely.

Related: 5 ways to move data to amazon redshift

3. Devs & Ops commingled

In the devops world, devs should learn about operations, infrastructure, performance & more. What’s more operations teams should work closely with devs.

Read: Why were dev & ops siloed job roles?

4. Servers as cattle not pets

In the old days, we logged into servers & provided personal care & feeding. We treated them like pets.

In the new world of devops, we should treat servers like cattle. When one begins to fail, take it out back and shoot it. (tbh I don’t love the analogy, but it carries some meaning…)

Also: Are SQL databases dead?

5. Open to learnings & failures

Organizations that are open to failures, without playing the blame game, learn quicker & recover from problems faster.

Also: Is Amazon too big to fail?


Some irresistible reading for March – outages, code, databases, legacy & hiring


I decided this week to write a different type of blog post, because some of my favorite newsletters are lists of articles on topics of the day.


Here’s what I’m reading right now.

1. On Outages

While everyone is scrambling to figure out why part of the internet went down … wait, is S3 part of the internet, really? While I’m figuring out if it is a service of Amazon, or if Amazon is so big that Amazon *is* the internet now…

Let’s look at s3 architectural flaws in depth.

Meanwhile Gitlab had an outage too in which they *gasp* lost data. Seriously? An outage is one thing, losing data though. Hmmm…

And this article is brilliant on so many levels. Not least because Matthew knows that “post truth” is a trending topic now, and uses it in his title. So here we go, AWS Service status truth in a post truth world. Wow!

And meanwhile the Atlantic tries to track down where exactly are those Amazon datacenters?

Also: Is Amazon too big to fail?

2. On Code

Project wise I’m fiddling around with a few fun things.

Take a look at Jeff Geerling’s Ansible on a Mac playbooks. Nice!

And meanwhile a very nice deep dive on Amazon Lambda serverless best practices.

Brandur Leach explains how to build awesome APIs aka ones that are robust & idempotent

Meanwhile Frans Rosen explains how to 0wn slack. And no you don’t want this. 🙂

Related: 5 surprising features in Amazon’s serverless Lambda offering

3. On Hiring & Talent

Are you a rock star dev or a digital nomad? Take a look at the 12 best international cities to live in for software devs.

And if you’re wondering who’s hiring? Well just about everyone!

Devs are you blogging? You should be.

Looking to learn or teach… check out codementor.

Also: why did dev & ops used to be separate job roles?

4. On Legacy Systems

I loved Drew Bell’s story of stumbling into home ownership, attempting to fix a doorbell, and falling down a familiar rabbit hole. With parallels to legacy software systems… aka any older than oh say five years?

Ian Bogost ruminates why nothing works anymore… and I don’t think an hour goes by where I don’t ask myself the same question!

Also: Are we fast approaching cloud-mageddon?

5. On Databases

If you grew up on the virtual world of the cloud, you may have never touched hardware besides your own laptop. Developing in this world may completely remove us from understanding those pesky underlying physical layers. Yes indeed folks containers do run in “virtual” machines, but those themselves are running on metal, somewhere down the stack.

With that let’s not forget that No, databases are not for containers… but a healthy reminder ain’t bad.

Meanwhile Larry’s mothership is sinking…(hint: Oracle) Does anybody really care? Now’s the time to revisit Mike Wilson’s classic The difference between god and Larry Ellison.

Read: Are SQL Databases Dead?


What products & improvements are new on AWS?

Amazon is releasing new products & services to its global cloud compute network at a rate that has all of our heads spinning.


Here’s new stuff worth mentioning around databases & data.

1. For ETL – AWS GLUE

Moving data from your transactional MySQL or Aurora database to your reporting database isn’t always easy.

In the past you could use a service like xplenty or Alooma.

Now Amazon themselves are getting into the ETL game, providing a new service called Glue.

Also: RDS or Mysql? 10 use cases

2. Query S3 with Athena

Chances are if you’re using AWS for anything, you’ve got data in S3. And wouldn’t it be nice to pick that apart and dig through it, where it sits?

Oracle had a feature called “external tables” and MySQL had something similar. Now Amazon is offering that natively within its own cloud universe, with Athena. Don’t worry about how they did it (under the hood it is built on the Presto query engine), because it’s been packaged into a nice easy service for your use!

Related: When you have to take the fall – consulting war stories

3. Business Intelligence with QuickSight

If you’re a data driven startup, and who isn’t these days, you’re going to have a business unit building reports. Tableau or Looker may be in your wheelhouse.

Amazon is obviously seeing the opportunity here, and competing with their own partners. Check out Amazon Quicksight for details.

Read: Is upgrading RDS like a sh*t storm that will not end?

4. Expanded RDS

RDS is obviously a very popular offering. And even though zero downtime is very hard to achieve with RDS, you’ll save plenty on DBAs and admins you don’t have to hire!

If you hadn’t heard, there is now MariaDB support. And with it, there’s a migration path from MySQL to MariaDB as well.

Using MariaDB may bring you performance advantages & improvements. But RDS may blunt some of this by productizing & standardizing things.

You can also now move encrypted snapshots across regions. In my view this isn’t really a new feature, but rather fixing something that was broken before. The previous limitation was really more a symptom of their global network of data centers, than any built feature per se.

Also: Is the difference between dev & ops a four-letter word?

5. Expanded Redshift

As I’ve blogged before, everybody is excited about Redshift these days.

Amazon has introduced some new features.

o better loading of sorted data

This is done behind the scenes to load data quickly, and keep it stored efficiently. No more vacuuming after a big load!

o user & database rate limiting

Limit connections on a per user or per database level. Useful!

o storage estimates on analyze

When you perform the analyze command, you can get storage information so it’s easier to decide datatypes & compression type. Nifty!

Also: Is Redshift outpacing Hadoop as the big data warehouse for startups?


5 things you didn’t know about Dynamodb that are hurting you bad


If you’re like a lot of folks you’re building an application in AWS & using a NoSQL database for persistent data. Dynamodb fits the bill nicely. Little or no ops to worry about, at least in the traditional sense.


However there are knobs to turn & dials to set. Here are a few you should be thinking about.

1. You can replicate across regions

Dynamodb introduced a feature in 2015 called streams. If you come from the relational database world, you can think of streams like a transaction log. It captures before & after image of your data. Couple those with useful lambda functions, and you have triggers that can do anything you want.

Turns out Amazon have been all over this, and have already built a library to do cross-region replication with streams. Pretty cool!

Also: Is aws too complex for small dev teams?

2. You can manage retrieval costs

DynamoDB automatically creates and manages an index on the primary key. But chances are that your application will read data based on other attributes too. You can create secondary indexes on these other attributes to support those access patterns. Without an index DynamoDB would have to scan every item to find your data, but the index can dramatically reduce the amount of data read, lowering retrieval costs and making data retrieval faster too!
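The difference between a scan and an indexed lookup is easy to see in miniature. The sketch below is plain Python, not the DynamoDB API: a list stands in for the table, a dict stands in for a secondary index, and the `city` attribute is invented for the example.

```python
from collections import defaultdict

cities = ["NYC", "SF", "LA", "CHI"]
# 1000 items keyed by user_id; "city" is not part of the primary key.
items = [{"user_id": i, "city": cities[i % 4]} for i in range(1000)]


def scan_by_city(table, city):
    """Without an index, every item has to be examined (and paid for)."""
    examined = 0
    matches = []
    for item in table:
        examined += 1
        if item["city"] == city:
            matches.append(item)
    return matches, examined


def build_index(table, attribute):
    """A secondary index is, roughly, a second copy organized by another attribute."""
    index = defaultdict(list)
    for item in table:
        index[item[attribute]].append(item)
    return index


scan_matches, scan_examined = scan_by_city(items, "NYC")
city_index = build_index(items, "city")
index_matches = city_index["NYC"]          # only the matching items are read

print(scan_examined, len(index_matches))
```

Both paths return the same items, but the scan touches all 1000 while the index touches only the matching quarter. The tradeoff is that the index has to be stored and kept up to date on every write.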

Related: Does Amazon eat it’s own dogfood?

3. You can do SQL Like queries

That’s right, if you thought NoSQL meant no SQL you were only half right. By loading your Dynamodb data into HDFS, you can allow elastic map reduce to have at it. And thus open the door to use HiveQL to query the data the way you wanted to in the first place.

Convoluted? Yes. But this is the brave new world of the cloud!

Read: Is Amazon too big to fail?

4. Partitions are handy & useful

By default DynamoDB partitions your data behind the scenes, because that’s what good distributed databases are supposed to do. It hashes the partition key, the first part of the primary key, to figure out where the data should go. And somewhat like Redshift, you have the option of also using a sort key, which orders the items stored within each partition. This is important. Going across those different partitions brings a lot of latency costs that will surprise you.
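Here is a rough sketch of hash partitioning with a sort key. DynamoDB’s real hash function is internal, so MD5 is only a stand-in here, and `partition_for`, `place_items` and the `user_id` / `created_at` attributes are hypothetical.

```python
import hashlib
from collections import defaultdict


def partition_for(partition_key, partition_count):
    """Map a partition key to a partition number with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def place_items(items, partition_count):
    """Group items by hashed partition key, then order them by (key, sort key)."""
    partitions = defaultdict(list)
    for item in items:
        partitions[partition_for(item["user_id"], partition_count)].append(item)
    for stored in partitions.values():
        stored.sort(key=lambda item: (item["user_id"], item["created_at"]))
    return partitions


items = [
    {"user_id": "ann", "created_at": "2017-03-02", "note": "second"},
    {"user_id": "bob", "created_at": "2017-03-01", "note": "only"},
    {"user_id": "ann", "created_at": "2017-03-01", "note": "first"},
]
partitions = place_items(items, partition_count=4)

# Everything for one partition key lives together, already in sort-key order.
ann_partition = partitions[partition_for("ann", 4)]
ann_notes = [item["note"] for item in ann_partition if item["user_id"] == "ann"]
print(ann_notes)
```

A query for one user’s items hits a single partition and reads them in order. A query that ignores the partition key has to visit every partition, which is where the surprise latency comes from.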

Also: When hosting data on Amazon turned bloodsport

5. Metrics are your partner in performance

CloudWatch provides all sorts of instrumentation for Dynamodb. Read & write activity, throttling, errors & latency are just a few of the things you can see.

Also: Is aws the patient that needs constant medication?


Howto automate MySQL slow query analysis with amazon RDS


If you’ve used relational databases for more than ten minutes, I hope you’ve heard of slow queries. Those are those pesky little gremlins that are slowing down your startup, and preventing scalability you so desperately need.

Luckily there’s a solution. What I’ve found is if I send a report to developers every week, it keeps these issues front and center, for folks that are very busy indeed.

The script below is for RDS, but you can surely modify it if you have a physical server or roll-your-own MySQL box on Amazon. Take a look & enjoy!


1. install percona tools

Percona, as many probably already know, is a wildly successful services firm that supports MySQL and related technologies. They also have a very popular & scalable MySQL distribution by the same name.

Even if you’re not using Percona MySQL, you definitely want to get ahold of the Percona Toolkit. It provides all sorts of useful tools, including the one this article is based on, pt-query-digest.

This tool takes your stock MySQL slow query logfile as input, and summarizes it into a very useful and readable report. Formerly mk-query-digest, it’s now called pt-query-digest. See below.

You can install the percona tools easily by grabbing the repository file and installing that with rpm. From there you can just use yum or apt-get depending on your distribution.

Related: Why a killer title can make or break your content efforts

2. install aws command line tool

Amazon has consolidated all its command line tools into a single one called just “aws”. The options can be a little arcane, and the error messages misleading besides. What’s good though is that it is slightly easier to install & configure.

Do you already use Python? Install it this way:


$ pip install awscli

If not, you’ll need to dig into the aws cli installation instructions further.

Also: Do managers underestimate operational costs?

3. edit .aws/config

After you get the tool installed, you need to set up your environment. I edited a file named /home/shull/.aws/config as follows:


[default]
region = us-east-1
aws_access_key_id = BLIBJZMKLWIL5UTNRBMQ
aws_secret_access_key = MF5J/2z7HmN92lQUrV12ZO/FBXNjDVjL52TNRWsG

Those access_key_id and secret_access_key you can find on your amazon dashboard. Click upper right hand corner under your name, select the menu item “Security Credentials”.

Check out: Are SQL Databases Dead?

4. edit send_query_report.sh

I wrote the script below so you can fairly easily edit it.


#!/bin/bash
#
# Fetch the RDS slow query log, summarize it with pt-query-digest, and mail the report.

# get the rds db instanceID from command line (or crontab) entry
#
AWS_INSTANCE="$1"

# here's where we'll store the latest slowquery.log
#
SLOWLOG=/tmp/rds_slow.log
#SLOWLOG=`/bin/ls -tr /home/shull/*.log | /usr/bin/tail -1`

# fetch slow query log from rds box
# here I always grab the latest one.
#
/usr/local/bin/aws rds download-db-log-file-portion --db-instance-identifier "$AWS_INSTANCE" --output text --log-file-name slowquery/mysql-slowquery.log > "$SLOWLOG"

# query report output
SLOWREPORT=/tmp/reportoutput.txt

# pt-query-digest location
MKQD=/usr/local/bin/pt-query-digest

# run the tool to get analysis report
"$MKQD" "$SLOWLOG" > "$SLOWREPORT"

# today's date in a variable (month/day/year-hour:minute)
# the % signs only need backslash escapes on a crontab line, not inside a script
TODAY=`/bin/date +%m/%d/%Y-%H:%M`
#YESTERDAY=`/bin/date -d "1 day ago" +%m/%d/%Y-%H:%M`

# report subject
SUBJECT="Sean Query Report -- $TODAY"

# recipient
EMAIL="hullsean@gmail.com"

# send an email using /usr/bin/mailx
/usr/bin/mailx -s "$SUBJECT" "$EMAIL" < "$SLOWREPORT"

Note, if you don't have mailx installed, it should be available in your repository. Use apt-get or yum as necessary to get it installed.
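If you’re curious what the digest step is doing, here is a deliberately tiny sketch of the idea: group queries by shape and rank them by total time. It is not how pt-query-digest is implemented, it only handles the plain `# Query_time:` header format shown in the sample, and `summarize_slow_log` is a hypothetical helper.

```python
import re
from collections import defaultdict

QUERY_TIME = re.compile(r"^# Query_time: ([\d.]+)")


def summarize_slow_log(log_text):
    """Return (fingerprint, count, total_seconds) tuples, slowest total first."""
    entries = []                        # each entry: [seconds, [statement lines]]
    for line in log_text.splitlines():
        match = QUERY_TIME.match(line)
        if match:
            entries.append([float(match.group(1)), []])
        elif not entries or line.startswith("#") or line.startswith("SET timestamp="):
            continue                    # header lines and bookkeeping statements
        elif line.strip():
            entries[-1][1].append(line.strip())

    totals = defaultdict(lambda: [0, 0.0])
    for seconds, statement in entries:
        # Replace literal numbers so the same query shape groups together.
        fingerprint = re.sub(r"\b\d+\b", "?", " ".join(statement))
        totals[fingerprint][0] += 1
        totals[fingerprint][1] += seconds
    return sorted(
        ((fp, count, total) for fp, (count, total) in totals.items()),
        key=lambda item: item[2],
        reverse=True,
    )


sample_log = """\
# Time: 2017-05-18T20:28:41.000000Z
# User@Host: app[app] @ [10.0.0.5]
# Query_time: 4.500000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 500000
SET timestamp=1495139321;
SELECT * FROM orders WHERE customer_id = 42;
# Time: 2017-05-18T20:29:02.000000Z
# User@Host: app[app] @ [10.0.0.5]
# Query_time: 1.250000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 90000
SET timestamp=1495139342;
SELECT COUNT(*) FROM users;
# Time: 2017-05-18T20:30:15.000000Z
# User@Host: app[app] @ [10.0.0.5]
# Query_time: 3.500000  Lock_time: 0.000100 Rows_sent: 12  Rows_examined: 480000
SET timestamp=1495139415;
SELECT * FROM orders WHERE customer_id = 7;
"""

report = summarize_slow_log(sample_log)
for fingerprint, count, total in report:
    print(f"{total:6.2f}s  x{count}  {fingerprint}")
```

The real tool does far more (percentiles, lock time, rows examined, smarter fingerprinting), which is why the script above shells out to it instead.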

Also: Is high availability overrated & near impossible to deliver?

5. Add to crontab

After you've tested the above script from command line, you will want to add it to a weekly cron job. Voila, automation! Don't forget to chmod +x to make it executable. 🙂


00 09 * * 5 /home/shull/send_query_report.sh seandb

Read: Are MySQL DBA's impossible to find?


If you use MySQL in the Amazon cloud, you need to ask yourself this question


Are you serious about backups?

If you’re just using Amazon EBS snapshots, that may not be sufficient. There’s a good chance it won’t protect you against your next data loss.

That’s why I like to have a few different types of backups

Also: 5 more things deadly to scalability

Protect against operator error

mysqldump is a tool every DBA is familiar with. Same as a hotbackup or snapshot you say? Just more labor? Not true.

A dump allows you to restore one table, or one schema. That’s why they’re also known as logical backups. What’s more you can edit the file, remove indexes, change object names, or datatypes. All these can be essential in the screwy and unpredictable event of a real world outage.
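Because a dump is just text, pulling one table out of it is simple. Here is a sketch that relies on the “Table structure for table” comment lines mysqldump writes before each table. The dump text is a trimmed, made-up sample, and `extract_table` is a hypothetical helper.

```python
import re

TABLE_MARKER = re.compile(r"^-- Table structure for table `(?P<name>[^`]+)`")


def extract_table(dump_text, table):
    """Return only the lines of a mysqldump file that belong to one table."""
    keep = False
    kept = []
    for line in dump_text.splitlines():
        marker = TABLE_MARKER.match(line)
        if marker:
            # A new table section starts; keep it only if it is the one we want.
            keep = marker.group("name") == table
        if keep:
            kept.append(line)
    return "\n".join(kept)


dump = """\
--
-- Table structure for table `users`
--
CREATE TABLE `users` (`id` int);
--
-- Dumping data for table `users`
--
INSERT INTO `users` VALUES (1),(2);
--
-- Table structure for table `orders`
--
CREATE TABLE `orders` (`id` int);
INSERT INTO `orders` VALUES (10);
"""

users_only = extract_table(dump, "users")
print(users_only)
```

You can then edit the extracted section by hand, say to rename the table, before loading it back in next to the damaged original.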

Expect the unexpected!

Read: Why devops talent is in short supply

Test those backups regularly

If you haven’t actually tried to restore, you really don’t know if you have everything. Did you backup stored procedures & database code? How about grants? Database events? How about cronjobs? What about the my.cnf file? And your replication configuration?

Yes there are a lot of little pieces, and testing your backups by rebuilding everything is an attempt to poke holes in your plan, and hit issues before d-day!

Related: MySQL interview guide for managers and candidates alike

Replication isn’t a backup

Replication is getting better and better in MySQL. It used to fail regularly. MyISAM was very unpredictable. But even in the comfortable realm of InnoDB, there can still be data drift. If you’re on MySQL 5.0 or 5.1, you should consider performing regular checksums. These test the integrity of data and compare what’s actually in master & slave. See Bulletproofing MySQL replication with checksums.
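The checksum idea itself is simple: hash the rows on each side and compare. Below is a toy sketch with in-memory rows standing in for the master and slave copies of a table. Real tools such as pt-table-checksum work in chunks against live servers, and `table_checksum` here is just a hypothetical helper.

```python
import hashlib


def table_checksum(rows):
    """Hash all rows in a stable order so two copies can be compared cheaply."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


primary = [(1, "ann", 1200), (2, "bob", 800), (3, "cat", 500)]
replica_ok = [(3, "cat", 500), (1, "ann", 1200), (2, "bob", 800)]      # same data, different order
replica_drifted = [(1, "ann", 1200), (2, "bob", 850), (3, "cat", 500)]  # one value has drifted

in_sync = table_checksum(primary) == table_checksum(replica_ok)
drifted = table_checksum(primary) != table_checksum(replica_drifted)
print(in_sync, drifted)
```

A single changed value is enough to change the checksum, which is exactly the silent drift that replication status alone won’t show you.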

Read: Why high availability is so very hard to deliver

Have you considered security around your backup files?

While you’re thinking about backups, make sure the files themselves are secure. Remember they contain your crown jewels. Hopefully individual data that’s sensitive is encrypted, but still you should secure their final resting place as well.

If you’re using S3, consider encrypting the file before shipping it up to the bucket.

Read this: Why a four letter word divides dev and ops


Why startups need techops


I was at a talk recently on node.js. Even if I’m not working with a technology directly, it’s exciting to see what’s out there, and node.js is bringing some hyper fast performance to a certain category of web applications.

During the keynote, the speaker mentioned a service to deploy applications on. I can’t name names unfortunately but it was a cloud solution on top of which you could deploy your application. Go this route and you can do without an operations team. Avoid overhead of hiring ops, he claimed. And hey, then you can hire more developers!

To be fair I’ve heard much of the same thing at DBA or linux conferences. I can’t count the number of stories that start with “what some idiot developer did that took down our production systems…”.

Yes, it seems dev & ops are still just a tad bit adversarial.


1. My little known origins as a developer

Many colleagues and clients I’ve met in the New York City startup industry know me primarily as an operations & scalability guy. I tune databases, infrastructure and components to make things lightning fast.

I spent my earliest years at university on the computer lab operations staff. We watched and managed, made sure level zero backups were taken care of, and moved the tapes. Directly after college, I started at a software firm. I did C++ GUI development on the Mac, using the toolbox libraries with Metrowerks Codewarrior. I built split windows, and scroll bars, and displayed rows of data with nice resizable columns. All this wasn’t built into the class library, so for a lot of it we needed to roll our own solution.

We always had a long list of features coming from the business units. I also fielded many support calls, often from the Windows platform as the code there hadn’t been managed and built as carefully. But that too was instructive as you could feel the pain of customers’ day-to-day challenges. It also illustrated the tradeoffs between new code and features, and existing bug fixes and support.

Also: Why generalists are better at scaling the web

2. A trip through the dot-com bubble as Oracle DBA

Through a circuitous path, I moved to New York in the mid-nineties and joined a startup. I had the opportunity to wear a lot of hats there, and apply my computer lab and Linux operating systems experience to the challenge of managing Oracle. I got a lot more involved with operations quickly.

As the dot-com bubble grew, I saw a hot and growing demand for Oracle DBAs as most startups used Oracle, but the talent was in short supply. In one startup 80 million dollars was on the line as performance hobbled the website, and investors feared the worst.

Read: Why the Twitter IPO made a shocking admission about scalability

3. Different priorities & mandates

I remember working at Starmedia, a media darling at the time. I was analyzing the database & server systems, and finding some code & jobs running during peak daytime hours. Management claimed that could not be the case. Yet for the next days and weeks I saw the same jobs running. I held strong and spoke truth to power as they say. That’s not always easy when you have a lot of investors, screaming CTOs and 100+ hour weeks. But eventually the source of the job was located, and disabled. And the website returned to its speedy self.

These experiences though do underline in my mind the different priorities and focus that developers and operations staff have.

Techops, system administrators & DBAs are typically averse to change. They fight it tooth and nail. That isn’t because they like to be curmudgeons though. They are typically very concerned about the business, but from a dramatically different perspective of stability and reliability, even at 2am. They are concerned about the longevity, consistency, and durability of data.

Developers on the other hand have a different mandate. They are responsible for new business features, solutions to business requirements. Rapid prototyping and reactive or agile development are embraced because they mean you can deliver quicker to the business.

Crucially, both of these folks care very much for the business. Just with very different priorities.

Check this: Why AirBNB didn’t have to fail

4. Can developers do operations for you?

In a lot of small startups, the initial focus is obviously on building a product. That’s the build phase, and not surprisingly you hire a lot of developers. As you should. But as you grow you may find the operational tasks that are defaulting to one or more developers are taking more and more of their time. As your customer base grows and you’ve seen your first few spikes, it’s time to start thinking about hiring for a real ops role.

In summary, yes they can, but perhaps not well.

Related: How to hire a developer that you can work with

5. Volume discount, made to order or instant coffee

You may choose to go with instant coffee, by bringing someone in-house. You may find the right talent is hard to find. I wrote about this: Why techops and DBAs are in short supply.

Alternatively you may prefer a volume discount from one of the larger remote DBA or managed support solutions such as Oracle’s, Pythian or Percona. These guys all provide great service, but keep in mind how big of a fish you are. You’ll likely work through a ticketing system, and in some cases different engineers will look at your systems at different times. You will likely need either a very hands-on technical CTO or other in-house person to take ownership, and manage things closely.

The third option is a made-to-order coffee. Yes you pay more for Toby’s, Blue Bottle, or Ninth Street Espresso but you get what you pay for as they say. A boutique shop or independent consultant will provide a lot more hand holding, help your internal staff get up to speed, and communicate intimately about the process. If you’re a more non-technical CTO, or you’re very busy running the business, this solution may make a lot of sense for you.

Also: Why cloud detractors need a history lesson

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters