Tag Archives: ec2

Does AWS have a dirty little secret?

I was recently talking with a colleague of mine about where AWS is today. Obviously companies are migrating to EC2 & the cloud rapidly. The growth rates are staggering.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

The question was…

“What’s good and bad with Amazon today?”

It’s an interesting question. I think there are some dirty little secrets here, but also some very surprising bright spots. This is my take.

1. VPC is not well understood  (FAIL)

This is the biggest one in my mind.  Amazon’s security model is all new to traditional ops folks.  Many customers I see deploy in “classic EC2”.  Others deploy haphazardly in their own VPC, without a clear plan.

The best practice is to have one or more VPCs, each with private & public subnets.  Put databases in the private subnet, webservers in the public one.  Then create a jump box in the public subnet, and funnel all ssh connections through there: allow any source IP, use individual user accounts for authentication & auditing (only on this box), then use google-authenticator for two-factor at the command line.  This also provides an easy way to decommission accounts, and lock out users who leave the company.
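If you haven’t built one before, the bones of a VPC are just a few API calls.  Here’s a minimal sketch using the AWS CLI, with hypothetical resource IDs & example CIDR blocks (yours will differ):

$ # one VPC, a public subnet (webservers & jump box), a private subnet (databases)
$ aws ec2 create-vpc --cidr-block 10.0.0.0/16
$ aws ec2 create-subnet --vpc-id vpc-abc123 --cidr-block 10.0.1.0/24
$ aws ec2 create-subnet --vpc-id vpc-abc123 --cidr-block 10.0.2.0/24
$ # only the public subnet gets a route out through an internet gateway
$ aws ec2 create-internet-gateway
$ aws ec2 attach-internet-gateway --internet-gateway-id igw-abc123 --vpc-id vpc-abc123

From there, reaching a database box is always two hops: ssh -A you@jumpbox, then ssh on to the private IP.  Decommissioning someone means deleting one account on one box.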

However most customers have done little of this, or a mixture but not all of it.  So GETTING TO BEST PRACTICES around VPC would mean deploying a VPC as described, then moving each and every one of your boxes & services over there.  Imagine the risk to production services.  Imagine the chances of error, even if you’re using Chef or your own standardized AMIs.

Also: Are we fast approaching cloud-mageddon?

2. Feature fatigue (FAIL)

Another problem is a sort of “paradox of choice”.  Amazon is releasing so many new offerings, so quickly, that few engineers know them all.  So you find a lot of shops implementing things wrong because they didn’t understand a feature, or didn’t realize AWS had already solved the problem.

OpenRoad comes to mind.  They’ve got media files on the filesystem, when S3 is plainly Amazon’s purpose-built service for this.  

Is AWS too complex for small dev teams & startups?

Related: Does Amazon eat its own dogfood? Apparently yes!

3. Required redundancy & automation  (FAIL)

The model here is what Netflix has done with Chaos Monkey.  They literally knock machines offline to test their setup.  The problem is detected, and new hardware is brought online automatically.  Deploying across AZs is another example.  As Amazon says: we give you the tools, it’s up to you to implement the resiliency.

But few firms do this.  They’re deployed on Amazon as if it’s a traditional hosting platform.  So they’re at risk in various ways.  Of Amazon outages.  Of hardware problems under the VMs.  Of EBS network issues, of localized outages, etc.

Read: Is Amazon too big to fail?

4. Lambda  (WIN)

I went to the serverless conference a week ago.  It was exciting to see what is happening.  It is truly the *bleeding edge* of cloud.  IBM, Azure & Google all have a serverless offering now.

The potential here is huge.  Eliminating *ALL* of the server management headaches, from packages to config management & scaling, hiding all of that could have a huge upside.  What’s more it takes the on-demand model even further.  You have no compute running idle until you hit an endpoint.  Cost savings could be huge.  Wonder if it has the potential to cannibalize Amazon’s own EC2 …  we’ll see.
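To make the on-demand point concrete, here’s a sketch of invoking a hypothetical function named thumbnailer via the AWS CLI.  Until this call arrives, no compute is running at all:

$ aws lambda invoke --function-name thumbnailer --payload '{"key":"photo.jpg"}' out.json
$ cat out.json

You’re billed for the milliseconds of execution, not for a server sitting idle.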

Charity Majors wrote a very good critical piece – WTF is Operations? #serverless

Also: Is the difference between dev & ops a four-letter word?

5. Redshift  (WIN)

Seems like *everybody* is deploying a data warehouse on Redshift these days.  It’s no wonder, because they already have their transactional database, their web backend, on RDS of some kind.  So it makes sense that Amazon would build an offering for reporting.

I’ve heard customers rave about reports that took 10 hours on MySQL running in under a minute on Redshift.  It’s not surprising, because MySQL wasn’t built for the size of servers it’s being deployed on today, so it doesn’t make good use of all that memory.  Even with SSD drives, query plans can execute badly.

Also: Is there a better way to build a warehouse in 2016?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is AWS too complex for small dev teams & startups?

I was discussing a server outage with a colleague recently. AWS had done some confusing things, and the team was rallying to troubleshoot & fix.

He made an offhand comment that caught my attention…


AWS is too complex for small dev teams. I’d recommend we host in a traditional datacenter.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

It’s an interesting point. For all the fanfare over Amazon, lost in the shuffle is the staggering complexity that we’re taking on. For small firms, this is a cost that’s often forgotten when we drink the on-demand Kool-Aid that is EC2.

Here are my thoughts…

1. Over 70 services offered

Every time I login to the AWS console there’s a new service offering. Lambda & serverless computing. CodeDeploy, Redshift, EMR, VPCs, developer tools, IoT, the list goes on. If you haven’t enabled MFA on your IAM accounts, you’re not alone!

Also: Is Amazon too big to fail?

2. Still complex to build high availability

The song I hear out of Amazon is: we offer all the components for a high availability infrastructure. Multiple availability zones, regions, load balancers, autoscaling, geo & latency dns routing. What’s more, companies like Netflix have open sourced tools to help.
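To be fair, the building blocks are there. Spreading a webserver fleet across availability zones, for example, is a couple of CLI calls. A minimal sketch, with a hypothetical AMI & made-up group names:

$ aws autoscaling create-launch-configuration --launch-configuration-name web-lc \
    --image-id ami-abc123 --instance-type m3.medium
$ # two AZs, so losing one datacenter doesn't take down the whole fleet
$ aws autoscaling create-auto-scaling-group --auto-scaling-group-name web-asg \
    --launch-configuration-name web-lc --min-size 2 --max-size 6 \
    --availability-zones us-east-1a us-east-1b

The typing is easy. Knowing that this is the shape your infrastructure should take is the part that gets missed.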

But at a lot of startups that I see, all these components are not in use, nor are they well understood. Many admins are still using Amazon like an old-school datacenter. And that’s not good.

Sometimes it seems that AWS is a patient in need of constant medication.

Related: Are we fast approaching cloud-mageddon?

3. Need a dedicated devops

As AWS becomes more complex, and the offering more robust, so too grows the need for dedicated ops. If your devs are already out of bandwidth, but you don’t quite have enough need for a full-time resource, a consultant may be an option. Round out the team & keep costs manageable.

If you’re looking for an AWS solutions architect, we can help!

Check out: Does Amazon eat its own dogfood?

4. Orchestration involves many moving parts

Infrastructure as code offers the promise of completely versioning all your servers, configurations and changes. From there we can apply test driven development & bring a more professional level of service to our business. That’s the theory anyway.

In practice it brings an incredible number of new toolsets to master, and a more complex stack besides. All those components can have bugs and need troubleshooting. This sometimes just kicks the can down the road, moving the complexity elsewhere.

It’s not clear that for smaller shops, all this complexity is manageable.

Also: 5 things toxic to scalability

5. Troubleshooting failed deployments

I was looking at a problem with a broken deploy recently. Turns out a developer had copied & pasted a code solution off the internet, possibly from a tutorial, and broke deployments to staging.

Yes, perhaps this was avoidable, and more checks & balances can fix it. But my thought is that continuous integration & continuous deployment are not a panacea. More complexity brings a more complex web to unweave.

I sometimes wonder if we aren’t fast approaching cloud-mageddon?

Read: Why Airbnb didn’t have to fail?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Does Amazon eat its own dog food (ahem…) or drink its own champagne?

I was flipping through the AWS reddit channel and found this excellent presentation from RE:Invent by Laura Grit. She’s in charge of Amazon Retail, and worked very closely with teams on migrating to AWS. She goes in-depth on what that cost in terms of development, what it saved in terms of unused capacity, and, surprisingly, operational headaches.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Laura’s a great speaker. I was surprised to find that Amazon Retail’s migration was similar to many of the customers I’ve worked with in New York. Often they take a hybrid approach where Direct Connect is key, allowing them to move over in a measured way.

What’s more she talks about how EC2 instances have different performance characteristics & applications typically need to be tuned for that world.

I learned a lot more, here are the highlights…

1. Hybrid cloud was key

Around 11:00 in the video she talks about AWS Direct Connect & VPC. These two technologies allow you to leverage AWS as a hybrid cloud, connecting to your existing datacenter. Scale elastically, but migrate in steps.

For example Amazon Retail migrated only the webserver fleet, in isolation.

Also: Is Amazon too big to fail?

2. Excite business & developers both

Around 18:20 …

“Moving the webserver fleet not only got the business excited about the cost savings & our ability to scale linearly, but also got developers excited about the operational load decrease that they had to burden.

Once benefits of this were shown to the rest of the company it actually jump started a wave of migrations to ec2 from inside amazon retail. And we found from a program perspective this is important. To find early migrations that benefit both the business & the developers because then they are both working together to figure out how to move their services to AWS.”

And she also pointed out an interesting bit about cultural change…

“You may choose to not migrate the simplest service from inside your company, but instead one that will create a cultural change in the company & force more migrations automatically to AWS.”

Related: Are SQL Databases dead?

3. Expect application changes

Flip through to 27:47 and she talks about application changes for the new environment of the cloud.

“Don’t expect migrations to require no changes to your applications…
The webserver fleet was not lift & shift”

Also: Why Dropbox didn’t have to fail

4. Cloud not a panacea

Fast forward over to 37:10 and you’ll hear Laura talk about technical debt. That’s big.

“The cloud is not a universal panacea. It can’t cover up for messy engineering practices.
An example of this is availability. Design for failure is a fundamental design principle of Amazon.”

Also: Are generalists better at scaling the web?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is AWS the patient that needs constant medication?

I was just reading High Scalability about Why Swiftype moved off Amazon EC2 to Softlayer and saw great wins!

We’ve all heard by now how awesome the cloud is. Spinup infrastructure instantly. Just add water! No up front costs! Autoscale to meet seasonal application demands!

But less well known or even understood by most engineering teams are the seasonal weather patterns of the cloud environment itself!

Join 28,000 others and follow Sean Hull on twitter @hullsean.

Sure there are firms like Netflix, who have turned the fickle cloud into a model of virtue & reliability. But most of the firms I work with every day have moved to Amazon as though it’s regular bare-metal, and encountered some real problems in the process.

1. Everyday hardware outages

Many of the firms I’ve seen hosted on AWS don’t realize that servers fail so often. Amazon actually chooses cheap commodity components as a cost-savings measure. The assumption is that resilience should be built into your infrastructure using devops practices & automation tools like Chef & Puppet.

The sad reality is most firms provision the usual way, through the dashboard, with no safety net.

Also: Is your cloud speeding for a scalability cliff

2. Ongoing network problems

Network latency is a big problem on Amazon, and it will affect you more than in a traditional datacenter. One reason is you’re most likely sitting on EBS for your storage. EBS? That’s Elastic Block Storage, Amazon’s NAS solution. Your little cheapo instance has to cross the network to get to its storage. That *WILL* affect your performance.

If you’re not already doing so, please start using their most important & easily missed performance feature – provisioned IOPS.
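Provisioned IOPS means you pay for a guaranteed rate of disk operations instead of gambling on your noisy neighbors. A minimal sketch, assuming an io1 volume created in the same AZ as your instance:

$ # a 100GB volume guaranteed 1000 IOPS, versus best-effort standard EBS
$ aws ec2 create-volume --volume-type io1 --iops 1000 --size 100 --availability-zone us-east-1a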

Related: The chaos theory of cloud scalability

3. Hard to be as resilient as Netflix

We’ve by now heard of firms such as Netflix building their Chaos Monkey to actively knock out servers, in an effort to test their self-healing infrastructure.

From what I’m seeing at startups, most have a bit of devops in place, a bit of automation, such as autoscaling around the webservers. But little in terms of cross-region deployments. What’s more their database tier is protected only by multi-az or just a read-replica or two. These are fine for what they are, but will require real intervention when (not if) the server fails.

I recommend building a browse-only mode for your application, to eliminate downtime in these cases.

Read: 8 questions to ask an aws expert

4. Provisioning isn’t your only problem

But the cloud gives me instant infrastructure! I can spinup servers & configure components through an API! Yes, this is a major benefit of the cloud, compared to 1-2 hours to provision in traditional environments like Softlayer or Rackspace. But weigh that against failure rates: in a traditional datacenter you might see a hardware outage every couple of years, while Amazon’s hardware may fail a couple of times a year, more if you’re unlucky.

Meanwhile you’re going to deal with seasonal weather problems *INSIDE* your datacenter. Think of these as swarms of customers invading your servers, like a DDOS attack, but self-inflicted.

Amazon is like a weak immune system attacking itself all the time, requiring constant medication to keep the host alive!

Also: 5 Things toxic to scalability

5. RDS is going to bite you

Besides all these other problems, I’m seeing more customers build their applications on the managed database solution, MySQL RDS. I’ve found RDS terribly hard to manage. It introduces downtime at every turn, where standard MySQL would incur none.

In my experience, upgrading RDS is like a shit-storm that will not end!

Also: Does open source enable the cloud?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

If you use MySQL in the Amazon cloud, you need to ask yourself this question

Join 25,000 others and follow Sean Hull on twitter @hullsean.

Are you serious about backups?

If you’re just using Amazon EBS snapshots, that may not be sufficient. There’s a good chance it won’t protect you against your next data loss.

That’s why I like to have a few different types of backups.

Also: 5 more things deadly to scalability

Protect against operator error

mysqldump is a tool every DBA is familiar with. Same as a hotbackup or snapshot you say? Just more labor? Not true.

A dump allows you to restore one table, or one schema. That’s why they’re also known as logical backups. What’s more you can edit the file, remove indexes, change object names, or datatypes. All these can be essential in the screwy and unpredictable event of a real world outage.
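Here’s what that flexibility looks like in practice. A quick sketch, with a hypothetical sean_schema database & customers table:

$ # dump one schema, or just one table
$ mysqldump -u admin -p sean_schema > sean_schema.sql
$ mysqldump -u admin -p sean_schema customers > customers.sql
$ # restore that single table, without touching anything else
$ mysql -u admin -p sean_schema < customers.sql

Try doing that with a block-level snapshot.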

Expect the unexpected!

Read: Why devops talent is in short supply

Test those backups regularly

If you haven’t actually tried to restore, you really don’t know if you have everything. Did you backup stored procedures & database code? How about grants? Database events? How about cronjobs? What about the my.cnf file? And your replication configuration?

Yes there are a lot of little pieces, and testing your backups by rebuilding everything is an attempt to poke holes in your plan, and hit issues before d-day!
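Covering those pieces takes a few extra mysqldump flags, and the config & cron items live outside MySQL entirely. A sketch:

$ # routines & events are NOT included by default; triggers are
$ mysqldump -u admin -p --all-databases --routines --triggers --events > full.sql
$ # my.cnf and crontabs won't be in any dump, grab them separately
$ cp /etc/my.cnf my.cnf.bak
$ crontab -l > crontab.bak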

Related: MySQL interview guide for managers and candidates alike

Replication isn’t a backup

Replication is getting better and better in MySQL. It used to fail regularly. MyISAM was very unpredictable. But even in the comfortable realm of InnoDB, there can still be data drift. If you’re on MySQL 5.0 or 5.1, you should consider performing regular checksums. These test the integrity of data, comparing what’s actually on master & slave. See Bulletproofing MySQL replication with checksums.
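The standard tool here is pt-table-checksum from Percona Toolkit (formerly Maatkit’s mk-table-checksum). A sketch, assuming the toolkit is installed, with hypothetical host & schema names:

$ # checksum tables on the master; differences surface on the slaves
$ pt-table-checksum h=master.example.com,u=admin,p=password --databases sean_schema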

Read: Why high availability is so very hard to deliver

Have you considered security around your backup files?

While you’re thinking about backups, make sure the files themselves are secure. Remember they contain your crown jewels. Hopefully individual data that’s sensitive is encrypted, but still you should secure their final resting place as well.

If you’re using S3, consider encrypting the file before shipping it up to the bucket.
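A sketch of that pipeline, assuming openssl, a passphrase file that stays out of the backup itself, and a hypothetical bucket name:

$ # encrypt before the file ever leaves the box, then ship it to S3
$ mysqldump -u admin -p --all-databases | gzip | \
    openssl enc -aes-256-cbc -salt -pass file:/root/.backup_pass -out backup.sql.gz.enc
$ aws s3 cp backup.sql.gz.enc s3://my-backup-bucket/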

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

5 cloud ideas that aren’t actually true

Join 20,000 others and follow Sean Hull’s scalability, startup & innovation content on twitter @hullsean.

Cloud computing is heralding us into a wonderful era where computing can be bought in small increments, like a utility. This changes the whole way we plan and manage budgets, and it accelerates startups, making them more agile.

But it’s not all wine & roses up there. I’ve heard a few refrains from clients over the years, and thought I’d share some of the most common.

1. Scaling is automatic

Rather recently I was working with a client on building some sophisticated reports. They needed to slice & dice customer data, over various time series, and summarize with invoices & tracking data. Unfortunately their dataset was large, in the half terabyte range.


Client: Can we just load all this data into the cloud?
Me: Yes, we can do that. We can build a system in the Amazon public cloud that supports large datasets.
Client: I want it to scale easily. So we won’t have these slow reports. And as we add data, it’ll just manage it easily for us.
Me: Well it’s a little bit more complicated than that, unfortunately.

Unfortunately this is a rather familiar conversation, one I have quite often. A lot of the press around cloud scalability centers on auto-scaling, Amazon’s renowned & superb feature. Yes, it’s true you can roll out webservers to scale this way, but that’s not the end of the story. Typically web applications have a lot of components, from caching servers, to search servers, and of course their backend datastore.

But can we scrap our relational database, such as MySQL, and go with one that scales out of the box, like Riak, Cassandra or DynamoDB?

Those NoSQL solutions are built to be distributed from the start, it’s true. And they lend themselves to that type of architecture. However, if you’ve built up a dataset in MySQL or Oracle, and more so an application around it, you’ll have to migrate the data into the NoSQL solution. That process will take some time.

Like teaching a fish to fly, it may take some time. They do well in water, but evolution takes a bit longer.

Related: RDS or MySQL 10 use cases

2. Disaster recovery is free

In the traditional datacenter, when you want DR, you set up a parallel environment. Hopefully not in the same room, the same city, or even on the same coast. Preferably you do so in a different region. What you can’t get around is dishing out cash for that second datacenter. You need the servers, just in case.

In the cloud, things are different. That’s why we’re here, right? In Amazon you have regions already set up & available for plug-n-play use. Set up your various components, servers & software, and configure everything. Once you’ve verified you can failover to the parallel environment, you can just turn off all those instances. Great, no big charges for all that iron you’d pay for just to keep the rooms warm in an old-school datacenter. Or do you?

As it turns out, since you don’t have this environment running all the time, you’ll want to test it more often, running fire drills to bring the servers back online. That’ll incur some costs in terms of manpower. You’ll also want some scripts to start those servers up, and/or some detailed documentation on how to do that. And don’t lose that documentation either, will you?

You may also want to build some infrastructure as code unit tests. Things change, code checkouts evolve, especially in the agile & continuous integration world. Devops beware!

Read this: Why a killer title can make or break your content efforts

3. Machines are fast

Fast, fast, fast. That’s what we expect, things keep getting faster, right? Hard to believe then that the world of computing took a big step backward when it jumped into the cloud. Something similar happened when we jumped to commodity Linux a decade ago.

In Amazon, it’s a multi-tenant world. And just like apartment buildings, popular restaurants, or busy highways, you must share. When things are quiet you may have the road to yourself, but it’ll never be as quiet as a dirt road in the country!

Amazon is making big strides though. They now offer memory optimized & storage optimized instances. An even bigger development is the addition of the most important feature for performance & scalability, provisioned IOPS. That said, the network & EBS can still be a real bottleneck.

Also: What is a relational database & why is it important?

4. Backups aren’t necessary

I’ve experienced a few horror stories over the years. I wrote about one noteworthy example in When fat fingers take down your business.

True, EBS snapshots make backing up your whole server, well, a snap! That said, a few extra steps have to happen (flush the filesystem & lock tables) to make this work for a relational database like MySQL or Oracle. And suddenly you have a verification step to perform as well. You see, no backups are valid until they’ve been restored, remember?
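The dance looks roughly like this. A sketch, assuming an xfs data volume and a hypothetical volume ID; note the FLUSH lock must be held in its own open session until the snapshot has started:

mysql> FLUSH TABLES WITH READ LOCK;
$ # from another shell: freeze the filesystem, kick off the snapshot, thaw
$ xfs_freeze -f /data
$ aws ec2 create-snapshot --volume-id vol-abc123 --description "mysql backup"
$ xfs_freeze -u /data
mysql> UNLOCK TABLES;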

But even with these wonderful disk snapshots, you’ll still want to do database dumps, and perhaps table dumps. Operator error, deleting the wrong data, or dropping the wrong tables, will always be a risk. Ignore backups at your own peril!

Check this: Why CTOs underestimate operational costs

5. Outages won’t happen

In an ideal world, everything is redundant, and outages will be a thing of the past. We’ll finally reach five nines uptime and devops everywhere will be out of work. :)

It’s true that Amazon provides all the components to build redundancy into your architecture, and very cutting edge firms that have taken Netflix’s approach with Chaos Monkey are seeing big improvements here. But AirBNB did fail, and at root it was an Amazon outage, one that needn’t have taken them down.

Read: Why Oracle won’t kill MySQL

Get more. Monthly insights about scalability, startups & innovation. Our latest Are SQL Databases Dead?

Connect to MySQL in the Amazon Public Cloud

Troubleshooting MySQL on Amazon can be a real test of patience. There are quite a few different things to watch out for in terms of connectivity & networking. Sometimes a checklist can help.

Join 16,000 others and follow Sean Hull on twitter @hullsean.

Here’s my exhaustive list of things that can block you.

1. Be sure to create users & grants

Chances are you did something like this to create your user:


mysql> CREATE USER 'sean'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON sean_schema.* TO 'sean'@'localhost' WITH GRANT OPTION;

But that won’t help you when connecting from a remote Amazon box. So what to do? Here’s an example:


mysql> CREATE USER 'sean'@'10.10.%' IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON sean_schema.* TO 'sean'@'10.10.%' WITH GRANT OPTION;

You may need to make your source IP wildcard *more* aggressive. For example consider '10.%'. You *may* even go with '%', which allows *all* source IPs. This may sound dangerous, but if you use a tight security group (see item #3 below), you can still be safe.

Related: Why Oracle Won’t Kill MySQL

2. Make sure iptables is not a problem

iptables is a Linux service that acts like a private firewall on each server. Some AMIs will have it enabled by default. If you’re having trouble like I did, this can definitely trip you up. That’s because your connection will fail silently, without telling you: hey, the OS won’t let me into that port!

If you are a networking pro you’ve probably already fiddled with iptables. Feel free to add specific rules, and keep it turned on. However I’d recommend just disabling it completely, and using your Amazon security groups to protect your ports.


$ /etc/init.d/iptables stop
$ chkconfig --list iptables
iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
$ chkconfig --del iptables
$ chkconfig --list iptables
service iptables supports chkconfig, but is not referenced in any runlevel (run 'chkconfig --add iptables')

Also: Are SQL Databases Dying Out?

3. Test & verify Amazon security group settings

Security groups in Amazon can be tricky. I recommend the following:

o create a security group webserver_group
– allow port 80 from 0.0.0.0/0
– allow port 443 from 0.0.0.0/0
– allow port 22 from <your office IP range>

o create a security group db_group
– allow port 22 from <your office IP range>
– allow 3306 from <the webserver_group security group>

What’s happening here? We can’t specify a fixed set of IP addresses, because they can change in Amazon. So essentially what we’ve done is say *any* request from servers in our Amazon account which sit in the webserver_group security group can connect to port 3306. Pretty cool right?

This means we’re pretty locked down. No internet connections to 3306, so we can be a little looser (see item #1 above) about our grants and source IPs.
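Here’s what those db_group rules look like from the command line. A sketch, with an example office CIDR standing in for your real one:

$ # only members of webserver_group can reach mysql
$ aws ec2 authorize-security-group-ingress --group-name db_group \
    --protocol tcp --port 3306 --source-group webserver_group
$ # ssh only from the office subnet
$ aws ec2 authorize-security-group-ingress --group-name db_group \
    --protocol tcp --port 22 --cidr 203.0.113.0/24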

What if you want to use your GUI tools to hit your Amazon-hosted MySQL boxes? Say you like to use MySQL Workbench, Navicat or Toad to connect. One way to do this is to configure your db_group to allow 3306 from your office subnet. Then anyone VPN’d into your office will be able to use the tools they like.

Another option is to use Amazon VPC for your servers. You’ll set up an Amazon Virtual Private Gateway, which is a direct VPN connection between Amazon’s datacenter and your own. This can be a messy process, and you’ll want your network admin to help. Once it’s set up, Amazon boxes appear to sit on your office or datacenter network. Cool stuff!


If the port is still blocked somewhere along the way, the connection will fail like this:

$ mysql -h xxx.xxx.xxx.xxx -u admin -p
Enter password:
ERROR 2003 (HY000): Can't connect to MySQL server on 'xxx.xxx.xxx.xxx'

Read this: Why are MySQL experts in such short supply?

4. MySQL network settings

If MySQL is bound to the wrong IP address you can have real problems. First be sure skip_networking is OFF. If it is ON, change it in /etc/my.cnf & restart MySQL.


mysql> show variables like 'skip_net%';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| skip_networking | OFF |
+-----------------+-------+
1 row in set (0.00 sec)

The other MySQL setting that can be problematic is bind-address. First check what it is set to:


$ cat /etc/my.cnf | grep bind
bind-address=127.0.0.1

This isn’t going to allow remote connections. In Amazon, however, your IP address may change upon reboot. So there is a special setting to allow binding to any IP:


bind-address=0.0.0.0

Related: Bulletproofing MySQL Replication with Checksums

5. Installing mysql client & telnet for troubleshooting

You have two options for troubleshooting on the webserver side. If you go straight to the mysql command line, you may get blocked if the network settings & security groups aren’t configured right. So use telnet first.


$ yum install -y telnet

$ telnet 10.10.10.1 3306
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
4
5.1.71??gu9Y6B'/y9Oay`QV

If you don't get a response, it's not an issue with users or grants, but rather that the port isn't open. Check iptables, check bind-address and check security groups.

Check this: Top MySQL DBA Interview Questions

6. SELinux related issues

SELinux will do a lot of good, if managed properly. However if you're not aware of its existence, it can be very, very frustrating. Symptoms can be as vague as allergies, a cold or flu. It can monitor files, and prevent MySQL from being able to write where it needs to.
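Before you chase ghosts, check whether SELinux is even on. A quick sketch:

$ # enforcing, permissive or disabled?
$ getenforce
Enforcing
$ # look for denials involving mysqld
$ grep mysqld /var/log/audit/audit.log
$ # temporarily flip to permissive, to rule SELinux in or out
$ setenforce 0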

Read this: Migrating MySQL to Oracle

7. RPM & later CentOS yum repo install conflicts

I had real problems doing a custom install for a customer. They didn't want to use a repository, for various reasons, but preferred downloading RPMs. There were a few other customizations which were tripping things up.

Based on all the connectivity issues I was having, I backed out of the RPM-based install, and ran through a stock yum install. After doing that, I started seeing these weird errors in mysqld.log:

120328 21:32:40 [ERROR] Can't start server: Bind on TCP/IP port: Address already in use
120328 21:32:40 [ERROR] Do you already have another mysqld server running on port: 3306 ?
120328 21:32:40 [ERROR] Aborting

If I run "netstat -nat | grep 3306" in my terminal, I get the following:

tcp4 0 0 *.3306 . LISTEN

I spent hours spinning my wheels and not able to figure out what was happening here. At first it seemed a leftover pid file was the culprit. In the end it appeared the *old* /etc/init.d/mysql script was still in place, and the new yum packages wouldn't work with that.

I ended up just scrapping the whole box, and starting from scratch. Sometimes you have to do that. After a clean build, all was fine.

Related: RDS or MySQL 10 Use Cases

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don't work with recruiters

Business Agility at AWS re:Invent

Also find Sean Hull’s ramblings on twitter @hullsean.

Although I couldn’t be in Vegas to attend re:Invent, there is so much online it’s almost better than being at the conference. From an ongoing live stream of keynotes and sessions, to an archived collection on Youtube.

The big wins

You may have heard of all the great things that Amazon or cloud computing can do, but I thought Andy Jassy summarized these nicely in these six points.

1. Replace capex with opex
2. Lower total cost of ownership
3. No guessing about capacity
4. Encourage agility & innovation
5. Differentiation
6. Global from the start

Redshift

By far the biggest announcement at the show is Amazon’s new Redshift product. It is a fully managed data warehouse solution that scales to petabytes in its cloud. Currently two business intelligence tools are supported, namely Jaspersoft and MicroStrategy.

[quote]
In 2003 Amazon was a 5 billion dollar company. Today AWS adds that same infrastructure capacity every day to its availability zones!
[/quote]

Reduced prices by 25% for S3

As a lot of folks know, Amazon has always been about cheaper prices. That model has been disruptive in the bookselling industry, and in a huge way in the infrastructure and datacenter industry. As more customers sign up, economies of scale mean they can offer the same hardware & services for lower prices.

With that they’re announcing lower prices for S3 by a whopping 25%. To me this speaks to their continuing push to dominate the market by driving prices downward.

Amazon’s Channel on Youtube

If you weren’t able to attend the conference, or want to recap some highlights you might have missed, they have put up a great AWS Channel on Youtube.

Some of the speakers include Sharon Chiarella, VP Mechanical Turk; Glenn Hazard, CEO of Xceedium; Todd Barr, CMO of Alfresco; Bright Fulton, Operations for Swipely; Colin Percival, FreeBSD Developer; Ted Dunning, Chief Application Architect of MapR Technologies; James Broberg, CTO & Founder of MetaCDN; Mitchell Garnaat, Sr. Engineer; David Etue, Vice President, SafeNet; and Mike Culver, Sr. Consultant, to name just a few.

Read this far? Grab our Scalable Startups for more tips and special content.

Cloud Operations Interview

What does a cloud computing expert need to know? How do you hire a cloud computing expert? Competition for operations & DBAs is fierce, so you’ll want to know how to find the best.

If you’re a systems administrator or ops guy, you may want to prepare for an interview for such a position. Meanwhile, if you’re a director of IT or operations, a recruiter or a manager in HR, you’ll want to have some idea how to find the right candidate.

Here’s my guide to do just that. You may also jump to part two Cloud Deployment Interview or the last part three Cloud DBA, Architecture and Management Interview.

1. Solid unix systems administrator

At the top of the list, a cloud operations expert needs to understand Unix and more importantly Linux. Here are some sample questions to get the conversation moving:

o What is web operations and what have you done day-to-day?

Prepare some stories.

o What’s your favorite feature of the linux kernel?

This is an open ended question, but a systems administrator should have some knowledge here. The kernel is the most basic piece of software that runs when a computer boots up, whether it is a desktop or a server. This piece of software coordinates everything, manages resources, and directs traffic.

o Name some distributions of linux. What is a distro?

Linux is built by a collaborative team of thousands on the internet. That’s what makes it open source. A distribution includes the operating system along with a collection of software to go with it. All the supporting utilities, libraries and servers are compiled and held in a repository. That’s what makes up a distribution. Debian, Redhat and Ubuntu are a few popular ones.

[quote]
A cloud operations expert needs to have a wide ranging skillset, from unix administration, architecture, scalability, database & webserver administration, troubleshooting & performance, load & stress testing. You’ll also want someone who has learned hard lessons from some failures, has some war stories to tell and has a hard nose for stability.
[/quote]

o What’s the difference between apache and nginx?

These two pieces of software are both webservers; that is, they respond to the HTTP protocol and can serve HTML pages. They also have a myriad of plugins to support different languages and features. The difference? Nginx (pronounced engine-X) is a newer incarnation. It’s been rearchitected from the ground up, building on all the things learned from Apache over the years. It has tighter, more efficient code, and is easier to configure.

You might also enjoy our Intro to EC2 Cloud Deployments Guide.

o What is a key value store? examples?

There are lots of examples of these types of databases. They are a very simple memory cache that can interface with most applications. Memcache is a popular example of a key value store. Redis, CouchDB and Voldemort can also do this.

o What is a page cache? Reverse proxy cache? examples?

These are all the same thing. They are basically very minimal webservers without all the plugins or bells and whistles. You put one in front of your webserver to handle all the easy stuff, and speed up overall throughput. Varnish is a popular example.

o What filesystem do you prefer?

This is a bit arcane, but one should have some opinions here. xfs is a popular filesystem, though ext3 and ext4 are also common. Emphasize the journaling aspect here. Journaling means that if you pull the cord or your server crashes, the filesystem can recover upon reboot. It does this by journaling changes, much as a database keeps a redo log of recent changes to its tables.

o Command line tools

There are lots of commands in the day-to-day toolbox of a web ops expert. Here are some examples:
rsync (pronounced our-sync) – sync files between servers & do checksums to allow easy restarts
scp (pronounced s-c-p) – secure copy, similar to rsync but no checksums, so less reliable
curl (pronounced kurl) – diagnose & test urls and HTTP from the command line
cron (pronounced cron) – run commands at scheduled times
ssh (pronounced s-s-h) – secure shell, the most basic tool to reach a cloud server
ifconfig (pronounced if-config) – check the network interfaces on the server
vi/emacs (pronounced v-i and e-macks) – terminal editors, to modify config files
uptime (pronounced up-time) – display the current load average of the server
top (pronounced top) – interactive display of system metrics like memory, load, swap & processes
ps (pronounced p-s) – shows running processes on the server
/var/log/messages – essential system logfile

o What are application servers? How are they different from webservers?

Tomcat & Glassfish are two examples of application servers. These handle heavier weight languages & applications like Java. An application server on some level is just a more heavy-duty webserver, and these days Apache can be thought of as an application server also.


2. Cloud concepts

o What is virtualization? What is a hypervisor?

Virtualization allows you to run one or more computers within a computer. You can do virtualization on a desktop, sharing network, memory, cpu and disk resources among a number of virtual servers. But more importantly in cloud computing or IaaS offerings you can do virtualization at the datacenter level. The hypervisor layer is a datacenter virtualization technology that provisions server resources, and balances shared network and disk resources.

o What is an image?

In the Amazon world, the AMI or Amazon machine image is a snapshot of a server’s state at one moment in time. The image is taken at the block level, and includes the master boot record, the first block on disk that a server boots from. Everything that makes up the state of a server when it is shut down is stored on disk, and hence in the image: all config files, logfiles, and anything else written to disk.

o What is multi-tenant?

This means multiple tenants share the same physical servers and resources. The tenants are the customers, who each want to get the server, cpu, memory, network and disk that they paid for.

o What is the downside to shared resources?

Contention for resources is always the challenge. If your fellow tenants are not very thirsty, this can work to your advantage. But if they’re also heavy users, the hypervisor layer has to manage the balancing act. You may get a spike of disk I/O at one point, but later get a dearth. This can cause a relational database like MySQL or Oracle to suddenly look stalled.

o What is instance-store? What is ebs?

Instance-store servers were Amazon’s original offering, where servers had their own local (and slow) storage. This storage was ephemeral, so all machine state was lost when the instance was stopped or failed. These servers also boot slowly. EBS, also known as Elastic Block Storage, is a virtualized storage option, similar to NAS or NFS. You can create arbitrary chunks of storage and attach them to servers, all from command line APIs. Cool!
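For example, carving off a chunk of storage and bolting it onto a running server is two API calls. A sketch with hypothetical IDs:

$ aws ec2 create-volume --size 50 --availability-zone us-east-1a
$ aws ec2 attach-volume --volume-id vol-abc123 --instance-id i-abc123 --device /dev/sdf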

o What is virtual private cloud?

With the VPC offering, Amazon effectively drops a router into your existing datacenter. You can then provision virtual servers to your heart’s content, and they all appear to be servers on your existing network. Elastically scale, within the network and security model you’re already using.

o What is a hybrid approach to cloud adoption?

Keeping your investments in hardware and datacenter is obviously an appealing option for firms that have a large existing environment. A hybrid approach with a VPC allows you to get your feet wet, but still keep essential applications on physical servers.

o What is Amazon EC2?

Elastic Compute Cloud refers to the virtual servers you spinup in Amazon Web Services.

o What is Amazon RDS, Oracle RDS, Mysql RDS?

Amazon has various relational and non-relational database offerings. RDS stands for relational database service.

RDS or roll your own – which is better? Here are some use cases to help you decide.

o What is multi-az?

Amazon’s infrastructure offering isn’t just a single datacenter with servers. The beauty of what they’ve built is that they offer a number of datacenters (called availability zones) in each of many regions such as Northern Virginia, Oregon and Singapore.

Incidentally multi-az is a key feature of how businesses can protect themselves from failure. Amazon recently had an outage, but AirBNB, Reddit & Foursquare didn’t have to fail.

o What does a CDN do? How does it work? examples?

A CDN is a content delivery network. Remember all those files that make up a webpage? Images, video, css files? Turns out serving these components from servers *closer* to your customer makes their webpages load much faster. CDNs are networks of servers that hold the content of your pages, and serve them faster.

It works by replacing content paths with a special one from your provider. A simple change in your code will allow content to dynamically load from across the web. Cool!

CloudFront is Amazon’s offering coupled with S3 for file storage. Akamai is another big provider.

We’re not done yet. In part two on deployments and part three of this series, we’ll hit on other important skills a cloud ops expert should have, including scripting, database administration (our MySQL Interview Guide), scalability, performance, configuration management, metrics, monitoring, and some all-important war stories!

Here are some questions to pique your interest:

o Why does the API battle between Amazon & Eucalyptus (FOSS) matter?
o Do you use command line tools? why?
o What can go wrong with backups? how do we test them?
o Should we encrypt filesystems in the cloud? what are the risks?
o Should we use offsite backups?
o What is DRBD?
o Why is auditing important? access control?
o What is load balancing? why is it difficult with databases?
o How do you perform a benchmark? perform load testing?
o Why use a package manager? can we install from source?

Our Deploying MySQL on Amazon EC2 Guide is also related to this interview process.

You may also jump to part two Cloud Deployment Interview or the last part three Cloud DBA, Architecture and Management Interview.

Read this far? Grab our newsletter – startup scalability.

AirBNB didn't have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s cloud infrastructure. AirBNB was one of the biggest, but Heroku, Reddit, Minecraft, Flipboard & Coursera went down too. It’s not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is a zone, and there are many in each of their service regions including US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud.

Today one of those datacenters in the Northern Virginia region had a failure. What does this mean? Essentially firms like AirBNB that hosted their applications ONLY in Northern Virginia experienced outages.

As it turns out, Amazon has a service level agreement of 99.95% availability. We’ve long since said goodbye to the five nines. HA is overrated.

2. Use Redundancy

Although there are lots of pieces and components to a web infrastructure, two big ones are webservers and database servers. Turns out AirBNB could make both of these tiers redundant. How do we do it?

On the database side, you can use Amazon’s multi-az or alternately read-replicas. Each has different service characteristics, so you’ll have to evaluate your application to figure out which will work for you.
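For instance, adding an RDS read-replica is a single API call. A sketch with hypothetical instance names:

$ aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-replica \
    --source-db-instance-identifier mydb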

Then there is the option to host mysql or Percona directly on Amazon servers yourself and use replication.

[quote]Using redundant components like placing webservers and databases in multiple regions, AirBNB could avoid an Amazon outage like Monday’s that affected only Northern Virginia.[/quote]
When do I want RDS versus mysql? Here are some use cases for RDS versus roll your own MySQL.

Now that you’re using multiple zones and regions for your database, the hard work is done. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

3. Have a browsing only mode

Another step AirBNB can take to be resilient is to build a browsing only mode into their application. Often we hear about this option for performing maintenance without downtime. But it’s even more valuable during a situation like this. In a real outage you don’t have control over how long it lasts or WHEN it happens. So a browsing only mode can provide real insurance.

For a site like AirBNB this would mean the entire website stays up and operating. Customers could browse and view listings; only when they went to book a room would they encounter an error. That affects a very small segment of customers, and makes for a much less painful PR problem.

Facebook has experienced intermittent outages of its service. People hardly notice, because they’ll often only see a message when they’re trying to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

[quote]A browsing only mode can make a big difference, keeping most of the site up even when transactions or publishing are blocked.
[/quote]

Drupal, an open source CMS that powers sites like Adweek.com, TheHollywoodReporter.com, and Economist.com, supports a browsing only mode out of the box. An Amazon outage like this one would only stop editors from publishing new stories temporarily. A huge win for sites that get 50 to 100 million with-an-m pageviews per month.

4. Web Applications need Feature Flags

Feature flags give you an on/off switch. Build them into heavy-duty parts of your site, and you can disable those in an emergency. Host components in multiple availability zones for extra peace of mind.

One of our all time most popular posts 5 Things Toxic to Scalability included some indepth discussion of feature flags.

5. Consider Netflix’s Simian Army

Netflix takes a very progressive approach to availability. They bake redundancy and automation right into all of their infrastructure. Then they run an app called the Chaos Monkey which essentially causes outages, randomly. If resilience from constantly falling and getting back up can’t make you stronger, I don’t know what can!

Take a look at the Netflix blog for details on intentional load & stress testing.

6. Use multiple cloud providers

If all of the above isn’t enough for you, take it a step further and do as George Reese of enstratus recommends: use multiple cloud providers. Not being beholden to one company could help in more situations than just these types of service disruptions.

Basic EC2 Best Practices mean building redundancy into your infrastructure. Multiple cloud providers simply take that one step further.

Read this far? Grab our newsletter on scalability and startups!