Does AWS have a dirty little secret?


I was recently talking with a colleague of mine about where AWS is today. Obviously companies are migrating to EC2 & the cloud rapidly. The growth rates are staggering.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

The question was…

“What’s good and bad with Amazon today?”

It’s an interesting question. I think there are some dirty little secrets here, but also some very surprising bright spots. This is my take.

1. VPC is not well understood  (FAIL)

This is the biggest one in my mind.  Amazon’s security model is all new to traditional ops folks.  Many customers I see deploy in “classic EC2”.  Others deploy haphazardly in their own VPC, without a clear plan.

The best practice is to have one or more VPCs, each with private & public subnets.  Put databases in private, webservers in public.  Then create a jump box in the public subnet and funnel all ssh connections through it: allow any source IP, use individual user accounts for authentication & auditing (only on this box), then use google-authenticator for two-factor at the command line.  This also provides an easy way to decommission accounts and lock out users who leave the company.

However, most customers have done little of this, or some of it but not all.  So getting to best practices around VPC would mean deploying a VPC as described, then moving each and every one of your boxes & services over there.  Imagine the risk to production services.  Imagine the chances of error, even if you’re using Chef or your own standardized AMIs.
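
To make that concrete, here’s a minimal boto3 sketch of the skeleton above: a VPC with one public and one private subnet, an ssh-only security group for the jump box, and a database security group that only accepts traffic from inside the VPC. The names, CIDR ranges, region & database port are made up for illustration, and a real build-out would also need an internet gateway, route tables, NAT, and the google-authenticator setup on the jump box itself.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# One VPC, two subnets: public for the jump box & webservers,
# private for the databases. CIDR blocks here are just examples.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]
public = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24")["Subnet"]
private = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24")["Subnet"]

# Jump box security group: ssh open to any source IP, since authentication
# & auditing happen via per-user accounts + google-authenticator on the box.
jump_sg = ec2.create_security_group(
    GroupName="jump-box-sg", Description="ssh entry point", VpcId=vpc["VpcId"]
)
ec2.authorize_security_group_ingress(
    GroupId=jump_sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)

# Database security group: only accept connections originating inside
# the VPC (webservers, jump box), never from the internet.
db_sg = ec2.create_security_group(
    GroupName="db-sg", Description="private database access", VpcId=vpc["VpcId"]
)
ec2.authorize_security_group_ingress(
    GroupId=db_sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
    }],
)
```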

Also: Are we fast approaching cloud-mageddon?

2. Feature fatigue (FAIL)

Another problem is a sort of “paradox of choice”.  That is, Amazon is releasing so many new offerings so quickly that few engineers know them all.  So you find a lot of shops implementing things wrong because they didn’t understand a feature; in other words, AWS had already solved the problem for them.

OpenRoad comes to mind.  They’ve got media files on the filesystem, when S3 is plainly Amazon’s purpose-built service for this.  
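
To show how small the change can be, here’s a hedged sketch (the bucket name, file path & key are hypothetical) of pushing a media file to S3 and handing out a time-limited URL, instead of serving it off the webserver’s local disk:

```python
import boto3

s3 = boto3.client("s3")

# Upload the media file once; serve it from S3 (or CloudFront in front
# of it) instead of from the local filesystem.
s3.upload_file("/var/www/media/intro.mp4", "example-media-bucket", "videos/intro.mp4")

# Hand out a time-limited URL if the object shouldn't be public.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-media-bucket", "Key": "videos/intro.mp4"},
    ExpiresIn=3600,  # one hour
)
print(url)
```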

Is AWS too complex for small dev teams & startups?

Related: Does Amazon eat its own dogfood? Apparently yes!

3. Required redundancy & automation  (FAIL)

The model here is what Netflix has done with Chaos Monkey.  They literally knock machines offline to test their setup.  The problem is detected, and new hardware is brought online automatically.  Deploying across AZs is another example.  As Amazon says: we give you the tools, it’s up to you to implement the resiliency.

But few firms do this.  They’re deployed on Amazon as if it’s a traditional hosting platform.  So they’re at risk in various ways.  Of Amazon outages.  Of hardware problems under the VMs.  Of EBS network issues, of localized outages, etc.
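
For illustration, a Chaos Monkey style experiment can be sketched in a few lines of boto3: pick a random running instance carrying a hypothetical “chaos-eligible” tag and terminate it, trusting the autoscaling group to bring a replacement online. This is just a sketch of the idea, not Netflix’s actual tooling, and you’d only point it at infrastructure whose recovery automation is already in place.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances opted in to the experiment via a tag.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-eligible", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim}; the autoscaling group should replace it.")
    ec2.terminate_instances(InstanceIds=[victim])
else:
    print("No chaos-eligible instances running.")
```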

Read: Is Amazon too big to fail?

4. Lambda  (WIN)

I went to the serverless conference a week ago.  It was exciting to see what’s happening.  It is truly the *bleeding edge* of cloud.  IBM & Azure & Google all have a serverless offering now.

The potential here is huge.  Eliminating *ALL* of the server management headaches, from packages to config management & scaling, and hiding all of that could have a huge upside.  What’s more, it takes the on-demand model even further.  You have no compute running idle until you hit an endpoint.  Cost savings could be huge.  Wonder if it has the potential to cannibalize Amazon’s own EC2 …  we’ll see.
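
For anyone who hasn’t looked at it yet, the programming model is just a handler function.  Here’s a minimal, hypothetical Python handler that could sit behind an API Gateway endpoint; there’s no server to patch or scale, and nothing runs (or bills) between requests:

```python
import json


def handler(event, context):
    """Minimal Lambda handler: it only runs when the endpoint is hit,
    so there is no idle compute to pay for between requests."""
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello {name}"}),
    }
```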

Charity Majors wrote a very good critical piece – WTF is Operations? #serverless

Also: Is the difference between dev & ops a four-letter word?

5. Redshift  (WIN)

Seems like *everybody* is deploying a data warehouse on Redshift these days.  It’s no wonder, because they already have their transactional database, the one backing their web app, on RDS of some kind.  So it makes sense that Amazon would build an offering for reporting.

I’ve heard customers rave about reports that took 10 hours on MySQL running in under a minute on Redshift.  It’s not surprising, because MySQL wasn’t built for the size of servers it’s being deployed on today, so it doesn’t make good use of all that memory.  Even with SSD drives, query plans can execute badly.
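
Since Redshift speaks the Postgres wire protocol, pointing an existing report at it can be as simple as swapping the connection string and using a standard driver.  The cluster endpoint, credentials, table & query below are made up for illustration:

```python
import psycopg2

# Redshift speaks the Postgres protocol, so a standard driver works.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="warehouse",
    user="report_user",
    password="change-me",
)

with conn.cursor() as cur:
    # The kind of full-table aggregation that crawls on a row store
    # but suits a columnar warehouse.
    cur.execute("""
        SELECT order_date, SUM(total) AS revenue
        FROM orders
        GROUP BY order_date
        ORDER BY order_date
    """)
    for order_date, revenue in cur.fetchall():
        print(order_date, revenue)

conn.close()
```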

Also: Is there a better way to build a warehouse in 2016?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest: Why I don’t work with recruiters


Also published on Medium.

  • lawrencekrubner

    I really should have included this quote in my essay about AWS requiring a specialist: “Another problem is a sort of ‘paradox of choice’. That is, Amazon is releasing so many new offerings so quickly that few engineers know them all.”

    That is another aspect of it. Again, AWS becomes appropriate as soon as an organization is large enough that it can hire a devops specialist such as yourself, Sean. But until that moment, there are simpler services that are easier for engineers like me, who know basic Linux devops, but who don’t know all the details of AWS (and who don’t have time to keep up with AWS, because our primary responsibility is to keep up with application-specific libraries such as Immutant, Datomic, Automat, Manifold, Elixir, Phoenix, Sidekiq, Aleph, etc.).

  • http://www.iheavy.com/blog/ Sean Hull

    “Hi :) just wanted to say I read your blog all the time, and you have some interesting stories and battle scars and have seen enough production fires that you clearly know your stuff.

    One thing I had a question about is this statement you made regarding VPCs:

    The best practice is to have one or more VPCs, each with private & public subnets. Put databases in private, webservers in public. Then create a jump box in the public subnet and funnel all ssh connections through it: allow any source IP, use individual user accounts for authentication & auditing (only on this box), then use google-authenticator for two-factor at the command line. This also provides an easy way to decommission accounts and lock out users who leave the company.

    You mention that you should put web servers on public. Actually, a recommendation I make is to always put web servers and application servers on a private subnet.

    Then set up public load balancers on the public subnet and point them at your web servers or application servers that live on the private subnets. I think maybe that would be a better recommendation. A lot of people don’t understand security groups and end up exposing their web servers, because the developers will probably open ssh to get into those servers for debugging. Also, jump boxes are a pain because you have to tunnel ssh connections all the time. What I do is set up OpenVPN and then have OpenVPN push the routes that I want the VPN clients to access, which is mostly the internal application servers.

    Curious if my approach makes sense vs your recommendations.

    Thanks in advance.

    Marco Maldonado”

    • http://www.iheavy.com/blog/ Sean Hull

      Hi Marco, thx for your email!

      As I recall, reading all the AWS docs & A Cloud Guru lectures on the topic, they *do* recommend putting the webservers in the public subnet, but without giving them their own public IP addresses. I’m not sure why they recommend this, because as you say, only the load balancer remains exposed.

      To be fair, networking has never been my strong suit. So your advice may be ideal. :)

      As far as the jump box, one cool reason to do it that way is when you have outgoing employees. *EACH* user has their own login on the jump box. Then key-based logins are set up to each of the internal boxes. You kill two birds with one stone. You have auditing, because everyone comes through the jump box, so you know when people are logged in and where. And when you then want to lock someone out, you just disable their jump box account, since nobody has the private key for the individual boxes.

      I’m going to add this to the disqus comments.

      -Sean