
Does Amazon’s security work well for startups?


I was sifting through my project & progress reports from former clients today. Something struck me loud and clear. It seems 4 out of 5 of them don’t implement VPC best practices.


Which raises the question again and again: is the service just too damn complicated? I wrote about this topic before… Is aws a bit too complex for most or at least smaller dev teams?

1. No private subnets

What are those you ask? I really hope you’re not asking that.

The best-practices way to deploy on Amazon is to use a VPC. This provides a logical grouping. You could have dev, stage and prod VPCs, and perhaps a utility one for other more permanent services.

Within that VPC, you want everything deployed in one or more private subnets. Each subnet is mapped to a specific AZ in that region. The AZ maps to a physical datacenter, a single building within that region. These private subnets have *NO route to the internet*.

How do you reach resources in the private subnet? You must be coming from the public subnet deployed within that same VPC. All the routing rules enforce this. The two types of resources that belong in the public subnet: a load balancer for 80/443 traffic, and a jump or bastion box for ssh.
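
To make that concrete, here’s a minimal Terraform sketch of a private subnet, assuming a vpc resource named aws_vpc.main already exists (the names here are hypothetical):

# private subnet in a single AZ
resource "aws_subnet" "private" {
  vpc_id            = "${aws_vpc.main.id}"
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

# route table with NO 0.0.0.0/0 route to an internet gateway
resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.main.id}"
}

resource "aws_route_table_association" "private" {
  subnet_id      = "${aws_subnet.private.id}"
  route_table_id = "${aws_route_table.private.id}"
}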

Read: How can 1% of something equal nothing?

2. Security groups with all ports open

Another thing that I see more often than you might guess is all ports open by some wildcard rule. *BAD*. We all know it’s bad, but it happens. And then it gets forgotten. We see developers doing it as a temporary fix to get something working and forget to later plug up the hole.

Even for security groups that don’t have this problem, they often allow port 22 from anywhere on the internet (0.0.0.0/0). This is unnecessary and rather reckless. Everyone should be coming from known source IPs. This can be an office network, or it can be some other trusted server on the internet. Or a block of IPs that you’ll always have assigned.
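
A Terraform rule along these lines locks ssh down to a known block. The office CIDR and the bastion security group name are hypothetical:

resource "aws_security_group_rule" "office_ssh" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["203.0.113.0/24"]                  # your office or trusted block
  security_group_id = "${aws_security_group.bastion.id}"  # hypothetical bastion SG
}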

And of course don’t have your database port open. MySQL and Postgres don’t have particularly great protections here.

Related: Is Fred Wilson right about dealing in an honest, direct and transparent way?

3. No flowlogs enabled

Flowlogs let you log traffic at the network level. Want to know about failed ssh attempts? Log that. Want to know about other ports? Log that too.

If you are funneling all your connections through a jump box, you can configure your VPC flowlogs to monitor just that box itself. You may also want to watch what’s happening with the load balancer too.

Flowlogs work at the network interface layer of your VPC, so you’ll need to understand VPCs in depth.
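
Here’s a minimal sketch of enabling them with Terraform, assuming you already have an IAM role that can write to CloudWatch Logs (the names are hypothetical):

resource "aws_flow_log" "vpc" {
  vpc_id         = "${aws_vpc.main.id}"            # hypothetical vpc resource
  traffic_type   = "ALL"                           # or just ACCEPT / REJECT
  log_group_name = "vpc-flowlogs"
  iam_role_arn   = "${aws_iam_role.flowlogs.arn}"  # role that writes to CloudWatch Logs
}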

Related: What mistakes did you make when starting as a consultant?



How do we secure an existing aws hosted application?


What if you don’t have the luxury of a greenfield? You are looking at an already built application, and asking yourself, how do I secure this?


One can think of it as a giant labyrinth, with many turns and many paths. Some of those paths have not had light shining in them for some time. So you’ll need to be cautious, thorough, and vigilant.

Here are some notes on where to start.

1. Scanning – code

One area you’ll need to dig into is the application code itself. If you don’t have the luxury to push new code, you’ll need to verify what version is deployed, and scan the repository for keys or passwords. You can also scan on the server itself. Better to double your efforts.
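
A couple of quick and dirty greps can get you started. The patterns here are just examples; a real secrets scanner casts a wider net:

$ git grep -nE "AKIA[0-9A-Z]{16}"           # AWS access key ids in the working tree
$ git grep -niE "password|secret|api_key"   # obvious credential names
$ git log -p | grep -cE "AKIA[0-9A-Z]{16}"  # keys committed in history, then removed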

Read: What do the best engineers do better?

2. Scanning – network

Your VPC is obviously your first layer of defense. Scan the routing tables and network ACLs, to make sure there aren’t open ports or overly broad whitelisted IPs.

Do the same sort of review for security groups, as those are an alternative method for configuring access to servers.

AWS has a service called Flowlogs, which can be enabled. These give you detailed network layer logging, which you can then scan for trouble.

Related: Is Fred Wilson right about dealing in an honest, direct and transparent way?

3. Scanning – IAM, keys & console

Your existing devs probably have keys to some or all of the EC2 boxes. If you don’t want to relaunch all of these boxes with new keys, or don’t have the luxury to do that, you’ll need to lock down the security groups, whitelisted IPs and VPC routing rules.

You’ll also need to carefully review IAM roles & policies. Amazon Inspector may be a useful tool to scan your environment, find glaring holes, and enforce best practices. But you’ll also want to do your own scanning, both automated and manually eyeballing the accounts.

You’ll also want to lock down console access, especially the root account, and any others that have administrator privileges. Enable password policies and password rotation, as well as multi-factor authentication. There is also a nice toggle for “alert on login”. You certainly want to know about those!
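
A few aws cli calls along these lines can help; the specific policy numbers below are just examples:

$ aws iam update-account-password-policy \
    --minimum-password-length 14 \
    --require-symbols --require-numbers \
    --max-password-age 90 \
    --password-reuse-prevention 5
$ aws iam list-virtual-mfa-devices
$ aws iam get-account-summary | grep -i mfa   # is root MFA enabled?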

Related: What mistakes did you make when starting as a consultant?

4. Scanning – services

Review all of the AWS services that are deployed. Ask yourself some of these questions (the CLI sketch after this list can help answer them):

o which regions & availability zones am I deployed in?
o what elastic IPs do I have configured where?
o what IAM roles & policies do I have created?
o what databases, API gateways & S3 buckets are configured?
o etc…
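
A few read-only aws cli calls make a decent starting point for that inventory:

$ aws ec2 describe-regions --query "Regions[].RegionName"
$ aws ec2 describe-addresses                  # elastic IPs in the current region
$ aws iam list-roles
$ aws rds describe-db-instances --query "DBInstances[].DBInstanceIdentifier"
$ aws s3 ls                                   # buckets are global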

Cloudtrail can be a great help here as it can log all sorts of useful information. You can then scan those logs for problems.

Related: Why did mailchimp fraudulently charge my credit card?

5. Rebuilding

The scanning approach can work, but there is a strong need to be thorough. If you miss one whitelisted IP or existing ssh key, you can leave the whole network open to a crafty intruder.

Another option is to rebuild the whole application. This gives you the time to:

o automate the whole stack with terraform
o test that everything is working
o plan for failover
o ensure that every bucket is secure, with lifecycle policies enabled (see the sketch after this list)
o ensure that every EBS volume is encrypted
o enable cloudtrail, cloudwatch etc

o potentially setup in a *brand new* aws account, for even more confidence
o backup all the pieces of the application as you go
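
For the bucket piece, a minimal Terraform sketch might look like this. The bucket name is hypothetical, and the transition window is just an example:

resource "aws_s3_bucket" "backups" {
  bucket = "my-backups-bucket"      # hypothetical, must be globally unique
  acl    = "private"

  lifecycle_rule {
    enabled = true

    transition {
      days          = 30
      storage_class = "GLACIER"     # age old objects into cheaper storage
    }
  }

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "aws:kms"
      }
    }
  }
}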

Read: Did Disney+ have to fail?



When I found gold in my customer archives


I’m good at keeping notes. I’ve blogged about Can progress reports & daily notes help engagements succeed. I would give that an emphatic YES!

From helping with communication, to sharing arcane details about blocking issues, struggles & hurdles, notes can illuminate things that a CTO or manager may not otherwise be aware of.


I was digging through my archives recently, and found a bunch of notes on how to think about Terraform. In particular, how do you think about infrastructure as code? How do you architect to make it all work together?

1. A dead end started me backtracking

You’re going to dig in by getting your application working. To do that you’ll spinup a VPC, private & public subnets, bastion boxes, ECS hosts to deploy containers to, and an application load balancer endpoint. Getting all that working wasn’t terrible. We even included a prometheus node, to give us some monitoring visibility. Then we added our jenkins server into the mix. Do you see where this is going?

At a certain point we of course needed to destroy the whole setup, but didn’t want to destroy the CI pipeline. Duh! And what about monitoring? Lose all that data each time? No way!

Read: Infrastructure provisioning – what is it and why is it important?

2. Organize around VPCs

After dragging yourself through that, you see a bit better. It’s like standing at 20,000 feet.

Your VPC is a logical collection of instances. A box that holds your application, provides security, and gets created and destroyed with it. You can even see it in your Terraform code: a subnet requires a VPC id within which to create it. And an instance requires a subnet within which to create it. For security reasons the application instances will sit within PRIVATE subnets, and only bastion boxes & load balancers will sit in PUBLIC subnets.

To my mind that means each environment (DEV, STAGE, PROD) gets its own VPC. This also allows you to control who can access stage & production, as they have their own bastion access points.

Read: How to hire a developer that doesn’t suck

3. Build a utility VPC

What you’ll also see from the above story is that you need a place for business-wide, non-application services to sit. Welcome the UTILITY VPC!

This can contain prometheus, ELK or other log collection service, your jenkins or other CI pipeline, and any other services that don’t logically fit within the application VPC.

Related: Why generalists are better at scaling the web

4. A VPC should be ok to destroy and rebuild in another region – in one click

When you use infrastructure code, you want to test, create & destroy often. That shouldn’t disrupt anything. That means all state data should sit outside of those instances. Logging data, send it to logstash or cloudwatch. Application state, keep that inside of an RDS instance. And you’ve tested those backups right?

Speaking of RDS, I encountered problems with Amazon’s own backup & restore. I had enough trouble that I ended up writing a custom db dump script. That may require a custom restore too, so buyer beware. Here’s my story though… I tried to build infrastructure as code with Terraform and Amazon and it didn’t go as I expected.
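
If you go the custom dump route like I did, the core of it isn’t much more than this (the endpoint, user and database name are placeholders):

$ mysqldump --single-transaction --routines --triggers \
    -h mydb.xxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p mydb \
    | gzip > mydb-$(date +%F).sql.gz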

You also may encounter issues when you move across regions, such as elastic IPs and so forth. And you’ll need to check and verify the code which creates and destroy S3 buckets and domain name certs. These areas gave me some hiccups, but you can work through it with diligence!

Read: Is zero downtime even possible on RDS?



Viktor Farcic Interview excerpts


I recently did an interview with Viktor Farcic all about operations, DBA & Devops. Here are some excerpts: What does Dev-Ops mean?


Continuing where I left off, I’ve included a few more highlights below. Enjoy!

1. Can I use a tool to migrate to the public cloud?

Viktor Farcic: I’ve seen quite a few of these tools that tell you if you buy our tool, we’re going to transfer whatever you have to the cloud. For example, Docker announced in the last DockerCon that they’re going to put in containers without a single change and everything will work. What do you think about that?

Sean Hull: Salespeople often simplify things quite a bit in order to sell a product; in my experience, the devil is in the detail. It’s not to say that an automation tool like that might not be valuable and useful. It might be a good first step to getting your application in the cloud, and it might be an easier way than to rebuild everything one by one. But I doubt that it’s going to work magically just by one script.
EC2 instances, for example, have different performance characteristics, not only in terms of the disk I/O, memory, and CPU, but in smaller instances, they actually throttle the network access so you might spin up an instance and it just might not behave well. It might take time. In fact, all sorts of things could happen. You might have written MySQL scripts that assume you have root access to the server and then you rebuild that in an RDS and you get errors because you don’t have access to those resources on RDS. There’s a lot of things to consider.

Read: What happened when I offered advice outside my pay grade?

2. How do you adapt to change?

Viktor Farcic: I have the impression that the speed with which new things are coming is only increasing. How do you keep up with it, and how do companies you work with keep up with all that?

Sean Hull: I don’t think they do keep up. I’ve gone to a lot of companies where they’ve never used serverless. None of their engineers know serverless at all. Lambda, web tasks, and Google Cloud functions have been out for a while, but I think there are very few companies that are able to really take advantage of them. I wrote another blog post called Is Amazon Web Services Too Complex for Small Dev Teams? where I sort of implied that it is.
I do find a lot of companies want the advantage of on-demand computing, but they really don’t have the in-house expertise yet to really take advantage of all the things that Amazon can do and offer. That’s exactly why people aren’t up to speed on the technology, as it’s just changing so quickly. I’m not sure what the answer is. For me personally, there’s definitely a lot of stuff that I don’t know. I know I’m stronger in Python than I am with Node.js. Some companies have Node.js, and you can write Lambda functions in Java, Node.js, Python, and Go. So, I think Amazon’s investment in new technology allows the platform to evolve faster than a lot of companies are able to really take advantage of it.
Read: What did Matt Ranney discover scaling Uber to 1000 microservices?

3. What is the future of Devops?

Viktor Farcic: I’m going to ask you a question now that I hate being asked, so you’re allowed not to answer. Where do you see the future, let’s say a year from now?

Sean Hull:
I see more fragmentation happening across the technology landscape, and I think that that is ultimately making things more fragile because, for example, with microservices, companies don’t think twice about having Ruby, Python, Node.js, and Java. They have 10 different stacks, so when you hire new people, either you have to ask them to learn all those stacks or you have to hire people with each of those individual areas of expertise. The same is true with all these different clouds with their own sets of features: there’s a fragmentation happening.
Let’s look at the iPhone as an example. Think about how complex application testing is for Android versus the iPhone. I mean, you have hundreds of different smartphones that run Android, all with different screen sizes, different hardware, different amounts of memory, and the underlying stuff. Some may even have some extra chips that others don’t have, so how do you test your application across all those different platforms?
When you have fragmentation like that, it means the applications end up not working as well. I think the same thing is happening across the technology spectrum today that happened 10 to 15 years ago, where for your database backend there was Oracle, SQL Server, MySQL, and Postgres. Maybe somebody who’s a DB2 enterprise customer uses DB2, but now there are hundreds of open source databases, graph databases, and DynamoDB versus Cassandra, and so on and so on. There’s no real deep expertise in any of those databases.
What ends up happening is you have cases like what happened with customers who were using MongoDB. They found out the hard way about all of the weird behaviors and performance problems it had, because there just weren’t people around with deep knowledge of what was happening behind the scenes, whereas in Oracle’s space, for example, there are career DBAs that are performance experts that specialize in Oracle internals, so you can hire somebody to solve particular problems in that space.
There aren’t, as far as I know, a lot of people with MongoDB internals expertise. You’d have to call MongoDB themselves; maybe they have a few engineers that they can send out, so what’s the future? I see a lot of fragmentation and complexity, and that makes the internet and internet applications more fragile, more brittle, and more prone to failure.

Related: Can humility help you in your career?



Should we right size instances in the public cloud?


I’m a big fan of Corey Quinn’s Last Week in AWS newsletter.

Recently he wrote a piece titled Right Sizing your instances is nonsense.

Not to be outdone, blogger Joe at Sunshower.io wrote a counterpoint piece, Why Right Sizing instances is not nonsense.


So what’s the verdict here? Is Corey wrong and Joe right? Or is Joe right and Corey wrong?

I would argue it depends. Corey’s piece emphasizes the big picture, essentially that technical changes can buy more trouble than the money they save. Joe’s piece, meanwhile, digs into the technical specifics, where real savings can be found.

1. Corey emphasizes the 30,000 foot view

Corey’s article uses a broad brush, and in doing so some of his specifics are incorrect. For example, the point about older versions of operating systems and hypervisors. In most cases your OS won’t know it’s running on a hypervisor at all, so this seems a very rare edge case indeed.

That said, his big picture conclusion seems spot on. Changing instance sizes can be a huge risk if you don’t do so regularly. With legacy apps, who knows how they might behave? Here you carry the same burden as changing instances at all: your AMI may not be ready, or you may have had some manual steps to build the box again.

Related: How can we keep cloud architectures simple

2. Joe gets down in the weeds of technical specifics

The first thing about Joe’s post that caught my attention was his use of the console to change the instance size. You shouldn’t be manually changing instances through the console to start with. What, everything is code, all the way down? Hehe…

Also his points about cost savings seemed cherry-picked. If you can save that much money by changing instance sizes, it is well worth the cost to do regression, integration & disaster recovery tests to make sure it will all work. Get on it!

Also: What hidden things does a deposit reveal?

3. Use infrastructure as code, test & retest it

IAC is really the way to go. But that still doesn’t mean you’re out of the woods. Be cautious when changing instance sizes!

With code, there is testing. So change your Terraform code and then verify it works. Now just because you have a variable indicating the instance size doesn’t mean changing it won’t break something. A sketch follows.
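
As a sketch, with the size parameterized you can at least exercise a change through plan before applying it. The variable names here are hypothetical:

variable "instance_type" {
  default = "m4.large"
}

resource "aws_instance" "app" {
  ami           = "${var.app_ami}"        # hypothetical AMI variable
  instance_type = "${var.instance_type}"
}

Then review before you touch anything:

$ terraform plan -var instance_type=m4.xlarge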

Read: Can communication mixups sour an engagement?

4. Weigh the cost savings with the risk of breaking things

Changing an instance size and redeploying could break all manner of things. It’s possible you used a variable for the instance size in some places and hard coded it in others. Or made some weird reference in an autoscaling group.

It may be the AMI you’ve built works on one type of instance but not another. Or that your AMI is deployed in one region but not another. Or that your old instance size is available in us-east-2, but your new instance size is not yet available there. Yes the console wouldn’t have offered it, but your Terraform code didn’t know.

Check out: How I use 5 daily habits to stay on track

5. Put up some guiderails

As you’ll see further down in the comments, Joe suggests putting limits around instance size changes. That makes sense. After you’ve done testing, you’ll have an idea of what these limits should be. No instance size with less than 30GB of memory? No instance size greater than X? None in region Y? Etc. One way to enforce a whitelist is sketched below.
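
One cheap way to enforce such a whitelist in 0.11-era Terraform is the lookup() trick: lookup() with no default fails the plan if the requested size isn’t a key in the map. A sketch, with hypothetical names:

variable "allowed_sizes" {
  type = "map"

  default = {
    "m4.large"  = "m4.large"
    "m4.xlarge" = "m4.xlarge"
  }
}

resource "aws_instance" "app" {
  ami = "${var.app_ami}"   # hypothetical

  # plan errors out here if var.instance_type isn't whitelisted
  instance_type = "${lookup(var.allowed_sizes, var.instance_type)}"
}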

It’ll require tweaking your terraform infrastructure code, so it’s not a free change. But it’ll pay dividends if your cost savings are in the thousands.

Also: Can daily notes help you work better with clients?



Are you as good as the public cloud?


According to Lyft’s recent public filing, they plan to spend 300 million buckaroos in the next 2.5 years on AWS.

Did I hear that right?


Perhaps that is their estimate, or the maximum amount they want to budget for. Regardless, that’s a lot of money any way you slice it. A lot of folks are commenting about how crazy that is, and how much datacenter you could build yourself with that much money.

What do you think? Is it foolhardy? Or is there a hidden wisdom here?

Here’s my take.

1. Do you have one million customers testing your datacenter?

If you’re comparing the cost of the cloud to the raw numbers of running your own datacenter, the hardware costs are not enough. You’ll need to include the ops teams & other engineers. Right, you probably guessed that.

But did you factor in the cost of a legion of testers? This is the hidden cost that commercial software carries, even while open source software gets this benefit for free.

With a public cloud like AWS you have millions of customers testing the product every day, and running into edge cases long before you do. So you get a better, more reliable service, all invisibly and for free.

Related: How can we keep cloud architectures simple

2. Do you have 66 datacenters spread across 21 regions and a free network between them?

Anybody who was building web applications in the year 2000 will remember how websites didn’t load the same for different customers. Depending on where in the world they were located, they could experience a very different user experience.

These days we assume that we can be global from day one. But how exactly do we achieve this? Remember with a public cloud, you’re getting tons of things for free, without knowing it. Moving data between AZs or regions? That’s all going across a private interconnect.

And that’s not even including the 180 nodes inside cloudfront that give you a global CDN footprint too!

Also: What hidden things does a deposit reveal?

3. Do you have an engineering team automating away job roles?

I remember the days of the DBA job role, do you? Probably not. I specialized in this for years, and there were tons of companies hiring me to help them with it. First Oracle, then MySQL, then Postgres.

Then along came Amazon RDS. Guess what, companies don’t really hire for that role anymore. They do need help with it from time to time, but not as a primary specialization.

What do I mean? Well by hosting your application on AWS, you’re benefiting from the work of teams of engineers in different departments, all expanding on APIs and automating things that those one million customers are asking for.

You’re not going to be able to innovate that well and that quickly in your own datacenter. So you’ll pay more!

Read: Can communication mixups sour an engagement?

4. Do you have APIs that tons of engineers have already written code for?

A quick peek at Terraform’s community modules on Github and you’ll probably blush. From VPCs to bastion boxes, key management to load balancers, lots of code has been written and open sourced.

By deploying on a platform that a lot of other devs are using, you’ll benefit from all this open source code. That means you won’t have to write that stuff yourself.

Sure you’ll have integration work to do, but the hidden benefit of being on a popular platform saves you money.

Check out: How I use 5 daily habits to stay on track

5. Can you do disaster recovery for free?

If you build your own datacenter, you have to buy all your capacity. So there are no spare servers sitting around waiting for your use. In the public cloud there is always spare capacity.

What that means is you can write automation code to spinup copies of your application stack in alternate regions, at the push of a button. Thus you effectively get disaster recovery for free!

Also: Can daily notes help you work better with clients?



I tried to understand Amazon EKS internals and here’s what happened


EKS is a service to run kubernetes, so you don’t have to install the software, or manage or patch it. Just like GKE on Google, kubernetes as a service is really the way to go if you want to build kubernetes apps on AWS.


So where do we get started? AWS docs are still coming together, so it’s not easy. I would start with Jerry Hargrove’s amazing EKS diagram. If a picture is worth a thousand words, this one is worth 10,000!

1. Build your EKS cluster

I already did this in Terraform. There aren’t a lot of howtos, so I wrote one.

Basically you setup the service role, the cluster, then the worker nodes. Once you’ve done that you’re ready to run the demo app.

Related: When you have to take the fall

2. Build your app spec

These are very similar to ECS tasks. You’ll need to make slight changes. mountPoints become volumeMounts, links get removed, workingDirectory becomes workingDir, and so on. Most of these changes are obvious, but the json syntax is the biggest bear you’ll wrestle with.
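
For example, a container fragment in the kubernetes spec ends up looking roughly like this (the image and paths are hypothetical):

"containers": [
  {
    "name": "myapp",
    "image": "12345678901.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
    "workingDir": "/app",
    "volumeMounts": [
      { "name": "data", "mountPath": "/data" }
    ]
  }
]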

When done do this:

$ kubectl apply -f my-controller.json

Related: When clients don’t pay

3. Build the service spec

The service is quite a bit different from an ECS service. I suggest starting from the guestbook service. Find it here.

Edit that and add your own app name & details. Then apply:

$ kubectl apply -f my-service.json

Related: Why i ask for a deposit

4. Get the endpoint and go!

$ kubectl get service -o wide

You should see the EXTERNAL-IP display a loadbalancer endpoint. Copy that into your browser and you should see your app running.

Related: Why i ask for a deposit



How do we migrate our business to the public cloud?


The public cloud is no longer a bleeding edge technology for the trailblazers. It’s mainstream now. As you think about it, you consider your customers and the SLAs they’ve come to expect.


It’s not if, but when to move to the cloud, how to get there, and how fast the transition will be.

Here are my thoughts on what to start thinking about.

1. Ramp up team, skills & paradigm thinking

Teams with experience in traditional datacenters have certain ways of architecting solutions, and thinking about problems. For example they may choose NFS servers to host objects, where in the cloud you will use object storage such as S3.

S3 has all sorts of new features, like lifecycle policies, and super super redundant eleven 9’s of durability. But your applications may need to be retrofitted to work with it, and your devs may need to learn about new features and functionality.

What about networking? This changes a lot in the cloud, with VPCs, and virtual appliances like NATs and Gateways. And what about security groups?

Interacting with this new world of cloud resources requires new skillsets and new ways of thinking. So priority one will be getting your engineering teams learning, and upgrading skills. I wrote a piece about this: How do I migrate my skills to the cloud?

Related: When you have to take the fall

2. Adapt to a new security model

With the old style datacenter, you typically have a firewall, and everything gets blocked & controlled. The new world of cloud computing uses security groups. These can be applied at the network level, across your VPC, or at the server level. And of course you can have many security groups with overlapping jurisdictions. Here’s how you setup a VPC with Terraform

So understanding how things work in the public cloud is quite new and challenging. There are ingress and egress rules, ways to audit with network flow logs, and more.

However again, it’s one thing to have the features available, it’s quite another to put them to proper use.

Related: When clients don’t pay

3. Adapt to fragile components & networks

While the public cloud collectively is extremely resilient, the individual components such as EC2 instances are decidedly not reliable. It’s expected that they can and will die frequently. It’s your job as the customer to build things in a self-healing way.

That means VPCs with multiple subnets, across availability zones (multi-az). And that means redundant instances for everything. What’s more you front your servers with load balancers (classic or application). These themselves are redundant.

Whether you are building a containerized application and deploying on ECS or a traditional auto-scaling webserver with database backend, you’ll need to plan for failure. And that means code that detects, and reacts to such failures without downtime to the end user.
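
For example, a classic load balancer with a health check might be sketched like this in Terraform. The subnet variable and the /health endpoint are hypothetical:

resource "aws_elb" "web" {
  name    = "web-elb"
  subnets = ["${var.public_subnets}"]   # hypothetical list of public subnet ids

  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  health_check {
    target              = "HTTP:80/health"   # failing instances get pulled from rotation
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}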

Related: Why i ask for a deposit

4. Build infrastructure as code

You’ve heard about devops, now it’s time to put it into practice. Building your complete stack in code is very possible with tools like Terraform. But you may have trouble along the way. I wrote I tried to write infra as code with Terraform and AWS and it didn’t go as expected

So there’s a learning curve. Both for your operations teams, who have previously called Rackspace to get a new server provisioned. And also for your business, learning what incurs an outage, and the tricky, finicky sides to managing your public cloud through code.

Related: Why i ask for a deposit

5. Audit, log & monitor

As you automate more and more pieces, you may have less confidence in the overall scope of your deployments. How many servers am I using right now? How many S3 buckets? What about elastic IPs?

As your automation can itself spinup new temporary environments, those resource counts will change from moment to moment. Even a spike in user engagement or a sudden flash sale, can change your cloud footprint in an instant.

That’s where heavy use of logging such as ELK (elasticsearch, logstash and kibana) can really help. Sure AWS offers CloudWatch and CloudTrail, but again you must put it all to good use.

Related: Why i ask for a deposit



How to setup an Amazon EKS demo with Terraform


Since EKS is pretty new, there aren’t a lot of howtos on it yet.

I wanted to follow along with Amazon’s Getting started with EKS & Kubernetes Guide.

However I didn’t want to use cloudformation. We all know Terraform is far superior!


With that I went to work getting it going. And I learned a few lessons along the way.

My steps follow the Amazon guide above pretty closely, including setting up the guestbook app. The only big difference is I’m using Terraform.

1. Create the EKS service role

Create a file called eks-iam-role.tf and add the following:

resource "aws_iam_role" "demo-cluster" {
  name = "terraform-eks-demo-cluster"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = "${aws_iam_role.demo-cluster.name}"
}

resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSServicePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
  role       = "${aws_iam_role.demo-cluster.name}"
}

Note the demo-cluster role we define here gets referenced in step #3 below, when we create the cluster.

Related: How to setup Amazon ECS with Terraform

2. Create the EKS vpc

Here’s the code to create the VPC. I’m using the Terraform community module to do this.

There are two things to notice here. One is that I reference the eks-region variable. Add this to your vars.tf: “us-east-1” or whatever you like. Also add cluster-name to your vars.tf.

Also notice the special tags. Those are super important. If you don’t tag your resources properly, kubernetes won’t be able to do its thing. Or rather EKS won’t. I had this problem early on and it is very hard to diagnose. The tags in this vpc module will propagate to subnets and security groups, which is also crucial.

#
provider "aws" {
  region = "${var.eks-region}"
}

#
module "eks-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "eks-vpc"
  cidr = "10.0.0.0/16"

  azs             = "${var.eks-azs}"
  private_subnets = "${var.eks-private-cidrs}"
  public_subnets  = "${var.eks-public-cidrs}"

  enable_nat_gateway = false
  single_nat_gateway = true

  #  reuse_nat_ips        = "${var.eks-reuse-eip}"
  enable_vpn_gateway = false

  #  external_nat_ip_ids  = ["${var.eks-nat-fixed-eip}"]
  enable_dns_hostnames = true

  tags = {
    Terraform                                   = "true"
    Environment                                 = "${var.environment_name}"
    "kubernetes.io/cluster/${var.cluster-name}" = "shared"
  }
}

resource "aws_security_group_rule" "allow_http" {
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "TCP"
  security_group_id = "${module.eks-vpc.default_security_group_id}"
  cidr_blocks       = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "allow_guestbook" {
  type              = "ingress"
  from_port         = 3000
  to_port           = 3000
  protocol          = "TCP"
  security_group_id = "${module.eks-vpc.default_security_group_id}"
  cidr_blocks       = ["0.0.0.0/0"]
}

Related: How I resolved some tough Docker problems when i was troubleshooting amazon ECS

3. Create the EKS Cluster

Creating the cluster is the short bit of terraform code below, using the aws_eks_cluster resource.

#
# main EKS terraform resource definition
#
resource "aws_eks_cluster" "eks-cluster" {
  name = "${var.cluster-name}"

  role_arn = "${aws_iam_role.demo-cluster.arn}"

  vpc_config {
    subnet_ids = ["${module.eks-vpc.public_subnets}"]
  }
}

output "endpoint" {
  value = "${aws_eks_cluster.eks-cluster.endpoint}"
}

output "kubeconfig-certificate-authority-data" {
  value = "${aws_eks_cluster.eks-cluster.certificate_authority.0.data}"
}

Related: Is Amazon too big to fail?

4. Install & configure kubectl

The AWS docs are pretty good on this point.

First you need to install the client on your local desktop. For me, I used brew, the Mac OS X package manager. You’ll also need the heptio-authenticator-aws binary. Again, refer to the AWS docs for help on this.

The main piece you will add is a directory (~/.kube); edit the file ~/.kube/config within it as follows:

apiVersion: v1
clusters:
- cluster:
    server: https://3A3C22EEF7477792E917CB0118DD3X22.yl4.us-east-1.eks.amazonaws.com
    certificate-authority-data: "a-really-really-long-string-of-characters"
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: aws
  name: aws
current-context: aws
kind: Config
preferences: {}
users:
- name: aws
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: heptio-authenticator-aws
      args:
        - "token"
        - "-i"
        - "sean-eks"
      #  - "-r"
      #  - "arn:aws:iam::12345678901:role/sean-eks-role"
      #env:
      #  - name: AWS_PROFILE
      #    value: "seancli"

Related: Is AWS too complex for small dev teams?

5. Spinup the worker nodes

This is definitely the largest file in your terraform EKS code. Let me walk you through it a bit.

First we attach some policies to our role. These are all essential to EKS. They’re predefined but you need to group them together.

Then you need to create a security group for your worker nodes. Notice this also has the special kubernetes tag added. Be sure that it’s there or you’ll have problems.

Then we add some additional ingress rules, which allow workers & the control plane of kubernetes to communicate with each other.

Next you’ll see some serious user-data code. This handles all the startup action, on the worker node instances. Notice we reference some variables here, so be sure those are defined.

Lastly we create a launch configuration, and autoscaling group. Notice we give it the AMI as defined in the aws docs. These are EKS optimized images, with all the supporting software. Notice also they are currently only available in us-east-1 and us-west-2.

Notice also that the autoscaling group has the special kubernetes tag. As I’ve been saying over and over, that’s super important.

#
# EKS Worker Nodes Resources
#  * IAM role allowing Kubernetes actions to access other AWS services
#  * EC2 Security Group to allow networking traffic
#  * Data source to fetch latest EKS worker AMI
#  * AutoScaling Launch Configuration to configure worker instances
#  * AutoScaling Group to launch worker instances
#

resource "aws_iam_role" "demo-node" {
  name = "terraform-eks-demo-node"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-lb" {
  policy_arn = "arn:aws:iam::12345678901:policy/eks-lb-policy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_instance_profile" "demo-node" {
  name = "terraform-eks-demo"
  role = "${aws_iam_role.demo-node.name}"
}

resource "aws_security_group" "demo-node" {
  name        = "terraform-eks-demo-node"
  description = "Security group for all nodes in the cluster"

  #  vpc_id      = "${aws_vpc.demo.id}"
  vpc_id = "${module.eks-vpc.vpc_id}"

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = "${
    map(
     "Name", "terraform-eks-demo-node",
     "kubernetes.io/cluster/${var.cluster-name}", "owned",
    )
  }"
}

resource "aws_security_group_rule" "demo-node-ingress-self" {
  description              = "Allow node to communicate with each other"
  from_port                = 0
  protocol                 = "-1"
  security_group_id        = "${aws_security_group.demo-node.id}"
  source_security_group_id = "${aws_security_group.demo-node.id}"
  to_port                  = 65535
  type                     = "ingress"
}

resource "aws_security_group_rule" "demo-node-ingress-cluster" {
  description              = "Allow worker Kubelets and pods to receive communication from the cluster control plane"
  from_port                = 1025
  protocol                 = "tcp"
  security_group_id        = "${aws_security_group.demo-node.id}"
  source_security_group_id = "${module.eks-vpc.default_security_group_id}"
  to_port                  = 65535
  type                     = "ingress"
}

data "aws_ami" "eks-worker" {
  filter {
    name   = "name"
    values = ["eks-worker-*"]
  }

  most_recent = true
  owners      = ["602401143452"] # Amazon
}

# EKS currently documents this required userdata for EKS worker nodes to
# properly configure Kubernetes applications on the EC2 instance.
# We utilize a Terraform local here to simplify Base64 encoding this
# information into the AutoScaling Launch Configuration.
# More information: https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-06-05/amazon-eks-nodegroup.yaml
locals {
  demo-node-userdata = <<USERDATA
#!/bin/bash -xe

CA_CERTIFICATE_DIRECTORY=/etc/kubernetes/pki
CA_CERTIFICATE_FILE_PATH=$CA_CERTIFICATE_DIRECTORY/ca.crt
mkdir -p $CA_CERTIFICATE_DIRECTORY
echo "${aws_eks_cluster.eks-cluster.certificate_authority.0.data}" | base64 -d >  $CA_CERTIFICATE_FILE_PATH
INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /var/lib/kubelet/kubeconfig
sed -i s,CLUSTER_NAME,${var.cluster-name},g /var/lib/kubelet/kubeconfig
sed -i s,REGION,${var.eks-region},g /etc/systemd/system/kubelet.service
sed -i s,MAX_PODS,20,g /etc/systemd/system/kubelet.service
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /etc/systemd/system/kubelet.service
sed -i s,INTERNAL_IP,$INTERNAL_IP,g /etc/systemd/system/kubelet.service
DNS_CLUSTER_IP=10.100.0.10
if [[ $INTERNAL_IP == 10.* ]] ; then DNS_CLUSTER_IP=172.20.0.10; fi
sed -i s,DNS_CLUSTER_IP,$DNS_CLUSTER_IP,g /etc/systemd/system/kubelet.service
sed -i s,CERTIFICATE_AUTHORITY_FILE,$CA_CERTIFICATE_FILE_PATH,g /var/lib/kubelet/kubeconfig
sed -i s,CLIENT_CA_FILE,$CA_CERTIFICATE_FILE_PATH,g  /etc/systemd/system/kubelet.service
systemctl daemon-reload
systemctl restart kubelet
USERDATA
}

resource "aws_launch_configuration" "demo" {
  associate_public_ip_address = true
  iam_instance_profile        = "${aws_iam_instance_profile.demo-node.name}"
  image_id                    = "${data.aws_ami.eks-worker.id}"
  instance_type               = "m4.large"
  name_prefix                 = "terraform-eks-demo"
  security_groups             = ["${aws_security_group.demo-node.id}"]
  user_data_base64            = "${base64encode(local.demo-node-userdata)}"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "demo" {
  desired_capacity     = 2
  launch_configuration = "${aws_launch_configuration.demo.id}"
  max_size             = 2
  min_size             = 1
  name                 = "terraform-eks-demo"

  #  vpc_zone_identifier  = ["${aws_subnet.demo.*.id}"]
  vpc_zone_identifier = ["${module.eks-vpc.public_subnets}"]

  tag {
    key                 = "Name"
    value               = "eks-worker-node"
    propagate_at_launch = true
  }

  tag {
    key                 = "kubernetes.io/cluster/${var.cluster-name}"
    value               = "owned"
    propagate_at_launch = true
  }
}

Related: How to hire a developer that doesn’t suck

6. Enable & Test worker nodes

If you haven’t already done so, apply all your above terraform:

$ terraform init
$ terraform plan
$ terraform apply

After that all runs, all your resources are created. Now edit the file “aws-auth-cm.yaml” with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::12345678901:role/terraform-eks-demo-node
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

Then apply it to your cluster:

$ kubectl apply -f aws-auth-cm.yaml

You should be able to use kubectl to view node status:

$ kubectl get nodes
NAME                           STATUS    ROLES     AGE       VERSION
ip-10-0-101-189.ec2.internal   Ready     <none>    10d       v1.10.3
ip-10-0-102-182.ec2.internal   Ready     <none>    10d       v1.10.3
$ 

Related: Why would I help a customer that’s not paying?

7. Setup guestbook app

Finally you can follow the exact steps in the AWS docs to create the app. Here they are again:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-master-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-master-service.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-slave-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-slave-service.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/guestbook-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/guestbook-service.json

Then you can get the endpoint with kubectl:

$ kubectl get services        
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP        PORT(S)          AGE
guestbook      LoadBalancer   172.20.177.126   aaaaa555ee87c...   3000:31710/TCP   4d
kubernetes     ClusterIP      172.20.0.1       <none>             443/TCP          10d
redis-master   ClusterIP      172.20.242.65    <none>             6379/TCP         4d
redis-slave    ClusterIP      172.20.163.1     <none>             6379/TCP         4d
$ 

Use “kubectl get services -o wide” to see the entire EXTERNAL-IP. If that stays in a pending state, you likely have an issue with your node iam role, or missing special kubernetes tags. So check on those. It shouldn’t take more than a minute really.

Hope you got everything working.

Good luck and if you have questions, post them in the comments & I’ll try to help out!

Related: How to migrate my skills to the cloud?



What are the key aws skills and how do you interview for them?


Whether you’re striving for a new role as a Devops engineer, or a startup looking to hire one, you’ll need to be on the lookout for specific skills.


I’ve been on both sides of the fence, at times interviewing candidates, and other times the candidate looking to impress to win a new role.

Here are my suggestions…

Devops Pipeline

Jenkins isn’t the only build server, but it’s been around a long time, so it’s everywhere. You can also do well with CircleCI or Travis. Or even Amazon’s own CodeBuild & CodePipeline.

You should also be comfortable with a configuration management system. Ansible is my personal favorite but obviously there is lots of Puppet & Chef out there too. Talk about a playbook you wrote, how it configures the server, installs packages, edits configs and restarts services.

Bonus points if you can talk about handling deployments with autoscaling groups. Those dynamic environments can’t easily be captured in static host manifests, so talk about how you handle that (one approach is sketched below).
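
For example, the classic ec2.py dynamic inventory script from the Ansible project groups instances by tag, so you can target them without a static hosts file (the tag group and playbook names are hypothetical):

$ ansible-playbook -i ec2.py -l tag_Role_webserver site.yml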

Of course you should also be strong with Git, bitbucket or codecommit. Talk about how you create a branch, what gitflow is, and when/how you tag a release.
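
Something like this, for instance:

$ git checkout -b feature/login-page      # cut a branch
$ git commit -am "add login page"
$ git checkout master
$ git merge --no-ff feature/login-page    # gitflow-style merge commit
$ git tag -a v1.4.0 -m "release 1.4.0"    # tag the release
$ git push origin master --tags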

Also be ready to talk about how a code checkin can trigger a post-commit hook, which can then go and build your application, or new infra to test your code.
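
A bare-bones sketch of such a hook, using Jenkins’ remote build trigger (the URL and token are placeholders):

#!/bin/sh
# .git/hooks/post-commit -- kick off a build after each commit
curl -s -X POST "https://jenkins.example.com/job/myapp/build?token=MYTOKEN"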

Related: How to avoid insane AWS bills

CloudFormation or Terraform

I’m partial to Terraform. Terraform is to CloudFormation as Mac OS X is to Windows, or iPhone to Android. Why do I say that? Well, it’s more polished and a nicer language to write in. CloudFormation is downright ugly. But hey, both get the job done.

Talk about some code you wrote, how you configured IAM roles and instance profiles, and how you spinup an ECS cluster with Terraform, for example.

Related: How best to do discovery in cloud and devops engagements?

AWS Services

There are lots of them. But the core services are what you should be ready to talk about. CloudWatch for centralized logging. How does it integrate with ECS or EKS?

Route53, how do you create a zone? How do you do geo load balancing? How does it integrate with CertificateManager? Can Terraform build these things?
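
Yes it can. A hedged sketch, with a hypothetical domain and ELB DNS name:

resource "aws_route53_zone" "main" {
  name = "example.com"
}

resource "aws_route53_record" "www" {
  zone_id = "${aws_route53_zone.main.zone_id}"
  name    = "www.example.com"
  type    = "CNAME"
  ttl     = "300"
  records = ["my-elb-1234567890.us-east-1.elb.amazonaws.com"]
}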

EC2 is the basic compute service. Tell me what happens when an instance dies? When it boots? What is a user-data script? How would you use one? What’s an AMI? How do you build them?
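
A user-data script is just a first-boot script run as root; something as simple as this (the packages are illustrative):

#!/bin/bash
# runs once, at first boot
yum -y update
yum -y install nginx
service nginx start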

What about virtual networking? What is a VPC? And a private subnet? What’s a public subnet? How do you deploy a NAT? What’s it for? How do security groups work?

What are S3 buckets? Talk about infrequently accessed storage. How about Glacier? What are lifecycle policies? How do you do cross region replication? How do you setup cloudfront? What’s a distribution?

What types of load balancers are there? Classic & Application are the main ones. How do they differ? ALB is smarter, it can integrate with ECS for example. What are some settings I should be concerned with? What about healthchecks?

What is Autoscaling? How do I setup EC2 instances to do this? What’s an autoscaling group? Target? How does it work with ECS? What about EKS?

Devops isn’t about writing application code, but you’re surely going to be writing jobs. What language do you like? Python and shell scripting are a start. What about Lambda? Talk about frameworks to deploy applications.

Related: Are you getting good at Terraform or wrestling with a bear?

Databases

You should have some strong database skills even if you’re not the day-to-day DBA. Amazon RDS certainly makes administering a bit easier most of the time. But upgrades often require downtime, and unfortunately that’s wired into the service. I see mostly Postgresql, MySQL & Aurora. Get comfortable tuning SQL queries and optimizing. Analyze your slow query log and walk through the output.
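
For example, on MySQL (the log path is whatever your my.cnf points at, and the query is hypothetical):

$ mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log   # top 10 queries by total time

mysql> EXPLAIN SELECT * FROM orders WHERE customer_id = 42;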

Amazon’s analytics offering is getting stronger. The purpose built Redshift is everywhere these days. It may use a postgresql driver, but there’s a lot more under the hood. You also may want to look at Spectrum, which provides an EXTERNAL TABLE style interface, to query data directly from S3.

Not on Redshift yet? Well you can use Athena as an interface directly onto your data sitting in S3. Even quicker.
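
A minimal Athena external table over data sitting in S3 looks roughly like this; the bucket, columns and delimiter are hypothetical:

CREATE EXTERNAL TABLE weblogs (
  ts     string,
  status int,
  bytes  bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-log-bucket/weblogs/';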

For larger data analysis or folks that have systems built around the technology, Hadoop deployments or EMR may be good to know as well. At least be able to talk intelligently about it.

Related: Is zero downtime even possible on RDS?

Questions

Have you written any CloudFormation templates or Terraform code? For example how do you create a VPC with private & public subnets, plus a bastion box, with Terraform? What gotchas do you run into?

If you are given a design document, how do you proceed from there? How do you build infra around those requirements? What is your first step? What questions would you ask about the doc?

What do you know about Nodejs? Or Python? Why do you prefer that language?

If you were asked to store 500 terabytes of data on AWS and were going to do analysis of the data, what would be your first choice? Why? Let’s say you evaluated S3 and Athena, and found the performance wasn’t there, what would you move to? Redshift? How would you load the data?

Describe a multi-az VPC setup that you recommend. How do you deploy multiple subnets in a high availability arrangement?

Related: Why generalists are better at scaling the web
