I tried to understand Amazon EKS internals and here’s what happened


EKS is a service for running Kubernetes, so you don’t have to install, manage, or patch the software yourself. Like GKE on Google Cloud, Kubernetes as a service is really the way to go if you want to build Kubernetes apps on AWS.

Join 38,000 others and follow Sean Hull on twitter @hullsean.

So where do we get started? AWS docs are still coming together, so it’s not easy. I would start with Jerry Hargrove’s amazing EKS diagram. If a picture is worth a thousand words, this one is worth 10,000!

1. Build your EKS cluster

I already did this in Terraform. There aren’t a lot of howtos, so I wrote one.

Basically you set up the service role, the cluster, then the worker nodes. Once you’ve done that you’re ready to run the demo app.

2. Build your app spec

These are very similar to ECS tasks. You’ll need to make slight changes: mountPoints become volumeMounts, links get removed, workingDirectory becomes workingDir, and so on. Most of these changes are obvious, but the JSON syntax is the biggest bear you’ll wrestle with.

When done do this:

$ kubectl apply -f my-controller.json

3. Build the service spec

The service is quite a bit different than an ECS service. I suggest starting from the guestbook service. Find it here

Edit that and add your own app name & details. Then apply:

$ kubectl apply -f my-service.json

4. Get the endpoint and go!

$ kubectl get service -o wide

You should see the EXTERNAL-IP display a loadbalancer endpoint. Copy that into your browser and you should see your app running.

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How do we migrate our business to the public cloud?


The public cloud is no longer a bleeding edge technology for the trailblazers. It’s mainstream now. As you think about it, you consider your customers and the SLAs they’ve come to expect.

It’s not a question of if, but of when to move to the cloud, how to get there, and how fast the transition will be.

Here are my thoughts on what to start thinking about.

1. Ramp up team, skills & paradigm thinking

Teams with experience in traditional datacenters have certain ways of architecting solutions, and thinking about problems. For example they may choose NFS servers to host objects, where in the cloud you will use object storage such as S3.

S3 has all sorts of new features, like lifecycle policies and eleven 9’s of durability. But your applications may need to be retrofitted to work with it, and your devs may need to learn about new features and functionality.

What about networking? This changes a lot in the cloud, with VPCs, and virtual appliances like NATs and Gateways. And what about security groups?

Interacting with this new world of cloud resources requires new skillsets and new ways of thinking. So priority one will be getting your engineering teams learning and upgrading their skills. I wrote a piece about this: how do I migrate my skills to the cloud?

2. Adapt to a new security model

With the old style datacenter, you typically have a firewall, and everything gets blocked & controlled. The new world of cloud computing uses security groups. These can be applied at the network level, across your VPC, or at the server level. And of course you can have many security groups with overlapping jurisdictions. Here’s how you set up a VPC with Terraform.

So understanding how things work in the public cloud is quite new and challenging. There are ingress and egress rules, ways to audit with network flow logs, and more.
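To make ingress and egress rules a little more concrete, here is a minimal Terraform sketch of a security group that lets HTTPS in and everything out. The variable vpc_id, the name web-sg and the wide-open CIDR ranges are illustrative assumptions, not something from a real deployment.

```hcl
# Hypothetical input: the ID of an existing VPC.
variable "vpc_id" {
  description = "ID of the VPC the security group belongs to"
}

resource "aws_security_group" "web" {
  name        = "web-sg" # illustrative name
  description = "Allow inbound HTTPS, allow all outbound"
  vpc_id      = "${var.vpc_id}"

  # Ingress: who may connect to instances in this group.
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Egress: where instances in this group may connect to.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1" # all protocols
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

In practice you would probably tighten the ingress range, and turn on VPC flow logs to audit what actually passes through.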

However again, it’s one thing to have the features available, it’s quite another to put them to proper use.

3. Adapt to fragile components & networks

While the public cloud collectively is extremely resilient, the individual components such as EC2 instances are decidedly not reliable. It’s expected that they can and will die frequently. It’s your job as the customer to build things in a self-healing way.

That means VPCs with multiple subnets, across availability zones (multi-az). And that means redundant instances for everything. What’s more you front your servers with load balancers (classic or application). These themselves are redundant.

Whether you are building a containerized application and deploying on ECS or a traditional auto-scaling webserver with database backend, you’ll need to plan for failure. And that means code that detects, and reacts to such failures without downtime to the end user.

4. Build infrastructure as code

You’ve heard about devops, now it’s time to put it into practice. Building your complete stack in code is very possible with tools like Terraform. But you may have trouble along the way. I wrote about that in I tried to write infra as code with Terraform and AWS and it didn’t go as expected.

So there’s a learning curve, both for your operations teams, who may previously have called Rackspace to get a new server provisioned, and for your business, which has to learn what incurs an outage and the finicky sides of managing your public cloud through code.

5. Audit, log & monitor

As you automate more and more pieces, you may have less confidence in the overall scope of your deployments. How many servers am I using right now? How many S3 buckets? What about elastic IPs?

As your automation can itself spin up new temporary environments, those resource counts will change from moment to moment. Even a spike in user engagement or a sudden flash sale can change your cloud footprint in an instant.

That’s where heavy use of logging such as ELK (elasticsearch, logstash and kibana) can really help. Sure AWS offers CloudWatch and CloudTrail, but again you must put it all to good use.


How to setup an Amazon EKS demo with Terraform


Since EKS is pretty new, there aren’t a lot of howtos on it yet.

I wanted to follow along with Amazon’s Getting started with EKS & Kubernetes Guide.

However I didn’t want to use cloudformation. We all know Terraform is far superior!

With that I went to work getting it going. And I learned a few lessons along the way.

My steps follow pretty closely with the Amazon guide above, and setting up the guestbook app. The only big difference is I’m using Terraform.

1. Create the EKS service role

Create a file called eks-iam-role.tf and add the following:

resource "aws_iam_role" "demo-cluster" {
  name = "terraform-eks-demo-cluster"

  assume_role_policy = <

Note we reference demo-cluster resource. We define that in step #3 below.
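The role needs a trust policy that lets the EKS service assume it, plus the managed cluster policy attached to it. Here is a minimal sketch of what a complete eks-iam-role.tf might look like. It assumes the standard eks.amazonaws.com service principal and the AWS-managed AmazonEKSClusterPolicy; the original file may have attached further policies, so treat this as a starting point and not as the exact code from the post.

```hcl
resource "aws_iam_role" "demo-cluster" {
  name = "terraform-eks-demo-cluster"

  # Trust policy: allow the EKS service to assume this role.
  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "eks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

# Attach the AWS-managed policy the control plane needs.
resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = "${aws_iam_role.demo-cluster.name}"
}
```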

2. Create the EKS vpc

Here's the code to create the VPC. I'm using the Terraform community module to do this.

There are two things to notice here. One is I reference eks-region variable. Add this in your vars.tf. "us-east-1" or whatever you like. Also add cluster-name to your vars.tf.

Also notice the special tags. Those are super important. If you don't tag your resources properly, Kubernetes won't be able to do its thing. Or rather EKS won't. I had this problem early on and it is very hard to diagnose. The tags in this vpc module will propagate to subnets and security groups, which is also crucial.
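For reference, a vars.tf along these lines would cover the variables mentioned above. The defaults are illustrative assumptions (the cluster name simply matches the one used in the kubeconfig later in this post). The eks-private-cidrs and eks-public-cidrs variables would need definitions too, with CIDR ranges that suit your own network.

```hcl
# vars.tf (sketch)
variable "eks-region" {
  description = "AWS region to build the EKS cluster in"
  default     = "us-east-1"
}

variable "cluster-name" {
  description = "Name of the EKS cluster, also used in the kubernetes.io/cluster tag"
  default     = "sean-eks"
}

variable "environment_name" {
  description = "Value for the Environment tag"
  default     = "dev" # illustrative default
}

variable "eks-azs" {
  description = "Availability zones for the VPC subnets"
  default     = ["us-east-1a", "us-east-1b"]
}
```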

provider "aws" {
  region = "${var.eks-region}"
}

module "eks-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "eks-vpc"
  cidr = "" # your VPC CIDR block goes here

  azs             = "${var.eks-azs}"
  private_subnets = "${var.eks-private-cidrs}"
  public_subnets  = "${var.eks-public-cidrs}"

  enable_nat_gateway = false
  single_nat_gateway = true

  #  reuse_nat_ips        = "${var.eks-reuse-eip}"
  enable_vpn_gateway = false

  #  external_nat_ip_ids  = ["${var.eks-nat-fixed-eip}"]
  enable_dns_hostnames = true

  # the kubernetes.io/cluster/<name> tag is what EKS looks for
  tags = {
    Terraform                                   = "true"
    Environment                                 = "${var.environment_name}"
    "kubernetes.io/cluster/${var.cluster-name}" = "shared"
  }
}

# allow inbound HTTP on the VPC's default security group
resource "aws_security_group_rule" "allow_http" {
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "TCP"
  security_group_id = "${module.eks-vpc.default_security_group_id}"
  cidr_blocks       = [""] # source CIDR range to allow
}

# allow inbound traffic to the guestbook app on port 3000
resource "aws_security_group_rule" "allow_guestbook" {
  type              = "ingress"
  from_port         = 3000
  to_port           = 3000
  protocol          = "TCP"
  security_group_id = "${module.eks-vpc.default_security_group_id}"
  cidr_blocks       = [""] # source CIDR range to allow
}

3. Create the EKS Cluster

Creating the cluster takes just a short bit of Terraform code, using the aws_eks_cluster resource.

# main EKS terraform resource definition
resource "aws_eks_cluster" "eks-cluster" {
  name = "${var.cluster-name}"

  role_arn = "${aws_iam_role.demo-cluster.arn}"

  vpc_config {
    subnet_ids = ["${module.eks-vpc.public_subnets}"]
  }
}

# API server endpoint, needed for the kubeconfig
output "endpoint" {
  value = "${aws_eks_cluster.eks-cluster.endpoint}"
}

# base64 CA certificate, also needed for the kubeconfig
output "kubeconfig-certificate-authority-data" {
  value = "${aws_eks_cluster.eks-cluster.certificate_authority.0.data}"
}

4. Install & configure kubectl

The AWS docs are pretty good on this point.

First you need to install the client on your local desktop. I used brew, the macOS package manager, to install it. You'll also need the heptio-authenticator-aws binary. Again refer to the AWS docs for help on this.

The main piece you will add is a directory (~/.kube). Then edit the file ~/.kube/config as follows:

apiVersion: v1
clusters:
- cluster:
    server: https://3A3C22EEF7477792E917CB0118DD3X22.yl4.us-east-1.eks.amazonaws.com
    certificate-authority-data: "a-really-really-long-string-of-characters"
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: aws
  name: aws
current-context: aws
kind: Config
preferences: {}
users:
- name: aws
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: heptio-authenticator-aws
      args:
        - "token"
        - "-i"
        - "sean-eks"
      #  - "-r"
      #  - "arn:aws:iam::12345678901:role/sean-eks-role"
      # env:
      #  - name: AWS_PROFILE
      #    value: "seancli"

5. Spinup the worker nodes

This is definitely the largest file in your terraform EKS code. Let me walk you through it a bit.

First we attach some policies to our role. These are all essential to EKS. They're predefined but you need to group them together.
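As a rough sketch of that first piece: the worker role trusts EC2, gets the three AWS-managed policies that EKS worker nodes normally need, and is wrapped in an instance profile for the launch configuration to reference. The policy names are the standard AWS-managed ones; the exact resource layout of the original file is an assumption.

```hcl
resource "aws_iam_role" "demo-node" {
  name = "terraform-eks-demo-node"

  # Trust policy: EC2 instances (the workers) may assume this role.
  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = "${aws_iam_role.demo-node.name}"
}

# The launch configuration refers to this profile, not to the role directly.
resource "aws_iam_instance_profile" "demo-node" {
  name = "terraform-eks-demo-node"
  role = "${aws_iam_role.demo-node.name}"
}
```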

Then you need to create a security group for your worker nodes. Notice this also has the special kubernetes tag added. Be sure that it is there or you'll have problems.

Then we add some additional ingress rules, which allow the workers & the Kubernetes control plane to communicate with each other.
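A sketch of the security group and its rules might look like the following. It relies on the eks-vpc module and cluster-name variable from the earlier steps. The variable cluster-security-group-id is hypothetical, standing in for whatever security group your control plane uses, and the port range for control plane traffic is an assumption borrowed from common EKS examples.

```hcl
# Hypothetical input: security group attached to the EKS control plane.
variable "cluster-security-group-id" {
  description = "ID of the control plane security group"
}

resource "aws_security_group" "demo-node" {
  name        = "terraform-eks-demo-node"
  description = "Security group for all worker nodes in the cluster"
  vpc_id      = "${module.eks-vpc.vpc_id}"

  # Workers may reach out anywhere (image pulls, AWS APIs, etc.).
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # The special kubernetes tag again.
  tags = {
    Name                                        = "terraform-eks-demo-node"
    "kubernetes.io/cluster/${var.cluster-name}" = "owned"
  }
}

# Workers may talk to each other on any port and protocol.
resource "aws_security_group_rule" "demo-node-ingress-self" {
  description              = "Allow worker nodes to communicate with each other"
  type                     = "ingress"
  from_port                = 0
  to_port                  = 65535
  protocol                 = "-1"
  security_group_id        = "${aws_security_group.demo-node.id}"
  source_security_group_id = "${aws_security_group.demo-node.id}"
}

# The control plane may reach kubelets and pods on the workers.
resource "aws_security_group_rule" "demo-node-ingress-cluster" {
  description              = "Allow the control plane to reach worker nodes"
  type                     = "ingress"
  from_port                = 1025
  to_port                  = 65535
  protocol                 = "tcp"
  security_group_id        = "${aws_security_group.demo-node.id}"
  source_security_group_id = "${var.cluster-security-group-id}"
}
```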

Next you'll see some serious user-data code. This handles all the startup action, on the worker node instances. Notice we reference some variables here, so be sure those are defined.

Lastly we create a launch configuration, and autoscaling group. Notice we give it the AMI as defined in the AWS docs. These are EKS optimized images, with all the supporting software. Notice also they are currently only available in us-east-1 and us-west-2.
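The launch configuration below refers to data.aws_ami.eks-worker. One way to define that lookup is sketched here. The name filter and the owner account ID are assumptions based on how Amazon publishes its EKS-optimized images, so check them against the current AWS docs before relying on them.

```hcl
# Look up the most recent EKS-optimized worker AMI in the current region.
data "aws_ami" "eks-worker" {
  most_recent = true

  # Assumption: account that publishes the EKS-optimized AMIs.
  owners = ["602401143452"]

  filter {
    name   = "name"
    values = ["amazon-eks-node-*"] # assumed naming pattern
  }
}
```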

Notice also that the autoscaling group has the special kubernetes tag. As I've been saying over and over, that's super important.

# EKS Worker Nodes Resources
#  * IAM role allowing Kubernetes actions to access other AWS services
#  * EC2 Security Group to allow networking traffic
#  * Data source to fetch latest EKS worker AMI
#  * AutoScaling Launch Configuration to configure worker instances
#  * AutoScaling Group to launch worker instances

resource "aws_iam_role" "demo-node" {
  name = "terraform-eks-demo-node"

  assume_role_policy = <  $CA_CERTIFICATE_FILE_PATH
INTERNAL_IP=$(curl -s
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /var/lib/kubelet/kubeconfig
sed -i s,CLUSTER_NAME,${var.cluster-name},g /var/lib/kubelet/kubeconfig
sed -i s,REGION,${var.eks-region},g /etc/systemd/system/kubelet.service
sed -i s,MAX_PODS,20,g /etc/systemd/system/kubelet.service
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /etc/systemd/system/kubelet.service
sed -i s,INTERNAL_IP,$INTERNAL_IP,g /etc/systemd/system/kubelet.service
if [[ $INTERNAL_IP == 10.* ]] ; then DNS_CLUSTER_IP=; fi
sed -i s,DNS_CLUSTER_IP,$DNS_CLUSTER_IP,g /etc/systemd/system/kubelet.service
sed -i s,CERTIFICATE_AUTHORITY_FILE,$CA_CERTIFICATE_FILE_PATH,g /var/lib/kubelet/kubeconfig
sed -i s,CLIENT_CA_FILE,$CA_CERTIFICATE_FILE_PATH,g  /etc/systemd/system/kubelet.service
systemctl daemon-reload
systemctl restart kubelet

# launch configuration: what each worker instance looks like
resource "aws_launch_configuration" "demo" {
  associate_public_ip_address = true
  iam_instance_profile        = "${aws_iam_instance_profile.demo-node.name}"
  image_id                    = "${data.aws_ami.eks-worker.id}"
  instance_type               = "m4.large"
  name_prefix                 = "terraform-eks-demo"
  security_groups             = ["${aws_security_group.demo-node.id}"]
  user_data_base64            = "${base64encode(local.demo-node-userdata)}"

  lifecycle {
    create_before_destroy = true
  }
}

# autoscaling group: how many workers, and in which subnets
resource "aws_autoscaling_group" "demo" {
  desired_capacity     = 2
  launch_configuration = "${aws_launch_configuration.demo.id}"
  max_size             = 2
  min_size             = 1
  name                 = "terraform-eks-demo"

  #  vpc_zone_identifier  = ["${aws_subnet.demo.*.id}"]
  vpc_zone_identifier = ["${module.eks-vpc.public_subnets}"]

  tag {
    key                 = "Name"
    value               = "eks-worker-node"
    propagate_at_launch = true
  }

  # the special kubernetes tag, propagated to every instance
  tag {
    key                 = "kubernetes.io/cluster/${var.cluster-name}"
    value               = "owned"
    propagate_at_launch = true
  }
}

6. Enable & Test worker nodes

If you haven't already done so, apply all your above terraform:

$ terraform init
$ terraform plan
$ terraform apply

Once that has all run and your resources are created, edit the file "aws-auth-cm.yaml" with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::12345678901:role/terraform-eks-demo-node
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

Then apply it to your cluster:

$ kubectl apply -f aws-auth-cm.yaml

You should now be able to use kubectl to view node status:

$ kubectl get nodes
NAME                           STATUS    ROLES     AGE       VERSION
ip-10-0-101-189.ec2.internal   Ready     <none>    10d       v1.10.3
ip-10-0-102-182.ec2.internal   Ready     <none>    10d       v1.10.3

7. Setup guestbook app

Finally you can follow the exact steps in the AWS docs to create the app. Here they are again:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-master-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-master-service.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-slave-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-slave-service.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/guestbook-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/guestbook-service.json

Then you can get the endpoint with kubectl:

$ kubectl get services        
NAME           TYPE           CLUSTER-IP   EXTERNAL-IP        PORT(S)          AGE
guestbook      LoadBalancer                aaaaa555ee87c...   3000:31710/TCP   4d
kubernetes     ClusterIP                   <none>             443/TCP          10d
redis-master   ClusterIP                   <none>             6379/TCP         4d
redis-slave    ClusterIP                   <none>             6379/TCP         4d

Use "kubectl get services -o wide" to see the entire EXTERNAL-IP. If that is saying <pending>, you likely have an issue with your node IAM role, or missing special kubernetes tags. So check on those. It shouldn't show <pending> for more than a minute really.

Hope you got everything working.

Good luck and if you have questions, post them in the comments & I'll try to help out!


What are the key aws skills and how do you interview for them?


Whether you’re striving for a new role as a Devops engineer, or a startup looking to hire one, you’ll need to be on the lookout for specific skills.

I’ve been on both sides of the fence, at times interviewing candidates, and other times the candidate looking to impress to win a new role.

Here are my suggestions…

Devops Pipeline

Jenkins isn’t the only build server, but it’s been around a long time, so it’s everywhere. You can also do well with CircleCI or Travis. Or even Amazon’s own CodeBuild & CodePipeline.

You should also be comfortable with a configuration management system. Ansible is my personal favorite but obviously there is lots of Puppet & Chef out there too. Talk about a playbook you wrote, how it configures the server, installs packages, edits configs and restarts services.

Bonus points if you can talk about handling deployments with autoscaling groups. Those dynamic environments can’t easily be captured in static host manifests, so talk about how you handle that.

Of course you should also be strong with Git, whether it’s hosted on Bitbucket or CodeCommit. Talk about how you create a branch, what gitflow is, and when and how you tag a release.

Also be ready to talk about how a code checkin can trigger a post commit hook, which then can go and build your application, or new infra to test your code.

CloudFormation or Terraform

I’m partial to Terraform. Terraform is to CloudFormation as MacOSX or iPhone is to Android or Windows. Why do I say that? Well it’s more polished and a nicer language to write in. CloudFormation is downright ugly. But hey both get the job done.

Talk about some code you wrote, how you configured IAM roles and instance profiles, how you spin up an ECS cluster with Terraform for example.

AWS Services

There are lots of them. But the core services are what you should be ready to talk about. CloudWatch for centralized logging. How does it integrate with ECS or EKS?

Route53, how do you create a zone? How do you do geo load balancing? How does it integrate with CertificateManager? Can Terraform build these things?

EC2 is the basic compute service. Tell me what happens when an instance dies? When it boots? What is a user-data script? How would you use one? What’s an AMI? How do you build them?

What about virtual networking? What is a VPC? And a private subnet? What’s a public subnet? How do you deploy a NAT? What’s it for? How do security groups work?

What are S3 buckets? Talk about the Infrequent Access storage class. How about Glacier? What are lifecycle policies? How do you do cross region replication? How do you set up CloudFront? What’s a distribution?

What types of load balancers are there? Classic & Application are the main ones. How do they differ? ALB is smarter, it can integrate with ECS for example. What are some settings I should be concerned with? What about healthchecks?

What is Autoscaling? How do I setup EC2 instances to do this? What’s an autoscaling group? Target? How does it work with ECS? What about EKS?

Devops isn’t about writing application code, but you’re surely going to be writing jobs. What language do you like? Python and shell scripting  are a start. What about Lambda? Talk about frameworks to deploy applications.


You should have some strong database skills even if you’re not the day-to-day DBA. Amazon RDS certainly makes administering a bit easier most of the time. But upgrades often require downtime, and unfortunately that’s wired into the service. I see mostly Postgresql, MySQL & Aurora. Get comfortable tuning SQL queries and optimizing. Analyze your slow query log and provide an output.

Amazon’s analytics offering is getting stronger. The purpose built Redshift is everywhere these days. It may use a postgresql driver, but there’s a lot more under the hood. You also may want to look at Spectrum, which provides an EXTERNAL TABLE type interface, to query data directly from S3.

Not on Redshift yet? Well you can use Athena as an interface directly onto your data sitting in S3. Even quicker.

For larger data analysis or folks that have systems built around the technology, Hadoop deployments or EMR may be good to know as well. At least be able to talk intelligently about it.


Have you written any CloudFormation templates or Terraform code? For example how do you create a VPC with private & public subnets, plus bastion box with Terraform? What gotchas do you run into?

If you are given a design document, how do you proceed from there? How do you build infra around those requirements? What is your first step? What questions would you ask about the doc?

What do you know about Nodejs? Or Python? Why do you prefer that language?

If you were asked to store 500 terabytes of data on AWS and were going to do analysis of the data what would be your first choice? Why? Let’s say you evaluated S3 and Athena, and found the performance wasn’t there, what would you move to? Redshift? How would you load the data?

Describe a multi-az VPC setup that you recommend. How do you deploy multiple subnets in a high availability arrangement?


I tried to build infrastructure as code with Terraform and Amazon. It didn’t go as I expected.


As I was building infrastructure code, I stumbled quite a few times. You hit a wall and you have to work through those confusing and frustrating moments.

Here are a few of the lessons I learned in the process of building code for AWS. It’s not easy but when you get there you can enjoy the vistas. They’re pretty amazing.

Don’t pass credentials

As you build your applications, there are moments where components need to use AWS in some way. Your webserver needs to use S3 or your ELK box needs to use CloudWatch. Maybe you want to do an RDS backup, or list EC2 instances.

However it’s not safe to pass your access_key and secret_access_key around. Those should be for your desktop only. So how best to handle this in the cloud?

IAM roles to the rescue. These are collections of privileges. The cool thing is they can be assigned at the INSTANCE LEVEL. Meaning your whole server has permissions to use said resources.

Do this by first creating a role with the privileges you want. Create a json policy document which outlines the specific rules as you see fit. Then create an instance profile for that role.

When you create your ec2 instance in Terraform, you’ll specify that instance profile. Either by ARN or if Terraform created it, by resource ID.
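Those steps can be sketched in Terraform roughly as follows. The role, policy, bucket and variable names here are hypothetical, chosen only to illustrate the chain from role to policy to instance profile to instance.

```hcl
# Hypothetical input: the AMI to launch.
variable "app_ami" {
  description = "AMI ID for the application server"
}

# 1. A role that EC2 instances are allowed to assume.
resource "aws_iam_role" "app" {
  name = "app-instance-role"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

# 2. A JSON policy document granting only what the app needs.
resource "aws_iam_role_policy" "app-s3-read" {
  name = "app-s3-read"
  role = "${aws_iam_role.app.id}"

  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    }
  ]
}
POLICY
}

# 3. An instance profile wrapping the role.
resource "aws_iam_instance_profile" "app" {
  name = "app-instance-profile"
  role = "${aws_iam_role.app.name}"
}

# 4. The instance picks up the role's permissions, no keys on disk.
resource "aws_instance" "app" {
  ami                  = "${var.app_ami}"
  instance_type        = "t2.micro"
  iam_instance_profile = "${aws_iam_instance_profile.app.name}"
}
```

Code running on that instance can then call S3 through the AWS CLI or SDKs without any access keys being configured.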

Keep passwords out of code

Even though we know it should not happen, sometimes it does. We need to be vigilant to stay on top of this problem. There are projects like Pivotal’s credential scan. This can be used to check your source files for passwords.

What about something like RDS? You’re going to need to specify a password in your Terraform code right? Wrong! You can define a variable with no default as follows:

# no default, so terraform will prompt for the value
variable "my_rds_pass" {
  description = "password for rds database"
}

When Terraform comes upon this variable in your code, but sees there is no “default” value, it will prompt you when you do “$ terraform apply”
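Here is a small sketch of how that variable might be consumed. The variable declaration is repeated so the snippet stands alone, and the database identifier, engine, size and username are illustrative assumptions.

```hcl
# No default: Terraform prompts for this at apply time.
variable "my_rds_pass" {
  description = "password for rds database"
}

resource "aws_db_instance" "demo" {
  identifier          = "demo-db" # illustrative values throughout
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "demo_admin"
  password            = "${var.my_rds_pass}"
  skip_final_snapshot = true
}
```

If you would rather not be prompted, for instance in a build job, Terraform also reads the value from an environment variable named TF_VAR_my_rds_pass. Keep in mind the password still ends up in the Terraform state file, so that file needs protecting too.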

Versioning your code

When you first start building terraform code, chances are you create a directory, and some tf files, then do your “$ terraform apply”. When you watch that infra build for the first time, it’s exciting!

After you add more components, your code gets more complex. Hopefully you’ve created a git repo to house your code. You can check & commit the files, so you have them in a safe place. But of course there’s more to the equation than this.

How do you handle multiple environments, dev, stage & production all using the same code?

That’s where modules come in. Now at the beginning you may well have a module that looks like this:

module "all-proj" {

  # local path: every environment picks up whatever is on disk
  source = "../"

  myvar = "true"
  myregion = "us-east-1"
  myami = "ami-64300001"
}

Etc and so on. That’s the first step in the right direction, however if you change your source code, all of your environments will now be using that code. They will get it as soon as you do “$ terraform apply” for each. That’s fine, but it doesn’t scale well.

Ultimately you want to manage your code like other software projects. So as you make changes, you’ll want to tag it.

So go ahead and checkin your latest changes:

# push your latest changes
$ git push origin master
# now tag it
$ git tag -a v0.1 -m "my latest coolest infra"
# now push the tags
$ git push origin v0.1

Great now you want to modify your module slightly. As follows:

module "all-proj" {

  # ref=v0.1 pins this environment to the tagged release
  source = "git::https://[email protected]/hullsean/myproj-infra.git?ref=v0.1"

  myvar = "true"
  myregion = "us-east-1"
  myami = "ami-64300001"
}

Cool! Now each dev, stage and prod can reference a different version. So you are free to work on the infra without interrupting stage or prod. When you’re ready to promote that code, checkin, tag and update stage.
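As an illustration, two environments kept in separate directories might pin different tags like this. The repository URL, the directory names and the v0.2 tag are hypothetical.

```hcl
# stage/main.tf: stays on the tested release
module "all-proj" {
  source = "git::https://example.com/myorg/myproj-infra.git?ref=v0.1"

  myvar    = "true"
  myregion = "us-east-1"
  myami    = "ami-64300001"
}

# dev/main.tf (a separate directory with its own state): tries the newer tag
# module "all-proj" {
#   source = "git::https://example.com/myorg/myproj-infra.git?ref=v0.2"
#
#   myvar    = "true"
#   myregion = "us-east-1"
#   myami    = "ami-64300001"
# }
```

Promoting a change to stage then comes down to bumping the ref in stage/main.tf and re-running terraform init, so the new module version is fetched, before terraform apply.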

You could go a step further to be more agile, and have a post-commit hook that triggers the stage terraform apply. This though requires you to build solid infra tests. Check out testinfra and terratest.

Managing RDS backups

Amazon’s RDS service is a bit weird. I wrote in the past asking Is upgrading RDS like a shit-storm that will not end?. Yes I’ve had my grievances.

My recent discovery is even more serious! Terraform wants to build infra. And it wants to be able to later destroy that infra. In the case of databases, obviously the previous state is one you want to keep. You want that to be perpetual, beyond the infra build. Obvious, no?

Apparently not to the folks at Amazon. When you destroy an RDS instance it will destroy all the old backups you created. I have no idea why anyone would want this. Certainly not as a default behavior. What’s worse you can’t copy those backups elsewhere. Why not? They’re probably sitting in S3 anyway!

You can take a final backup when you destroy an RDS instance. That’s wonderful and I recommend it. However that’s not enough. I highly suggest you take matters into your own hands. Build a script that calls pg_dump yourself, and copy those .sql or .dump files to S3 for safe keeping.
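For the final backup part, the relevant settings on the Terraform side look roughly like this. The identifiers, sizes and the db_password variable are illustrative assumptions; the point is the last two arguments.

```hcl
# Hypothetical input, supplied at apply time.
variable "db_password" {
  description = "master password for the database"
}

resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app_admin"
  password          = "${var.db_password}"

  # Keep a week of automated backups while the instance lives.
  backup_retention_period = 7

  # On terraform destroy, take one last snapshot instead of skipping it.
  skip_final_snapshot       = false
  final_snapshot_identifier = "app-db-final"
}
```

The final snapshot covers the moment of destruction only, which is why a separate pg_dump job copying to S3 is still worth having.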

When to use force_destroy on S3 buckets

As with RDS, when you create S3 buckets with your infra, you want to be able to cleanup later. But the trouble is that once you create a bucket, you’ll likely fill it with objects and files.

What then happens is when you go to do “$ terraform destroy” it will fail with an error. This makes sense as a default behavior. We don’t want data disappearing without our knowledge.

However you do want to be able to cleanup. So what to do? Two things.

Firstly, create a process, perhaps a lambda job or other bucket replication to regularly sync your s3 bucket to your permanent bucket archive location. Run that every fifteen minutes or as often as you need.

Then add a force_destroy line to your s3 bucket resource. Here’s an example s3 bucket for storing load balancer logs:

# the AWS account that the load balancer service writes logs from
data "aws_elb_service_account" "main" {}

resource "aws_s3_bucket" "lb_logs" {
  count         = "${var.create-logs-bucket ? 1 : 0}"
  force_destroy = "${var.force-destroy-logs-bucket}"
  bucket        = "${var.lb-logs-bucket}"
  acl           = "private"

  # allow the load balancer service account to write log objects
  policy = <<POLICY
{
  "Id": "Policy",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::${var.lb-logs-bucket}/*",
      "Principal": {
        "AWS": [
          "${data.aws_elb_service_account.main.arn}"
        ]
      }
    }
  ]
}
POLICY

  tags = {
    Environment = "${var.environment_name}"
  }
}


How to use terraform to setup vpc & bastion box


If you’re building infrastructure on AWS or GCP you need a sandbox in which to place your toys. That sandbox is called a VPC. It’s one of those lovely acronyms that we in the tech world take for granted.

Those letters stand for Virtual Private Cloud, one of many networks within your cloud, that serve as a firewall, controlling access to servers, applications and other resources.

1. What is it for?

VPC partitions off your cloud, allowing you to control who gets into what. A VPC typically has a private Zone and a public Zone.

Within your private Zone you’ll have 2 or more private subnets and within your public, you’ll have two or more public subnets. These each sit in different availability zones, or data centers within a region. Having at least two means you can be redundant right from the start.

2. How to setup the VPC

Terraform has some excellent community modules that help you hit the ground running. One of those facilitates creating a VPC for you. When you create your VPC, the main things you want to think about are:

o what region am I building in?
o what az’s do I want to use?
o what network cidr’s to use?

You’ll have important outputs when you build your vpc. In particular the private subnets, public subnets and default security groups, which you will reference over and over in all of your terraform code. That’s because RDS databases, ec2 instances, redis clusters and many other resources sit inside of a subnet.

module "my-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "my-vpc"
  cidr = "" # your VPC CIDR block goes here

  # two availability zones, one private and one public subnet in each
  azs             = ["us-east-1a","us-east-1b"]
  private_subnets = ["", ""]
  public_subnets  = ["", ""]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  reuse_nat_ips        = false
  enable_vpn_gateway   = false
  enable_dns_hostnames = true

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}

Note, this module can do a *lot* more. For example you can attach an unchanging or fixed IP (elastic IP in AWS terminology) to the NAT device. This is useful so that your application appears to be coming from a single box all the time. It allows upstream providers, APIs and other integrations to whitelist you, and allows your application and servers to tie into those services predictably and cleanly.
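A sketch of the fixed-IP variant, using the module's reuse_nat_ips and external_nat_ip_ids inputs. The CIDR ranges are made-up examples, and depending on your AWS provider version the aws_eip argument may need to be domain = "vpc" instead of vpc = true.

```hcl
# Allocate the elastic IP outside the module so it survives VPC rebuilds.
resource "aws_eip" "nat" {
  count = 1
  vpc   = true
}

module "my-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "my-vpc"
  cidr = "10.0.0.0/16" # illustrative range

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]     # illustrative
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"] # illustrative

  enable_nat_gateway = true
  single_nat_gateway = true

  # Hand the pre-allocated IP to the NAT gateway instead of creating a new one.
  reuse_nat_ips       = true
  external_nat_ip_ids = "${aws_eip.nat.*.id}"
}
```

Because the elastic IP is its own resource, you can tear down and rebuild the VPC while the address your partners have whitelisted stays the same.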

Also note that we created some nice tags. These tags become more and more important as you automate more of your infrastructure, because you will dig through the dashboard from time to time and can easily figure out what is what. You can also use a tag such as “monitoring = yes” to filter for resources that your monitoring system should tie into.

Related: How to use terraform to automate wordpress site deployment

3. How to add the bastion

You want to deploy all servers in private subnets. Everything, and I mean everything. That’s because the internet is a dangerous place these days. From there you provide only two ways to reach those resources. A load balancer fronts your applications, opening ports 80, 443 or other relevant ports. And a jump box fronts your ssh access.

Place the bastion box in your PUBLIC subnet, so that you can reach it from the outside internet.

Again we’re using an amazing community terraform module, which also implements another cool feature for us. Note we deploy mykey onto the box. Think of this as your master key. But you may want to give other users access to these machines. In that case, simply place their public keys into my-public-keys-bucket.

Terraform will automatically deploy a key copying job onto this box via user-data script. The job will run via cron every 15 minutes, and copy (sync rather) public keys into the authorized keys file. This will allow you to add/remove users easily.

There are of course many more sophisticated networks which would require more nuanced user control, but this method is great for starters. 🙂
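
Adding a user then comes down to putting one file into that bucket. A rough sketch of doing it from Terraform, assuming the bucket already exists. The resource name, the object key "alice.pub" and the local path are all made up for illustration, and you should check the bastion module README for the exact file naming it expects.

```hcl
# Hypothetical: upload one teammate's public key into the bucket
# that the bastion syncs its authorized keys from.
resource "aws_s3_bucket_object" "alice_key" {
  bucket = "my-public-keys-bucket"                 # bucket name used by the bastion module
  key    = "alice.pub"                             # hypothetical object name
  source = "keys/alice.pub"                        # hypothetical local path to the public key
  etag   = "${md5(file("keys/alice.pub"))}"        # re-upload when the file changes
}
```

Deleting the resource and applying again removes the object, and the user drops off the box on the next sync.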

module "my-bastion" {
  source                      = "github.com/terraform-community-modules/tf_aws_bastion_s3_keys"
  instance_type               = "t2.micro"
  ami                         = "ami-976152f2"
  region                      = "us-east-1"
  key_name                    = "mykey"
  iam_instance_profile        = "s3_readonly"
  s3_bucket_name              = "my-public-keys-bucket"
  vpc_id                      = "${module.my-vpc.vpc_id}"
  subnet_ids                  = "${module.my-vpc.public_subnets}"
  keys_update_frequency       = "*/15 * * * *"
  additional_user_data_script = "date"
  name                        = "my-bastion"
  associate_public_ip_address = true
  ssh_user                    = "ec2-user"
}

# allow ssh coming from bastion to boxes in vpc
resource "aws_security_group_rule" "allow_ssh" {
  type            = "ingress"
  from_port       = 22
  to_port         = 22
  protocol        = "tcp"
  security_group_id        = "${module.my-vpc.default_security_group_id}"
  source_security_group_id = "${module.my-bastion.security_group_id}"
}

Related: How to automate Amazon ECS and Docker with Terraform

4. Add an EC2 instance

Now that we have a bastion box in the public subnet, we can use it as a jump box to resources sitting in the private subnets.

Let’s add an ec2 instance in one of our private subnets first. Then in the test section, you can actually reach those boxes by configuring your ssh config.

Here’s the code to create an ec2 instance. Create a file testbox.tf and add these lines. Then do the usual “$ terraform plan && terraform apply”

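resource "aws_instance" "example" {
  ami           = "ami-976152f2"
  instance_type = "t2.micro"
  # subnet_id takes a single subnet, and the module output is a list,
  # so pick one of the private subnets
  subnet_id     = "${module.my-vpc.private_subnets[0]}"
  key_name      = "mykey"
}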

Related: How do I migrate my skills to the cloud?

5. Testing

In order to test, you’ll need to edit your local ssh config file. This sits in ~/.ssh/config and defines names you can use on your local machine, to hit resources out there on the internet via ssh. Each definition includes a host, an ssh key and a user.

Below we define our bastion box. With that saved to our ssh config file, we can do “$ ssh bastion” and login to it without any password. Excellent!

The second section is even cooler. Remember that our testbox sits in a private subnet, so there is no route to it from the internet at all. Even if we changed its security group to allow all ports from all source IPs, it would still not be reachable. Its address is not an internet IP; it is one only defined within the world of our private subnet.

The second section defines how to use bastion as a proxy to reach the testbox. Once that is added to our ssh config file, we can do “$ ssh testbox” and magically reach it in one hop, by using the bastion as a proxy.

Host bastion
   Hostname ec2-22-205-135-133.compute-1.amazonaws.com
   IdentityFile ~/.ssh/mykey.pem
   User ec2-user
   ForwardAgent yes

Host testbox
   IdentityFile ~/.ssh/mykey.pem
   User ec2-user
   ProxyCommand ssh bastion -W %h:%p
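
In practice the testbox entry also needs a Hostname line carrying the instance's private IP. One way to look that up, sketched as a Terraform output and assuming the aws_instance.example resource from the previous section. The output name is my own.

```hcl
# Expose the private IP of the test instance so it can be
# copied into the Hostname line of the testbox ssh entry.
output "testbox_private_ip" {
  value = "${aws_instance.example.private_ip}"
}
```

After an apply, "$ terraform output testbox_private_ip" prints the address.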

Related: Is AWS too complex for small dev teams?


How organizations can move faster with devops – a16z Sonal Chokshi interviews Nicole Forsgren & Jez Humble


We hear a lot about devops these days, and the promise is tremendous. It originally evolved out of Agile operations. But how to get those benefits at *my* organization?


How do we become a high performing organization, to move faster and build more secure and resilient systems? That’s the $64,000 question!

A16Z strikes again! Andreessen Horowitz’s epic podcast hosts world class guests around all sorts of startup & new technology topics. This week they interview Jez Humble and Nicole Forsgren. They run DORA, which stands for DevOps Research and Assessment, and shows organizations just how to get the advantages of devops in the real world.

Technology does not drive organizational performance

Check out section 16:04 in the podcast…

“the point of distinction comes from how you tie process and culture together technology through devops”

It’s the classic Amazon model. They’re running hundreds of experiments in production at any one time!

Related: The 4 letter word dividing dev and ops

Day one is short, day two is long

The first interesting quote that caught my attention was at 4:40…

“Day one is when we create all of these systems. Day two is when we deploy to production. We have to deploy and maintain forever and ever and ever. We hope that day two is really long.”

As a long time op, this really resonates with me. Think of brownfield deployments, where a wave of developers has already finished and left, and you are the one trying to manage what remains. Not easy!

Related: Why generalists are better at scaling the web

Mainframes or Kubernetes?

What about tooling? Is that important? Here’s what Jez has to say. Jump to 29:30…

“Implementing those technologies does *not* give you those outcomes. You can achieve those results with Mainframes. Equally you can use Kubernetes, Docker and microservices and not achieve those outcomes.”

Related: Is Amazon too big to fail?

Reducing Friction

Fast forward to timecode 28:45…

“Conway’s Law: Organizations which design systems are constrained to produce designs that are copies of the communication structures of these organizations.”

i.e. your software ends up shaped like the organization itself, and how it communicates. Super interesting. 🙂

Related: 6 devops interview questions


How to setup an Amazon ECS cluster with Terraform


ECS is Amazon’s Elastic Container Service. That’s greek for how you get docker containers running in the cloud. It’s sort of like Kubernetes without all the bells and whistles.

It takes a bit of getting used to, but this terraform howto should get you moving. You need an EC2 host to run your containers on, a task that defines your container image & resources, and lastly a service which tells ECS which cluster to run on and registers with the ALB if you have one.


For each of these sections, create files: roles.tf, instance.tf, task.tf, service.tf, alb.tf. What I would recommend is create the first file roles.tf, then do:

$ terraform init
$ terraform plan
$ terraform apply

Then move on to instance.tf and do the terraform apply. One by one, next task, then service, then finally alb. This way if you encounter errors, you only have one small piece to troubleshoot, rather than digging through five files for the culprit.

This howto also requires a vpc. Terraform has a very good community vpc module which will get you going in no time.

I recommend deploying in the public subnets for your first run, to avoid complexity of jump box, and private IPs for ecs instance etc.

Good luck!

May the terraform force be with you!

First setup roles

Roles are a really brilliant part of the aws stack. Inside of IAM, or Identity and Access Management, you can create roles. These are collections of privileges. I’m allowed to use this S3 bucket, but not others. I can use EC2, but not Athena. And so forth. There are some special policies already created just for ECS and you’ll need roles to use them.

These roles will be applied at the instance level, so your ecs host doesn’t have to pass credentials around. Clean. Secure. Smart!

resource "aws_iam_role" "ecs-instance-role" {
  name               = "ecs-instance-role"
  path               = "/"
  assume_role_policy = "${data.aws_iam_policy_document.ecs-instance-policy.json}"
}

data "aws_iam_policy_document" "ecs-instance-policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "ecs-instance-role-attachment" {
  role       = "${aws_iam_role.ecs-instance-role.name}"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs-instance-profile" {
  name = "ecs-instance-profile"
  path = "/"
  role = "${aws_iam_role.ecs-instance-role.id}"

  # give IAM a moment to propagate the new profile before anything uses it
  provisioner "local-exec" {
    command = "sleep 60"
  }
}

resource "aws_iam_role" "ecs-service-role" {
  name               = "ecs-service-role"
  path               = "/"
  assume_role_policy = "${data.aws_iam_policy_document.ecs-service-policy.json}"
}

resource "aws_iam_role_policy_attachment" "ecs-service-role-attachment" {
  role       = "${aws_iam_role.ecs-service-role.name}"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceRole"
}

data "aws_iam_policy_document" "ecs-service-policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs.amazonaws.com"]
    }
  }
}

Related: 30 questions to ask a serverless fanboy

Setup your ecs host instance

Next you need EC2 instances on which to run your docker containers. Turns out AWS has already built AMIs just for this purpose. They call them ECS Optimized Images. There is one unique AMI id for each region. So be sure you’re using the right one for your setup.
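
If you would rather not hardcode a per-region AMI id, AWS also publishes the recommended ECS optimized AMI as a public SSM parameter. A minimal sketch, assuming your AWS provider version includes the aws_ssm_parameter data source and that the Amazon Linux parameter path shown here is the one you want. The data source name "ecs_ami" is my own.

```hcl
# Look up the currently recommended ECS optimized AMI for the
# region the provider is configured for.
data "aws_ssm_parameter" "ecs_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux/recommended/image_id"
}

# Then, in the launch configuration, instead of a literal AMI id:
#   image_id = "${data.aws_ssm_parameter.ecs_ami.value}"
```

The tradeoff is that a newer AMI can show up in a later plan and replace your launch configuration, so pin the id if you want instances to stay put.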

The other thing that your instance needs to do is echo the cluster name to /etc/ecs/ecs.config. You can see us doing that in the user_data script section.

Lastly we’re configuring our instance inside of an auto-scaling group. That’s so we can easily add more instances dynamically to scale up or down as necessary.

# the ECS optimized AMIs change by region. You can lookup the AMI here:
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
# us-east-1 ami-aff65ad2
# us-east-2 ami-64300001
# us-west-1 ami-69677709
# us-west-2 ami-40ddb938

# need to add security group config
# so that we can ssh into an ecs host from bastion box

resource "aws_launch_configuration" "ecs-launch-configuration" {
  name                 = "ecs-launch-configuration"
  image_id             = "ami-aff65ad2"
  instance_type        = "t2.medium"
  iam_instance_profile = "${aws_iam_instance_profile.ecs-instance-profile.id}"

  root_block_device {
    volume_type           = "standard"
    volume_size           = 100
    delete_on_termination = true
  }

  lifecycle {
    create_before_destroy = true
  }

  associate_public_ip_address = "false"
  key_name                    = "testone"

  # register the cluster name with ecs-agent which will in turn coord
  # with the AWS api about the cluster
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=myecscluster >> /etc/ecs/ecs.config
EOF
}

# need an ASG so we can easily add more ecs host nodes as necessary
resource "aws_autoscaling_group" "ecs-autoscaling-group" {
  name             = "ecs-autoscaling-group"
  max_size         = "2"
  min_size         = "1"
  desired_capacity = "1"

  # vpc_zone_identifier = ["subnet-41395d29"]
  vpc_zone_identifier  = ["${module.new-vpc.private_subnets}"]
  launch_configuration = "${aws_launch_configuration.ecs-launch-configuration.name}"
  health_check_type    = "ELB"

  tag {
    key                 = "Name"
    value               = "ECS-myecscluster"
    propagate_at_launch = true
  }
}

resource "aws_ecs_cluster" "test-ecs-cluster" {
  name = "myecscluster"
}
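
The auto-scaling group on its own only holds the desired capacity. To have it add a host automatically you would pair it with a scaling policy and an alarm. What follows is only a sketch of one way to do that. The resource names, the 75% threshold and the timings are illustrative assumptions, not settings from my own cluster.

```hcl
# Add one ECS host each time the policy fires.
resource "aws_autoscaling_policy" "ecs-scale-out" {
  name                   = "ecs-scale-out"
  autoscaling_group_name = "${aws_autoscaling_group.ecs-autoscaling-group.name}"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# Fire the policy when the cluster has reserved most of its CPU.
resource "aws_cloudwatch_metric_alarm" "ecs-cpu-reservation-high" {
  alarm_name          = "ecs-cpu-reservation-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUReservation"
  statistic           = "Average"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = "75"     # illustrative value
  period              = "300"
  evaluation_periods  = "2"

  dimensions = {
    ClusterName = "${aws_ecs_cluster.test-ecs-cluster.name}"
  }

  alarm_actions = ["${aws_autoscaling_policy.ecs-scale-out.arn}"]
}
```

Keep in mind the group will never grow past its max_size, which is 2 in the config above.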

Related: Is there a serious skills shortage in the devops space?

Setup your task definition

The third thing you need is a task. This one will spinup a generic nginx container. It’s a nice way to demonstrate things. For your real world usage, you’ll replace the image line with a docker image that you’ve pushed to ECR. I’ll leave that as an exercise. Once you have the cluster working, you should get the hang of things.

Note the portmappings, memory and CPU. All things you might expect to see in a docker-compose.yml file. So these tasks should look somewhat familiar.

data "aws_ecs_task_definition" "test" {
  task_definition = "${aws_ecs_task_definition.test.family}"
  depends_on      = ["aws_ecs_task_definition.test"]
}

resource "aws_ecs_task_definition" "test" {
family = "test-family"

container_definitions = <
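
The listing above is cut off right where the container definition starts. As a rough sketch, a complete nginx task might look like the following. The container name "nginx" and container port 80 match what the service block references, and hostPort 0 asks ECS for a dynamically assigned host port. The image tag and the cpu and memory numbers are placeholders for illustration, so size them for your own app.

```hcl
resource "aws_ecs_task_definition" "test" {
  family = "test-family"

  # JSON list of containers; cpu and memory here are placeholder values
  container_definitions = <<DEFINITION
[
  {
    "name": "nginx",
    "image": "nginx:latest",
    "essential": true,
    "cpu": 128,
    "memory": 256,
    "portMappings": [
      { "containerPort": 80, "hostPort": 0 }
    ]
  }
]
DEFINITION
}
```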

Related: Is AWS too complex for small dev teams?

Setup your service definition

The fourth thing you need to do is setup a service. The task above is a manifest, describing your containers needs. It is now registered, but nothing is running.

When you apply the service your container will startup. What I like to do is, ssh into the ecs host box. Get comfortable. Then issue $ watch "docker ps". This will repeatedly run "docker ps" every two seconds. Once you have that running, do your terraform apply for this service piece.

As you watch, you'll see ECS start your container, and it will suddenly appear in your watch terminal. It will first show "starting". Once it is started, it should say "healthy".

resource "aws_ecs_service" "test-ecs-service" {
  name            = "test-vz-service"
  cluster         = "${aws_ecs_cluster.test-ecs-cluster.id}"
  task_definition = "${aws_ecs_task_definition.test.family}:${max("${aws_ecs_task_definition.test.revision}", "${data.aws_ecs_task_definition.test.revision}")}"
  desired_count   = 1
  iam_role        = "${aws_iam_role.ecs-service-role.name}"

  load_balancer {
    target_group_arn = "${aws_alb_target_group.test.id}"
    container_name   = "nginx"
    container_port   = "80"
  }

  depends_on = [
    # "aws_iam_role_policy.ecs-service",
  ]
}
Related: Does AWS have a dirty secret?

Setup your application load balancer

The above will all work by itself. However for a real-world use case, you'll want to have an ALB. This one has only a simple HTTP port 80 listener. That is much simpler than setting up 443 for SSL, so baby steps first.

Once you have the ALB going, new containers will register with the target group, to let the alb know about them. In "docker ps" you'll notice they are running on a lot of high numbered ports. These are the hostPorts which are dynamically assigned. The container ports are all 80.

resource "aws_alb_target_group" "test" {
  name     = "my-alb-group"
  port     = 80
  protocol = "HTTP"
  vpc_id   = "${module.new-vpc.vpc_id}"
}

resource "aws_alb" "main" {
  name            = "my-alb-ecs"
  subnets         = ["${module.new-vpc.public_subnets}"]
  security_groups = ["${module.new-vpc.default_security_group_id}"]
}

resource "aws_alb_listener" "front_end" {
  load_balancer_arn = "${aws_alb.main.id}"
  port              = "80"
  protocol          = "HTTP"

  default_action {
    target_group_arn = "${aws_alb_target_group.test.id}"
    type             = "forward"
  }
}

You will also want to add a domain name, so that as your infra changes, and if you rebuild your ALB, the name of your application doesn't vary. Route53 will adjust as terraform changes are applied. Pretty cool.

resource "aws_route53_record" "myapp" {
  zone_id = "${aws_route53_zone.primary.zone_id}"
  name    = "myapp.mydomain.com"
  type    = "CNAME"
  ttl     = "60"
  records = ["${aws_alb.main.dns_name}"]

  depends_on = ["aws_alb.main"]
}
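
An alternative worth knowing about is a Route53 alias record, which points straight at the ALB and also works for a bare domain where a CNAME is not allowed. This is a sketch of that variant rather than what I deployed. The resource name "myapp_alias" is made up, and you would use it instead of the CNAME record, not alongside it.

```hcl
# Alias A record pointing at the ALB; no ttl or records needed.
resource "aws_route53_record" "myapp_alias" {
  zone_id = "${aws_route53_zone.primary.zone_id}"
  name    = "myapp.mydomain.com"
  type    = "A"

  alias {
    name                   = "${aws_alb.main.dns_name}"
    zone_id                = "${aws_alb.main.zone_id}"
    evaluate_target_health = true
  }
}
```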

Related: How to deploy on EC2 with vagrant


How do we test performance in a microservices world?


I recently ran across this interesting question on a technology forum.

“I’m an engineering team lead at a startup in NYC. Our app is written in Ruby on Rails and hosted on Heroku. We use metrics such as the built-in metrics on Heroku, as well as New Relic for performance monitoring. This summer, we’re expecting a large influx of traffic from a new partnership and would like to have confidence that our system can handle the load.”

“I’ve tried to wrap my head around different types of performance/load testing tools like JMeter, Blazemeter, and others. Additionally, I’ve experimented with scripts which have grown more complex and I’m following rabbit holes of functionality within JMeter (such as loading a CSV file for dynamic user login, and using response data in subsequent requests, etc.). Ultimately, I feel this might be best left to consultants or experts who could be far more experienced and also provide our organization an opportunity to learn from them on key concepts and best practices.”


Here’s my point by point response.

I’ve been doing performance tuning since the old dot-com days.

It used to be you point a loadrunner type tool at your webpage and let it run. Then watch the load, memory & disk on your webserver or database. Before long you’d find some bottlenecks. Shortage of resources (memory, cpu, disk I/O) or slow queries were often the culprit. Optimizing queries, and ripping out those pesky ORMs usually did the trick.

Related: Why generalists are better at scaling the web

Today things are quite a bit more complicated. Yes jmeter & blazemeter are great tools. You might also get newrelic installed on your web nodes. This will give you instrumentation on where your app spends time. However it may still not be easy. With microservices, you have the docker container & orchestration layer to consider. In the AWS environment you can have bottlenecks on disk I/O where provisioned IOPS can help. But instance size also impacts network interfaces in the weird world of multi-tenant. So there’s that too!

Related: 5 things toxic to scalability

What’s more a lot of frameworks are starting to steer back towards ORMs again. Sadly this is not a good trend. On the flip side if you’re using RDS, your default MySQL or postgres settings may be decent. And newer versions of MySQL are getting some damn fancy & performant indexes. So there’s lots of improvement there.

Related: Anatomy of a performance review

There is also the question of simulating real users. What is a real user? What is an ACTIVE user? These are questions that may seem obvious, although I’ve worked at firms where engineering, product, sales & biz-dev all had different answers. But let’s say you’ve answered that. Does our load test simply log in the user? Or does it use a popular section of the site? Or how about an unpopular section of the site? Often we are guessing what “real world” users do and how they use our app.


How I resolved some tough docker problems on Amazon ECS


ECS is Amazon’s elastic container service. If you have a dockerized app, this is one way to get it deployed in the cloud. It is basically an Amazon bootleg Kubernetes clone. And not nearly as feature rich! 🙂


That said, ECS does work, and it will allow you to get your application going on Amazon. Soon enough EKS (Amazon’s Kubernetes service) will be production-ready, and we’ll all happily switch.

In the meantime, if you’re struggling with weird errors and silent failures, I have some help here for you. Hopefully these error cases are ones you’ve run into, and this helps you solve them.

1. Why is my container in a stopped state?

Containers can fail for a lot of different reasons. The litany of causes I found were:

o port mismatches
o missing links in the task definition
o shortage of resources (see #2 below)

When ecs repeatedly fails, it leaves around stopped containers. These eat up system resources, without much visible feedback. “df -k” or “df -m” doesn’t show you volumes filled up. *BUT* there are logical volumes which can fill.

Do this to see the status:

[root@ip-10-111-40-30 ~]# lvdisplay
--- Logical volume ---
LV Name docker-pool
VG Name docker
LV UUID aSSS-fEEE-d333-V999-e999-a000-t11111
LV Write Access read/write
LV Creation host, time ip-10-111-40-30, 2018-04-21 18:16:19 +0000
LV Pool metadata docker-pool_tmeta
LV Pool data docker-pool_tdata
LV Status available
# open 3
LV Size 21.73 GiB
Allocated pool data 18.81%
Allocated metadata 6.10%
Current LE 5562
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2

[root@ip-10-111-40-30 ~]#

Related: 30 questions to ask a serverless fanboy

2. Why am I getting this error “Couldn’t run containers – reason=RESOURCE:PORTS”?

I was seeing errors like this. Your first thought might be that I have multiple containers on the same port. But no I didn’t have a port conflict.

What was happening was containers were failing, but in inconsistent ways. So docker had old copies still sitting around.

On the ecs host, use “docker ps -a” to list *ALL* containers. Then use “docker system prune” to cleanup old resources.

INFO[0000] Using ECS task definition TaskDefinition="docker:5"
INFO[0000] Couldn't run containers reason="RESOURCE:PORTS"
INFO[0000] Couldn't run containers reason="RESOURCE:PORTS"
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"

INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"

Related: What’s the luckiest thing that’s happened in your career?

3. My container gets killed before fully started

When a service is run, ECS wants to have *all* of the containers running together. Just like when you use docker-compose. If one container fails, ecs-agent may decide to kill the entire service, and restart. So you may see weird things happening in “docker logs” for one container, simply because another failed. What to do?

First look at your task definition, and set “essential = false”. That way if one fails, the other will still run. So you can eliminate the working container as a cause.

Next, remember that some containers start up almost instantly, nginx for example. Because it has a very small footprint, it can start in a second or two. So if *it* depends on another container that is slow, nginx will fail. That’s because in the strange world of docker discovery, that other container doesn’t even exist yet. When nginx references it, it says hey, I don’t see the upstream server you are pointing to.

Solution? Be sure you have a “links” section in your task definition. This tells ecs-agent, that one container depends on another (think of the depends_on flag in docker-compose).
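
To make that concrete, here is a sketch of a two-container task with both settings in place, written as a Terraform resource. Every name, image and memory number in it is hypothetical. As far as I understand it, links only controls start order in bridge network mode and does not wait for the app inside the container to be ready.

```hcl
resource "aws_ecs_task_definition" "web" {
  family = "web-family"   # hypothetical name

  container_definitions = <<DEFINITION
[
  {
    "name": "app",
    "image": "my-app:latest",
    "memory": 256,
    "essential": true
  },
  {
    "name": "nginx",
    "image": "nginx:latest",
    "memory": 128,
    "essential": false,
    "links": ["app"],
    "portMappings": [
      { "containerPort": 80, "hostPort": 0 }
    ]
  }
]
DEFINITION
}
```

Here nginx is marked non-essential only while debugging, so a crash in it does not take the app container down with it. A task needs at least one essential container, so the app keeps that flag.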

Related: Curve ball interview questions and answers

4. Understanding container ordering

As you are building your ecs manifest aka task definition, you want to run through your docker-compose file carefully. Review the links, essential flags and depends_on settings. Then be sure to mirror those in your ECS task.

When in doubt, reduce the scope of your problem. That is, define *only one* container, then start the service. Once that container works, add a second. When you get that working as well, add a third, and so on.

This approach allows you to eliminate interconnecting dependencies, and related problems.

Related: Are generalists better at scaling the web?
