Did Disney have to fail?


Well it was a big day for Disney a couple of weeks ago when they launched the much anticipated Disney+ streaming service.

And of course as widely reported, they had a really bad outage.

We’ve heard the old refrain before. “We never saw traffic like this before. We were absolutely buried.”

Join 35,000 others and follow Sean Hull on twitter @hullsean.

Indeed, Disney, in their own microcosm, never saw traffic like that before. But does cloud computing know how to solve this problem? Have the software, techops and devops professions solved these problems?

Answer: Yes. Just look at Facebook, Google, Netflix or a million other high traffic sites.

Interestingly, I got a letter from a recruiter over at Disney just the other day!

Hi Sean,

I wanted to reach out again to see if you’ve had an opportunity to think about exploring a future role with Disney Streaming Services?

The company is in the midst of a massive engineering expansion, as Disney+ launched today in their US, Canadian and Dutch markets. While phased with natural issues, given the pure scope of their launch, paired with unforeseen customer engagement, the team is continuing to actively to build out their engineering team in order to ensure seamless launches in Western Europe and the Asian Pacific countries in late 2019/ early 2020.

While you may not actively be on the market, I would be interested in having a preliminary chat to see if joining and impacting one of the largest scale technical projects the international video streaming industry has ever seen, could align with future career goals?

Looking forward to your response!

**Name redacted** 

Related: Why generalists are better at scaling the web

1. Management maturity

When I was reading the article above I stopped at this line:

“This is a new ballgame for Disney”

And that tells the whole story, doesn’t it? Just because the industry has solved a problem, just because there are technologists out there who know how to do this, doesn’t mean they work at Disney! And it further doesn’t mean Disney’s management is ready for their new reality.

But they learn quickly!

Read: Infrastructure provisioning – what is it and why is it important?

2. Streaming is a solved problem

The challenge of streaming content on the internet is not new. The pipes are there, and the cloud can scale seamlessly. But yes, there are a lot of moving parts. That’s what testing is for. And for a launch like this, one could easily launch a million test clients, using aws regions and zones around the world, and point them all at your new streaming service. Don’t want to spend the money? Scale down your stack, and send a proportional amount of traffic.

If you’re not seeing the autoscaling happen quickly enough, spin up *lots* of spare compute before launch time. That’s another option. You can always scale back after the flash hits.
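For instance, with an autoscaling group you can bump the desired capacity ahead of the event, then dial it back down after. A minimal sketch with the aws CLI, assuming a hypothetical group named my-web-asg whose max size allows it:

$ aws autoscaling set-desired-capacity --auto-scaling-group-name my-web-asg --desired-capacity 50
# after the flash crowd passes, scale back down
$ aws autoscaling set-desired-capacity --auto-scaling-group-name my-web-asg --desired-capacity 5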

And yes, flash sales, daily deals and deal-a-day sites are another example. Think ideeli, or Gilt Groupe, the first unicorn.

Sites like these deal with an explosion of traffic in a short period each day. Typically 90% of their traffic occurs in a half hour to an hour of the day. So it’s a real herd that pummels the site.

Related: 6 Devops interview questions

3. Have doubts? Ask experts

To my mind, if a problem is solved, there’s no excuse to fail. But then it happens again at Airbnb, and again at Dropbox, and again at many others.

Yes we can monitor. Yes we can test. Yes we can automate. Yes we can react. But still systems fail.

As a small pitch, I’ve helped companies like Hollywood Reporter (100m uniques per month), AppNexus, ideeli and SoulCycle scale their systems for hypergrowth. It’s not easy but it can be done!

Read: Is zero downtime even possible on RDS?

4. Good problem to have?

Oscar Wilde said…

The only thing worse than being talked about is not being talked about.

And if today’s political climate is any indication, there is some prescient irony buried there.

So even though Disney+ and many other big names have failed…

Perhaps it’s a good problem to have?

Read: How to hire a developer that doesn’t suck

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

 

When I found gold in my customer archives


I’m good at keeping notes. I’ve blogged about Can progress reports & daily notes help engagements succeed. I would give that an emphatic YES!

From helping with communication, to sharing arcane details about blocking issues, struggles & hurdles, notes can illuminate things that a CTO or manager may not otherwise be aware of.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

I was digging through my archives recently, and found a bunch of notes on how to think about Terraform. In particular, how do you think about infrastructure as code? How do you architect things to make it all work together?

1. A dead end started me backtracking

You’re going to dig in by getting your application working first. To do that you’ll spin up a VPC, private & public subnets, bastion boxes, ECS hosts to deploy containers to, and an application load balancer endpoint. Getting that all working wasn’t terrible. We even included a prometheus node, to give us some monitoring visibility, and added our jenkins server into the mix. Do you see where this is going?

At a certain point we of course needed to destroy the whole setup, but we didn’t want to destroy the CI pipeline. Duh! And what about monitoring? Lose all that data each time? No way!

Read: Infrastructure provisioning – what is it and why is it important?

2. Organize around VPCs

After dragging yourself through that, you see a bit better. It’s like standing at 20,000 feet.

Your VPC is a logical collection of instances. A box that holds your application, provides security, and gets created and destroyed with it. You can even see it in your Terraform code: a subnet requires a VPC id within which to create it. And an instance requires a subnet within which to create it. For security reasons the application instances will sit within PRIVATE subnets, and only the bastion box & load balancers will sit in PUBLIC subnets.

To my mind that means each environment, DEV, STAGE and PROD, gets its own VPC. This also allows you to control who can access stage & production, as they each have their own bastion access points.

Read: How to hire a developer that doesn’t suck

3. Build a utility VPC

What you’ll also see from the above story is that you need a place for business-wide, non-application services to sit. Welcome the UTILITY VPC!

This can contain prometheus, ELK or another log collection service, your jenkins or other CI pipeline, and any other services that don’t logically fit within the application VPC.
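As a rough sketch, reusing the Terraform community VPC module, the layout might look like this. The names and CIDRs here are illustrative, not from a real setup:

module "app-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "app-vpc"
  cidr = "10.0.0.0/16"
}

# jenkins, prometheus, ELK & friends live here
module "utility-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "utility-vpc"
  cidr = "10.1.0.0/16"
}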

Related: Why generalists are better at scaling the web

4. A VPC should be ok to destroy and rebuild in another region – in one-click

When you use infrastructure code, you want to test, create & destroy often. That shouldn’t disrupt anything. That means all state data should sit outside of those instances. Logging data? Send it to logstash or cloudwatch. Application state? Keep that inside of an RDS instance. And you’ve tested those backups, right?

Speaking of RDS, I encountered problems with Amazon’s own backup & restore. I had a lot of problems, and ended up writing a custom db dump script. That may require a custom restore too, so buyer beware. Here’s my story though… I tried to build infrastructure as code with Terraform and Amazon and it didn’t go as I expected.

You may also encounter issues when you move across regions, such as elastic IPs and so forth. And you’ll need to check and verify the code which creates and destroys S3 buckets and domain name certs. These areas gave me some hiccups, but you can work through them with diligence!

Read: Is zero downtime even possible on RDS?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

I’m interviewed in a new book Devops Paradox

I had the good fortune to be interviewed earlier in the year by Viktor Farcic. I’m excited to see the book finally released. You can pick up your own copy of Devops Paradox here.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

Our interview is pretty wide ranging, touching on aspects of database administration, cloud operations, automation and devops.

1. Defining Devops

Viktor Farcic: Moving on to a more general subject, how would you define DevOps? I’ve gotten a different answer from every single person I’ve asked.

Sean Hull: I have a lot of opinions about it actually. I wrote an article on my blog a few years ago called The Four-Letter Word Dividing Dev and Ops, with the implication being that the four-letter word might be a swear word, akin to the development team swearing at the operations team, and the operations team swearing at the development team. But the four-letter word I was referring to was “risk.”
To summarize my article, in my view the development and the operations teams of old were separate silos in business, and they had very different mandates. Developers are tasked with writing code to build a product and to answer the needs of the customers, while directly building change into and facilitating a more sophisticated product. So, their thinking from day to day is about change and answering the requirements of the products team.
On the other hand, the operations team’s mandate is stability. It’s, “I don’t want these systems going down at 2:00 a.m.” So, over the long term, the operations teams are thinking about being as conservative as possible and having fewer moving parts, less code, and less new technologies. The simpler your stack is, the more reliable it is and the more robust and less likely it is to fail. I think the traditional reason why developers and operations teams were separated into silos was because of those two very different mandates.
They’re two different ways of prioritizing your work and your priorities when you think about the business and the technology. However, the downside was that those teams didn’t really communicate very well, and they were often at each other’s throats, pushing each other in opposite directions. But to answer your question, “what is DevOps?” I think of DevOps as a cultural movement that has made efforts to allow those teams to communicate better, and that’s a really good thing.

Read: What happened when I offered advice outside my pay grade?

2. How to adapt to change

Viktor Farcic: I have the impression that the speed with which new things are coming is only increasing. How do you keep up with it, and how do companies you work with keep up with all that?

Sean Hull: I don’t think they do keep up. I’ve gone to a lot of companies where they’ve never used serverless. None of their engineers know serverless at all. Lambda, web tasks, and Google Cloud functions have been out for a while, but I think there are very few companies that are able to really take advantage of them. I wrote another blog post called Is Amazon Web Services Too Complex for Small Dev Teams? where I sort of implied that it is.
I do find a lot of companies want the advantage of on-demand computing, but they really don’t have the in-house expertise yet to really take advantage of all the things that Amazon can do and offer. That’s exactly why people aren’t up to speed on the technology, as it’s just changing so quickly. I’m not sure what the answer is. For me personally, there’s definitely a lot of stuff that I don’t know. I know I’m stronger in Python than I am with Node.js. Some companies have Node.js, and you can write Lambda functions in Java, Node.js, Python, and Go. So, I think Amazon’s investment in new technology allows the platform to evolve faster than a lot of companies are able to really take advantage of it.

Read: What did Matt Ranney discover scaling Uber to 1000 microservices?

3. What does the future hold for Devops?

Viktor Farcic: I’m going to ask you a question now that I hate being asked, so you’re allowed not to answer. Where do you see the future, let’s say a year from now?

Sean Hull:
I see more fragmentation happening across the technology landscape, and I think that that is ultimately making things more fragile because, for example, with microservices, companies don’t think twice about having Ruby, Python, Node.js, and Java. They have 10 different stacks, so when you hire new people, either you have to ask them to learn all those stacks or you have to hire people with each of those individual areas of expertise. The same is true with all these different clouds with their own sets of features: there’s a fragmentation happening.
Let’s look at the iPhone as an example. Think about how complex application testing is for Android versus the iPhone. I mean, you have hundreds of different smartphones that run Android, all with different screen sizes, different hardware, different amounts of memory, and the underlying stuff. Some may even have some extra chips that others don’t have, so how do you test your application across all those different platforms?
When you have fragmentation like that, it means the applications end up not working as well. I think the same thing is happening across the technology spectrum today that happened 10 to 15 years ago, where for your database backend there was Oracle, SQL Server, MySQL, and Postgres. Maybe somebody who’s a DB2 enterprise customer uses DB2, but now there are hundreds of open source databases, graph databases, and DynamoDB versus Cassandra, and so on and so on. There’s no real deep expertise in any of those databases.
What ends up happening is you have cases like what happened with customers who were using MongoDB. They found out the hard way about all of the weird behaviors and performance problems it had, because there just weren’t people around with deep knowledge of what was happening behind the scenes, whereas in Oracle’s space, for example, there are career DBAs that are performance experts that specialize in Oracle internals, so you can hire somebody to solve particular problems in that space.
There aren’t, as far as I know, a lot of people with MongoDB internals expertise. You’d have to call MongoDB themselves; maybe they have a few engineers that they can send out, so what’s the future? I see a lot of fragmentation and complexity, and that makes the internet and internet applications more fragile, more brittle, and more prone to failure.

Related: Can humility help you in your career?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Should we right size instances in the public cloud?


I’m a big fan of Corey Quinn’s Last Week in AWS newsletter.

Recently he wrote a piece titled Right Sizing your instances is nonsense.

Not to be outdone, blogger Joe at Sunshower.io wrote a counterpoint piece, Why Right Sizing instances is not nonsense.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

So what’s the verdict here? Is Corey wrong and Joe right? Or is Joe right and Corey wrong?

I would argue it depends. Corey’s piece emphasizes the big picture, essentially that technical changes can buy more trouble than the money they save. Joe, meanwhile, digs into the technical specifics.

1. Corey emphasizes the 300 foot view

Corey’s article uses a broad brush, and in doing so some of his specifics are incorrect. For example, take the point about older versions of operating systems and hypervisors. In most cases your OS won’t know it’s running on a hypervisor at all, so this seems a very rare edge case indeed.

That said, his big picture conclusion seems spot on. Changing instance sizes can be a huge risk if you don’t do so regularly. With legacy apps, who knows how they might behave. In this you carry the same burden as replacing instances at all. Your AMI may not be ready, or you may have had some manual steps in building the box again.

Related: How can we keep cloud architectures simple

2. Joe gets down in the weeds of technical specifics

The first thing about Joe’s post that caught my attention was his use of the console to change the instance size. You shouldn’t be manually changing instances through the console to start with. What, everything is code, all the way down? Hehe…

Also his points about cost savings seemed cherry picked. If you can save that much money by changing instance sizes, it is well worth the cost to do regression, integration & disaster recovery tests to make sure it will all work. Get on it!

Also: What hidden things does a deposit reveal?

3. Use infrastructure as code, test & retest it

IAC is really the way to go. But that still doesn’t mean you’re out of the woods. Be cautious when changing instance sizes!

With code, there is testing. So change your Terraform code and then verify it works. Just because you have a variable indicating the instance size doesn’t mean changing it won’t break something.

Read: Can communication mixups sour an engagement?

4. Weigh the cost savings with the risk of breaking things

Changing an instance size and redeploying could break all manner of things. It’s possible you used a variable for the instance size in some places and hard coded it in others. Or made some weird reference in an autoscaling group.

It may be that the AMI you’ve built works on one type of instance but not another. Or that your AMI is deployed in one region but not another. Or that your old instance size is available in us-east-2, but your new instance size is not yet available there. Yes, the console wouldn’t have offered it, but your Terraform code didn’t know.

Check out: How I use 5 daily habits to stay on track

5. Put up some guiderails

As you’ll see further down in the comments, Joe suggests putting limits around instance size changes. That makes sense. After you’ve done testing, you’ll have an idea of what these limits should be. No instance size with less than 30GB of memory? No instance size greater than X. None in region Y. Etc.
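Newer versions of Terraform (0.13 and up) support validation blocks on variables, which make a natural home for guiderails like these. A minimal sketch, with a hypothetical whitelist of instance types you’ve load tested:

variable "instance_type" {
  description = "instance type for the app tier"
  default     = "m4.large"

  validation {
    # hypothetical guiderail: only allow types we have load tested
    condition     = contains(["m4.large", "m4.xlarge", "m5.large"], var.instance_type)
    error_message = "Instance type must be one we have load tested."
  }
}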

It’ll require tweaking your terraform infrastructure code, so it’s not a free change. But it’ll pay dividends if your cost savings are in the thousands.

Also: Can daily notes help you work better with clients?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to setup an Amazon EKS demo with Terraform


Since EKS is pretty new, there aren’t a lot of howtos on it yet.

I wanted to follow along with Amazon’s Getting started with EKS & Kubernetes Guide.

However I didn’t want to use cloudformation. We all know Terraform is far superior!

Join 38,000 others and follow Sean Hull on twitter @hullsean.

With that I went to work getting it going. And I learned a few lessons along the way.

My steps follow the Amazon guide above pretty closely, including setting up the guestbook app. The only big difference is I’m using Terraform.

1. Create the EKS service role

Create a file called eks-iam-role.tf and add the following:

resource "aws_iam_role" "demo-cluster" {
  name = "terraform-eks-demo-cluster"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = "${aws_iam_role.demo-cluster.name}"
}

resource "aws_iam_role_policy_attachment" "demo-cluster-AmazonEKSServicePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
  role       = "${aws_iam_role.demo-cluster.name}"
}

Note the demo-cluster role we define here gets referenced again in step #3 below, when we create the cluster itself.

Related: How to setup Amazon ECS with Terraform

2. Create the EKS vpc

Here’s the code to create the VPC. I’m using the Terraform community module to do this.

There are two things to notice here. One is that I reference the eks-region variable. Add this in your vars.tf. “us-east-1” or whatever you like. Also add cluster-name to your vars.tf. The module references a few more variables as well, eks-azs, eks-private-cidrs, eks-public-cidrs and environment_name, so define those too.

Also notice the special tags. Those are super important. If you don’t tag your resources properly, kubernetes won’t be able to do its thing. Or rather EKS won’t. I had this problem early on and it is very hard to diagnose. The tags in this vpc module will propagate to the subnets and security groups, which is also crucial.

#
provider "aws" {
  region = "${var.eks-region}"
}

#
module "eks-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "eks-vpc"
  cidr = "10.0.0.0/16"

  azs             = "${var.eks-azs}"
  private_subnets = "${var.eks-private-cidrs}"
  public_subnets  = "${var.eks-public-cidrs}"

  enable_nat_gateway = false
  single_nat_gateway = true

  #  reuse_nat_ips        = "${var.eks-reuse-eip}"
  enable_vpn_gateway = false

  #  external_nat_ip_ids  = ["${var.eks-nat-fixed-eip}"]
  enable_dns_hostnames = true

  tags = {
    Terraform                                   = "true"
    Environment                                 = "${var.environment_name}"
    "kubernetes.io/cluster/${var.cluster-name}" = "shared"
  }
}

resource "aws_security_group_rule" "allow_http" {
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "TCP"
  security_group_id = "${module.eks-vpc.default_security_group_id}"
  cidr_blocks       = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "allow_guestbook" {
  type              = "ingress"
  from_port         = 3000
  to_port           = 3000
  protocol          = "TCP"
  security_group_id = "${module.eks-vpc.default_security_group_id}"
  cidr_blocks       = ["0.0.0.0/0"]
}

Related: How I resolved some tough Docker problems when i was troubleshooting amazon ECS

3. Create the EKS Cluster

Creating the cluster is the short bit of terraform code below, using the aws_eks_cluster resource.

#
# main EKS terraform resource definition
#
resource "aws_eks_cluster" "eks-cluster" {
  name = "${var.cluster-name}"

  role_arn = "${aws_iam_role.demo-cluster.arn}"

  vpc_config {
    subnet_ids = ["${module.eks-vpc.public_subnets}"]
  }
}

output "endpoint" {
  value = "${aws_eks_cluster.eks-cluster.endpoint}"
}

output "kubeconfig-certificate-authority-data" {
  value = "${aws_eks_cluster.eks-cluster.certificate_authority.0.data}"
}

Related: Is Amazon too big to fail?

4. Install & configure kubectl

The AWS docs are pretty good on this point.

First you need to install the client on your local desktop. For me, I used brew, the Mac OS X package manager. You’ll also need the heptio-authenticator-aws binary. Again, refer to the aws docs for help on this.

The main piece you will add is a directory (~/.kube) and edit this file ~/.kube/config as follows:

apiVersion: v1
clusters:
- cluster:
    server: https://3A3C22EEF7477792E917CB0118DD3X22.yl4.us-east-1.eks.amazonaws.com
    certificate-authority-data: "a-really-really-long-string-of-characters"
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: aws
  name: aws
current-context: aws
kind: Config
preferences: {}
users:
- name: aws
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: heptio-authenticator-aws
      args:
        - "token"
        - "-i"
        - "sean-eks"
      #  - "-r"
      #  - "arn:aws:iam::12345678901:role/sean-eks-role"
      #env:
      #  - name: AWS_PROFILE
      #    value: "seancli"%  

Related: Is AWS too complex for small dev teams?

5. Spinup the worker nodes

This is definitely the largest file in your terraform EKS code. Let me walk you through it a bit.

First we attach some policies to our role. These are all essential to EKS. They’re predefined but you need to group them together.

Then you need to create a security group for your worker nodes. Notice this also has the special kubernetes tag added. Be sure that it’s there or you’ll have problems.

Then we add some additional ingress rules, which allow the workers & the kubernetes control plane to communicate with each other.

Next you’ll see some serious user-data code. This handles all the startup action, on the worker node instances. Notice we reference some variables here, so be sure those are defined.

Lastly we create a launch configuration and an autoscaling group. Notice we give it the AMI as defined in the aws docs. These are EKS optimized images, with all the supporting software. Notice also they are currently only available in us-east-1 and us-west-2.

Notice also that the autoscaling group has the special kubernetes tag. As I’ve been saying over and over, that’s super important.

#
# EKS Worker Nodes Resources
#  * IAM role allowing Kubernetes actions to access other AWS services
#  * EC2 Security Group to allow networking traffic
#  * Data source to fetch latest EKS worker AMI
#  * AutoScaling Launch Configuration to configure worker instances
#  * AutoScaling Group to launch worker instances
#

resource "aws_iam_role" "demo-node" {
  name = "terraform-eks-demo-node"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_role_policy_attachment" "demo-node-lb" {
  policy_arn = "arn:aws:iam::12345678901:policy/eks-lb-policy"
  role       = "${aws_iam_role.demo-node.name}"
}

resource "aws_iam_instance_profile" "demo-node" {
  name = "terraform-eks-demo"
  role = "${aws_iam_role.demo-node.name}"
}

resource "aws_security_group" "demo-node" {
  name        = "terraform-eks-demo-node"
  description = "Security group for all nodes in the cluster"

  #  vpc_id      = "${aws_vpc.demo.id}"
  vpc_id = "${module.eks-vpc.vpc_id}"

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = "${
    map(
     "Name", "terraform-eks-demo-node",
     "kubernetes.io/cluster/${var.cluster-name}", "owned",
    )
  }"
}

resource "aws_security_group_rule" "demo-node-ingress-self" {
  description              = "Allow node to communicate with each other"
  from_port                = 0
  protocol                 = "-1"
  security_group_id        = "${aws_security_group.demo-node.id}"
  source_security_group_id = "${aws_security_group.demo-node.id}"
  to_port                  = 65535
  type                     = "ingress"
}

resource "aws_security_group_rule" "demo-node-ingress-cluster" {
  description              = "Allow worker Kubelets and pods to receive communication from the cluster control plane"
  from_port                = 1025
  protocol                 = "tcp"
  security_group_id        = "${aws_security_group.demo-node.id}"
  source_security_group_id = "${module.eks-vpc.default_security_group_id}"
  to_port                  = 65535
  type                     = "ingress"
}

data "aws_ami" "eks-worker" {
  filter {
    name   = "name"
    values = ["eks-worker-*"]
  }

  most_recent = true
  owners      = ["602401143452"] # Amazon
}

# EKS currently documents this required userdata for EKS worker nodes to
# properly configure Kubernetes applications on the EC2 instance.
# We utilize a Terraform local here to simplify Base64 encoding this
# information into the AutoScaling Launch Configuration.
# More information: https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-06-05/amazon-eks-nodegroup.yaml
locals {
  demo-node-userdata = <<USERDATA
#!/bin/bash -xe

CA_CERTIFICATE_DIRECTORY=/etc/kubernetes/pki
CA_CERTIFICATE_FILE_PATH=$CA_CERTIFICATE_DIRECTORY/ca.crt
mkdir -p $CA_CERTIFICATE_DIRECTORY
echo "${aws_eks_cluster.eks-cluster.certificate_authority.0.data}" | base64 -d >  $CA_CERTIFICATE_FILE_PATH
INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /var/lib/kubelet/kubeconfig
sed -i s,CLUSTER_NAME,${var.cluster-name},g /var/lib/kubelet/kubeconfig
sed -i s,REGION,${var.eks-region},g /etc/systemd/system/kubelet.service
sed -i s,MAX_PODS,20,g /etc/systemd/system/kubelet.service
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /etc/systemd/system/kubelet.service
sed -i s,INTERNAL_IP,$INTERNAL_IP,g /etc/systemd/system/kubelet.service
DNS_CLUSTER_IP=10.100.0.10
if [[ $INTERNAL_IP == 10.* ]] ; then DNS_CLUSTER_IP=172.20.0.10; fi
sed -i s,DNS_CLUSTER_IP,$DNS_CLUSTER_IP,g /etc/systemd/system/kubelet.service
sed -i s,CERTIFICATE_AUTHORITY_FILE,$CA_CERTIFICATE_FILE_PATH,g /var/lib/kubelet/kubeconfig
sed -i s,CLIENT_CA_FILE,$CA_CERTIFICATE_FILE_PATH,g  /etc/systemd/system/kubelet.service
systemctl daemon-reload
systemctl restart kubelet
USERDATA
}

resource "aws_launch_configuration" "demo" {
  associate_public_ip_address = true
  iam_instance_profile        = "${aws_iam_instance_profile.demo-node.name}"
  image_id                    = "${data.aws_ami.eks-worker.id}"
  instance_type               = "m4.large"
  name_prefix                 = "terraform-eks-demo"
  security_groups             = ["${aws_security_group.demo-node.id}"]
  user_data_base64            = "${base64encode(local.demo-node-userdata)}"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "demo" {
  desired_capacity     = 2
  launch_configuration = "${aws_launch_configuration.demo.id}"
  max_size             = 2
  min_size             = 1
  name                 = "terraform-eks-demo"

  #  vpc_zone_identifier  = ["${aws_subnet.demo.*.id}"]
  vpc_zone_identifier = ["${module.eks-vpc.public_subnets}"]

  tag {
    key                 = "Name"
    value               = "eks-worker-node"
    propagate_at_launch = true
  }

  tag {
    key                 = "kubernetes.io/cluster/${var.cluster-name}"
    value               = "owned"
    propagate_at_launch = true
  }
}

Related: How to hire a developer that doesn’t suck

6. Enable & Test worker nodes

If you haven’t already done so, apply all your above terraform:

$ terraform init
$ terraform plan
$ terraform apply

After that all runs, all your resources are created. Now edit the file “aws-auth-cm.yaml” with the following contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::12345678901:role/terraform-eks-demo-node
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

Then apply it to your cluster:

$ kubectl apply -f aws-auth-cm.yaml

You should be able to use kubectl to view node status:

$ kubectl get nodes
NAME                           STATUS    ROLES     AGE       VERSION
ip-10-0-101-189.ec2.internal   Ready     <none>    10d       v1.10.3
ip-10-0-102-182.ec2.internal   Ready     <none>    10d       v1.10.3
$ 

Related: Why would I help a customer that’s not paying?

7. Setup guestbook app

Finally you can follow the exact steps in the AWS docs to create the app. Here they are again:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-master-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-master-service.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-slave-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/redis-slave-service.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/guestbook-controller.json
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/v1.10.3/examples/guestbook-go/guestbook-service.json

Then you can get the endpoint with kubectl:

$ kubectl get services        
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP        PORT(S)          AGE
guestbook      LoadBalancer   172.20.177.126   aaaaa555ee87c...   3000:31710/TCP   4d
kubernetes     ClusterIP      172.20.0.1       <none>             443/TCP          10d
redis-master   ClusterIP      172.20.242.65    <none>             6379/TCP         4d
redis-slave    ClusterIP      172.20.163.1     <none>             6379/TCP         4d
$ 

Use “kubectl get services -o wide” to see the entire EXTERNAL-IP. If that is still saying <pending>, you likely have an issue with your node IAM role, or missing special kubernetes tags. So check on those. It shouldn’t stay pending for more than a minute really.

Hope you got everything working.

Good luck and if you have questions, post them in the comments & I’ll try to help out!

Related: How to migrate my skills to the cloud?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

I tried to build infrastructure as code with Terraform and Amazon. It didn’t go as I expected.


As I was building infrastructure code, I stumbled quite a few times. You hit a wall and you have to work through those confusing and frustrating moments.

Join 38,000 others and follow Sean Hull on twitter @hullsean.

Here are a few of the lessons I learned in the process of building code for AWS. It’s not easy but when you get there you can enjoy the vistas. They’re pretty amazing.

Don’t pass credentials

As you build your applications, there are moments where components need to use AWS in some way. Your webserver needs to use S3 or your ELK box needs to use CloudWatch. Maybe you want to do an RDS backup, or list EC2 instances.

However it’s not safe to pass your access_key and secret_access_key around. Those should be for your desktop only. So how best to handle this in the cloud?

IAM roles to the rescue. These are collections of privileges. The cool thing is they can be assigned at the INSTANCE LEVEL. Meaning your whole server has permissions to use said resources.

Do this by first creating a role with the privileges you want. Create a json policy document which outlines the specific rules as you see fit. Then create an instance profile for that role.

When you create your ec2 instance in Terraform, you’ll specify that instance profile. Either by ARN or if Terraform created it, by resource ID.
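Here’s a minimal sketch of that pattern in Terraform. The role, profile and instance names are illustrative, and the actual privilege policy attachment is left out for brevity:

data "aws_iam_policy_document" "app-assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "app" {
  name               = "app-role"
  assume_role_policy = "${data.aws_iam_policy_document.app-assume.json}"
}

# the instance profile wraps the role
resource "aws_iam_instance_profile" "app" {
  name = "app-profile"
  role = "${aws_iam_role.app.name}"
}

# attach the profile when creating the instance
resource "aws_instance" "app" {
  ami                  = "ami-64300001" # illustrative AMI id
  instance_type        = "t2.medium"
  iam_instance_profile = "${aws_iam_instance_profile.app.name}"
}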

Related: How to avoid insane AWS bills

Keep passwords out of code

Even though we know it should not happen, sometimes it does. We need to be vigilant to stay on top of this problem. There are projects like Pivotal’s credential scan. This can be used to check your source files for passwords.

What about something like RDS? You’re going to need to specify a password in your Terraform code right? Wrong! You can define a variable with no default as follows:

variable "my_rds_pass" {
  description = "password for rds database"
}

When Terraform comes upon this variable in your code, but sees there is no “default” value, it will prompt you when you do “$ terraform apply”
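You can also feed it through an environment variable, since Terraform picks up anything prefixed with TF_VAR_. That keeps the password out of your code entirely:

$ export TF_VAR_my_rds_pass=mysecretpass
$ terraform apply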

Related: How best to do discovery in cloud and devops engagements?

Versioning your code

When you first start building terraform code, chances are you create a directory, and some tf files, then do your “$ terraform apply”. When you watch that infra build for the first time, it’s exciting!

After you add more components, your code gets more complex. Hopefully you’ve created a git repo to house your code. You can check in & commit the files, so you have them in a safe place. But of course there’s more to the equation than this.

How do you handle multiple environments, dev, stage & production all using the same code?

That’s where modules come in. Now at the beginning you may well have a module that looks like this:

module "all-proj" {

  source = "../"

  myvar = "true"
  myregion = "us-east-1"
  myami = "ami-64300001"
}

And so on. That’s the first step in the right direction. However, if you change your source code, all of your environments will now be using that code. They will get it as soon as you do “$ terraform apply” for each. That’s fine, but it doesn’t scale well.

Ultimately you want to manage your code like other software projects. So as you make changes, you’ll want to tag it.

So go ahead and check in your latest changes:

# push your latest changes
$ git push origin master
# now tag it
$ git tag -a v0.1 -m "my latest coolest infra"
# now push the tags
$ git push origin v0.1

Great now you want to modify your module slightly. As follows:

module "all-proj" {

  source = "git::https://[email protected]/hullsean/myproj-infra.git?ref=v0.1"

  myvar = "true"
  myregion = "us-east-1"
  myami = "ami-64300001"
}

Cool! Now each dev, stage and prod can reference a different version. So you are free to work on the infra without interrupting stage or prod. When you’re ready to promote that code, check it in, tag it and update stage.

You could go a step further to be more agile, and have a post-commit hook that triggers the stage terraform apply. This though requires you to build solid infra tests. Check out testinfra and terratest.

Related: Are you getting good at Terraform or wrestling with a bear?

Managing RDS backups

Amazon’s RDS service is a bit weird. I wrote in the past asking Is upgrading RDS like a shit-storm that will not end?. Yes I’ve had my grievances.

My recent discovery is even more serious! Terraform wants to build infra. And it wants to be able to later destroy that infra. In the case of databases, obviously the data is something you want to keep. You want that to be perpetual, beyond the infra build. Obvious, no?

Apparently not to the folks at Amazon. When you destroy an RDS instance it will destroy all the old backups you created. I have no idea why anyone would want this. Certainly not as a default behavior. What’s worse, you can’t copy those backups elsewhere. Why not? They’re probably sitting in S3 anyway!

While you can take a final backup when you destroy an RDS instance, and that’s wonderful and I recommend it, it’s not enough. I highly suggest you take matters into your own hands. Build a script that calls pg_dump yourself, and copy those .sql or .dump files to S3 for safe keeping.
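A minimal sketch of such a script follows. The host, user and bucket are placeholders, it assumes a .pgpass entry or PGPASSWORD in the environment, and you’ll want to test the matching pg_restore path too:

#!/bin/bash -e
# hypothetical connection details & archive bucket
DB_HOST=mydb.xxxxxx.us-east-1.rds.amazonaws.com
DB_NAME=myapp
BUCKET=s3://my-permanent-db-archive

STAMP=$(date +%Y%m%d%H%M)
# custom format dump, restorable later with pg_restore
pg_dump -h $DB_HOST -U myuser -Fc $DB_NAME > /tmp/$DB_NAME-$STAMP.dump
aws s3 cp /tmp/$DB_NAME-$STAMP.dump $BUCKET/
rm /tmp/$DB_NAME-$STAMP.dump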

Related: Is zero downtime even possible on RDS?

When to use force_destroy on S3 buckets

As with RDS, when you create S3 buckets with your infra, you want to be able to cleanup later. But the trouble is that once you create a bucket, you’ll likely fill it with objects and files.

What then happens is when you go to do “$ terraform destroy” it will fail with an error. This makes sense as a default behavior. We don’t want data disappearing without our knowledge.

However you do want to be able to cleanup. So what to do? Two things.

Firstly, create a process, perhaps a lambda job or bucket replication, to regularly sync your s3 bucket to your permanent bucket archive location. Run that every fifteen minutes or as often as you need.
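The simplest version of that sync might just be a cron job wrapping the aws CLI. Bucket names here are placeholders:

# crontab entry: sync every fifteen minutes (hypothetical buckets)
*/15 * * * * aws s3 sync s3://my-app-bucket s3://my-archive-bucket --quiet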

Then add a force_destroy line to your s3 bucket resource. Here’s an example s3 bucket for storing load balancer logs:

data "aws_elb_service_account" "main" {}

resource "aws_s3_bucket" "lb_logs" {
  count         = "${var.create-logs-bucket ? 1 : 0}"
  force_destroy = "${var.force-destroy-logs-bucket}"
  bucket        = "${var.lb-logs-bucket}"
  acl           = "private"

  policy = <<POLICY
{
  "Id": "Policy",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::${var.lb-logs-bucket}/*",
      "Principal": {
        "AWS": [
          "${data.aws_elb_service_account.main.arn}"
        ]
      }
    }
  ]
}
POLICY

  tags {
    Environment = "${var.environment_name}"
  }
}


Related: Why generalists are better at scaling the web

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to avoid insane AWS bills


I was flipping through the aws news recently and ran into this article by Juan Ramallo – I was billed 14k on AWS!

Scary stuff!

Join 38,000 others and follow Sean Hull on twitter @hullsean.

When you see headlines like this, your first instinct as a CTO is probably, “Am I at risk?” And then “What are the chances of this happening to me?”

Truth can be stranger than fiction. Our efforts as devops should be towards mitigating risk, and reducing potential for these kinds of things to happen.

1. Use aws instance profiles instead

Those credentials that aws provides are great for enabling the awscli. That’s because you control your desktop tightly. Don’t you?

But passing them around in your application code is prone to trouble. Eventually they’ll end up in a git repo. Not good!

The solution is applying aws IAM permissions at the instance level. That’s right, you can grant an instance permissions to read or write an s3 bucket, describe instances, create & write to dynamodb, or anything else in aws. The entire cloud is api configurable. You create a custom policy for your instance, and attach it to a named instance profile.

When you spinup your EC2 instance, or later modify it, you attach that instance profile, and voila! The instance has those permissions! No messy credentials required!
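If you’re doing it by hand rather than in Terraform, the aws CLI steps look roughly like this. The names and instance id are illustrative:

$ aws iam create-instance-profile --instance-profile-name app-profile
$ aws iam add-role-to-instance-profile --instance-profile-name app-profile --role-name app-role
$ aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 --iam-instance-profile Name=app-profile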

Related: Is Amazon too big to fail?

2. Enable 2 factor authentication

If you haven’t already, you should force 2 factor authentication on all of your IAM users. It’s an extra step, but well worth it. Here’s how to set it up.

Mobile phones support all sorts of 2FA apps now, from Duo, to Authenticator, and many more.

Related: Is AWS too complex for small dev teams?

3. Scan your code for credentials

Encourage developers to use tools like Pivotal’s Credentials Scan.

Hey, while you’re at it, why not add a post commit hook to your code repo in git. Have it run the credentials scan each time code is committed. And when it finds trouble, it should email out the whole team.

This will get everybody on board quick!
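A crude sketch of the blocking flavor, a pre-commit hook grepping for AWS access key ids, might look like this. It’s no substitute for a real scanning tool, and the emailing post-commit variant is left as an exercise:

#!/bin/bash
# .git/hooks/pre-commit - refuse commits that look like they contain AWS keys
if git diff --cached | grep -qE 'AKIA[0-9A-Z]{16}'; then
  echo "Possible AWS credential in commit! Aborting."
  exit 1
fi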

Related: Are we fast approaching cloud-mageddon?

4. Scan your S3 Buckets

Open S3 buckets can be a real disaster, offering up your private assets & business data to the world. What to do about it?

Scan your S3 buckets regularly.

Also you can tie in this scanning process to a monitoring alert. That way as soon as an errant bucket is found, you’re notified of the problem. Better safe than sorry!
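A basic scan can be stitched together from the aws CLI, checking each bucket’s ACL for the AllUsers group. A sketch, not a hardened tool:

#!/bin/bash
# flag any bucket granting access to the AllUsers group
for b in $(aws s3api list-buckets --query 'Buckets[].Name' --output text); do
  grants=$(aws s3api get-bucket-acl --bucket $b \
    --query 'Grants[?Grantee.URI==`http://acs.amazonaws.com/groups/global/AllUsers`]' \
    --output text)
  if [ -n "$grants" ]; then
    echo "WARNING: bucket $b is open to the world"
  fi
done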

Related: Which tech do startups use most?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

How to setup an Amazon ECS cluster with Terraform


ECS is Amazon’s Elastic Container Service. That’s greek for how you get docker containers running in the cloud. It’s sort of like Kubernetes without all the bells and whistles.

It takes a bit of getting used to, but this Terraform howto should get you moving. You need an EC2 host to run your containers on, a task that defines your container image & resources, and lastly a service which tells ECS which cluster to run on and registers with the ALB if you have one.

Join 38,000 others and follow Sean Hull on twitter @hullsean.

For each of these sections, create files: roles.tf, instance.tf, task.tf, service.tf, alb.tf. What I would recommend is to create the first file, roles.tf, then do:


$ terraform init
$ terraform plan
$ terraform apply

Then move on to instance.tf and do the terraform apply. One by one: next task, then service, then finally alb. This way if you encounter errors, you can troubleshoot minimally, rather than digging through five files for the culprit.

This howto also requires a vpc. Terraform has a very good community vpc module which will get you going in no time.

I recommend deploying in the public subnets for your first run, to avoid the complexity of a jump box, private IPs for the ecs instances, and so on.

Good luck!

May the terraform force be with you!

First setup roles

Roles are a really brilliant part of the aws stack. Inside of IAM or identity access and management, you can create roles. These are collections of privileges. I’m allowed to use this S3 bucket, but not others. I can use EC2, but not Athena. And so forth. There are some special policies already created just for ECS and you’ll need roles to use them.

These roles will be applied at the instance level, so your ecs host doesn’t have to pass credentials around. Clean. Secure. Smart!


resource "aws_iam_role" "ecs-instance-role" {
name = "ecs-instance-role"
path = "/"
assume_role_policy = "${data.aws_iam_policy_document.ecs-instance-policy.json}"
}

data "aws_iam_policy_document" "ecs-instance-policy" {
statement {
actions = ["sts:AssumeRole"]

principals {
type = "Service"
identifiers = ["ec2.amazonaws.com"]
}
}
}

resource "aws_iam_role_policy_attachment" "ecs-instance-role-attachment" {
role = "${aws_iam_role.ecs-instance-role.name}"
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs-instance-profile" {
name = "ecs-instance-profile"
path = "/"
role = "${aws_iam_role.ecs-instance-role.id}"
provisioner "local-exec" {
command = "sleep 60"
}
}

resource "aws_iam_role" "ecs-service-role" {
name = "ecs-service-role"
path = "/"
assume_role_policy = "${data.aws_iam_policy_document.ecs-service-policy.json}"
}

resource "aws_iam_role_policy_attachment" "ecs-service-role-attachment" {
role = "${aws_iam_role.ecs-service-role.name}"
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceRole"
}

data "aws_iam_policy_document" "ecs-service-policy" {
statement {
actions = ["sts:AssumeRole"]

principals {
type = "Service"
identifiers = ["ecs.amazonaws.com"]
}
}
}

Related: 30 questions to ask a serverless fanboy

Setup your ecs host instance

Next you need EC2 instances on which to run your docker containers. Turns out AWS has already built AMIs just for this purpose. They call them ECS Optimized Images. There is one unique AMI id for each region. So be sure you’re using the right one for your setup.

The other thing that your instance needs to do is echo the cluster name to /etc/ecs/ecs.config. You can see us doing that in the user_data script section.

Lastly we’re configuring our instance inside of an auto-scaling group. That’s so we can easily add more instances dynamically to scale up or down as necessary.


#
# the ECS optimized AMI's change by region. You can lookup the AMI here:
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
#
# us-east-1 ami-aff65ad2
# us-east-2 ami-64300001
# us-west-1 ami-69677709
# us-west-2 ami-40ddb938
#

#
# need to add security group config
# so that we can ssh into an ecs host from bastion box
#

resource "aws_launch_configuration" "ecs-launch-configuration" {
name = "ecs-launch-configuration"
image_id = "ami-aff65ad2"
instance_type = "t2.medium"
iam_instance_profile = "${aws_iam_instance_profile.ecs-instance-profile.id}"

root_block_device {
volume_type = "standard"
volume_size = 100
delete_on_termination = true
}

lifecycle {
create_before_destroy = true
}

associate_public_ip_address = "false"
key_name = "testone"

#
# register the cluster name with ecs-agent which will in turn coord
# with the AWS api about the cluster
#
user_data = <> /etc/ecs/ecs.config
EOF
}

#
# need an ASG so we can easily add more ecs host nodes as necessary
#
resource "aws_autoscaling_group" "ecs-autoscaling-group" {
name = "ecs-autoscaling-group"
max_size = "2"
min_size = "1"
desired_capacity = "1"

# vpc_zone_identifier = ["subnet-41395d29"]
vpc_zone_identifier = ["${module.new-vpc.private_subnets}"]
launch_configuration = "${aws_launch_configuration.ecs-launch-configuration.name}"
health_check_type = "ELB"

tag {
key = "Name"
value = "ECS-myecscluster"
propagate_at_launch = true
}
}

resource "aws_ecs_cluster" "test-ecs-cluster" {
name = "myecscluster"
}

Related: Is there a serious skills shortage in the devops space?

Setup your task definition

The third thing you need is a task. This one will spinup a generic nginx container. It’s a nice way to demonstrate things. For your real world usage, you’ll replace the image line with a docker image that you’ve pushed to ECR. I’ll leave that as an exercise. Once you have the cluster working, you should get the hang of things.

Note the portmappings, memory and CPU. All things you might expect to see in a docker-compose.yml file. So these tasks should look somewhat familiar.


data "aws_ecs_task_definition" "test" {
task_definition = "${aws_ecs_task_definition.test.family}"
depends_on = ["aws_ecs_task_definition.test"]
}

resource "aws_ecs_task_definition" "test" {
family = "test-family"

container_definitions = < < DEFINITION [ { "name": "nginx", "image": "nginx:latest", "memory": 128, "cpu": 128, "essential": true, "portMappings": [ { "hostPort": 0, "containerPort": 80, "protocol": "tcp" } ] } ] DEFINITION }

Related: Is AWS too complex for small dev teams?

Setup your service definition

The fourth thing you need to do is setup a service. The task above is a manifest, describing your container’s needs. It is now registered, but nothing is running.

When you apply the service your container will startup. What I like to do is, ssh into the ecs host box. Get comfortable. Then issue $ watch "docker ps". This will repeatedly run "docker ps" every two seconds. Once you have that running, do your terraform apply for this service piece.

As you watch, you'll see ECS start your container, and it will suddenly appear in your watch terminal. It will first show "starting". Once it is started, it should say "healthy".


resource "aws_ecs_service" "test-ecs-service" {
name = "test-vz-service"
cluster = "${aws_ecs_cluster.test-ecs-cluster.id}"
task_definition = "${aws_ecs_task_definition.test.family}:${max("${aws_ecs_task_definition.test.revision}", "${data.aws_ecs_task_definition.test.revision}")}"
desired_count = 1
iam_role = "${aws_iam_role.ecs-service-role.name}"

load_balancer {
target_group_arn = "${aws_alb_target_group.test.id}"
container_name = "nginx"
container_port = "80"
}

depends_on = [
# "aws_iam_role_policy.ecs-service",
"aws_alb_listener.front_end",
]
}

Related: Does AWS have a dirty secret?

Setup your application load balancer

The above will all work by itself. However for a real-world use case, you'll want to have an ALB. This one has only a simple HTTP port 80 listener. These are much simpler than setting up 443 for SSL, so baby steps first.

Once you have the ALB going, new containers will register with the target group, to let the alb know about them. In "docker ps" you'll notice they are running on a lot of high numbered ports. These are the hostPorts which are dynamically assigned. The container ports are all 80.


resource "aws_alb_target_group" "test" {
  name     = "my-alb-group"
  port     = 80
  protocol = "HTTP"
  vpc_id   = "${module.new-vpc.vpc_id}"
}

resource "aws_alb" "main" {
  name            = "my-alb-ecs"
  subnets         = ["${module.new-vpc.public_subnets}"]
  security_groups = ["${module.new-vpc.default_security_group_id}"]
}

resource "aws_alb_listener" "front_end" {
  load_balancer_arn = "${aws_alb.main.id}"
  port              = "80"
  protocol          = "HTTP"

  default_action {
    target_group_arn = "${aws_alb_target_group.test.id}"
    type             = "forward"
  }
}

You will also want to add a domain name, so that as your infra changes, and if you rebuild your ALB, the name of your application doesn't vary. Route53 will adjust as terraform changes are applied. Pretty cool.


resource "aws_route53_record" "myapp" {
zone_id = "${aws_route53_zone.primary.zone_id}"
name = "myapp.mydomain.com"
type = "CNAME"
ttl = "60"
records = ["${aws_alb.main.dns_name}"]

depends_on = ["aws_alb.main"]
}

Related: How to deploy on EC2 with vagrant

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don't work with recruiters

How do we test performance in a microservices world?


I recently ran across this interesting question on a technology forum.

“I’m an engineering team lead at a startup in NYC. Our app is written in Ruby on Rails and hosted on Heroku. We use metrics such as the built-in metrics on Heroku, as well as New Relic for performance monitoring. This summer, we’re expecting a large influx of traffic from a new partnership and would like to have confidence that our system can handle the load.”

“I’ve tried to wrap my head around different types of performance/load testing tools like JMeter, Blazemeter, and others. Additionally, I’ve experimented with scripts which have grown more complex and I’m following rabbit holes of functionality within JMeter (such as loading a CSV file for dynamic user login, and using response data in subsequent requests, etc.). Ultimately, I feel this might be best left to consultants or experts who could be far more experienced and also provide our organization an opportunity to learn from them on key concepts and best practices.”

Join 38,000 others and follow Sean Hull on twitter @hullsean.

Here’s my point by point response.

I’ve been doing performance tuning since the old dot-com days.

It used to be you point a loadrunner type tool at your webpage and let it run. Then watch the load, memory & disk on your webserver or database. Before long you’d find some bottlenecks. Shortage of resources (memory, cpu, disk I/O) or slow queries were often the culprit. Optimizing queries, and ripping out those pesky ORMs usually did the trick.

Related: Why generalists are better at scaling the web

Today things are quite a bit more complicated. Yes jmeter & blazemeter are great tools. You might also get newrelic installed on your web nodes. This will give you instrumentation on where your app spends time. However it may still not be easy. With microservices, you have the docker container & orchestration layer to consider. In the AWS environment you can have bottlenecks on disk I/O where provisioned IOPS can help. But instance size also impacts network interfaces in the weird world of multi-tenant. So there’s that too!

Related: 5 things toxic to scalability

What’s more a lot of frameworks are starting to steer back towards ORMs again. Sadly this is not a good trend. On the flip side if you’re using RDS, your default MySQL or postgres settings may be decent. And newer versions of MySQL are getting some damn fancy & performant indexes. So there’s lots of improvement there.

Related: Anatomy of a performance review

There is also the question of simulating real users. What is a real user? What is an ACTIVE user? These are questions that may seem obvious, although I’ve worked at firms where engineering, product, sales & biz-dev all had different answers. But let’s say you’ve answered that. Does our load test simply log in the user? Or do they use a popular section of the site? Or how about an unpopular section of the site? Often we are guessing what “real world” users do and how they use our app.

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

What makes a highly valued docker expert?


What exactly do we need to know about to manage docker effectively? What are the main pain points?

Join 38,000 others and follow Sean Hull on twitter @hullsean.

The basics aren’t tough. You need to know the anatomy of a Dockerfile, and how to setup a docker-compose.yml to ease the headache of docker run. You also should know how to manage docker images, and use docker ps to find out what’s currently running. And get an interactive shell (docker exec -it containerid bash). You’ll also make friends with inspect. But what else?

1. Manage image bloat

Docker images can get quite large. Even as you try to pare them down they can grow. Why is this?

Turns out the architecture of docker means as you add more stuff, it creates more “layers”. So even as you delete files, the lower or earlier layers still contain your files.

One option, during a package install you can do this:

RUN apt-get update && apt-get install -y mypkg && rm -rf /var/lib/apt/lists/*

This will immediately cleanup the crap that apt-get built from, without it ever becoming permanent in that layer. Cool! As long as you use “&&” it is part of that same RUN command, and thus part of that same layer.

Another option is you can flatten a big image. Something like this should work:

$ docker export 0453814a47b3 | docker import - newimage

Related: 30 questions to ask a serverless fanboy

2. Orchestrate

Running docker containers on dev is great, and it can be a fast and easy way to get things running. Plus it can work across dev environments well, so it solves a lot of problems.

But what about when you want to get those containers up into the cloud? That’s where orchestration comes in. At the moment you can use docker’s own swarm or choose fleet or mesos.

But the biggest players seem to be kubernetes & ECS. The former of course is what all the cool kids in town are using, and coupled with the Helm package manager, it becomes a very manageable system. Get your pods, services, volumes, replicasets & deployments ready to go!

On the other hand Amazon is pushing ahead with its Elastic Container Service, which is native to AWS, and not open source. It works well, allowing you to apply a json manifest to create a task. Then just as with kubernetes you create a “service” to run one or more copies of that. Think of the task as a docker-compose file. It’s in json, but it basically specifies the same types of things. Entrypoint, ports, base image, environment etc.

For those wanting to go multi-cloud, kubernetes certainly has an appeal. But amazon is on the attack. They have announced a service to further ease container deployments, dubbed Amazon Fargate. Remember how Lambda allowed you to just deploy your *code* into the cloud, and let amazon worry about the rest? Imagine doing that with containers; that’s what Fargate is.

Check out what Krish has to say – Why Kubernetes should be scared of AWS

Related: What’s the luckiest thing that’s happened in your career?

3. Registries & Deployment

There are a few different options for where to store those docker images.

One choice is dockerhub. It’s not feature rich, but it does the job. There is also Quay.io. Alternatively you can run your own registry. It’s as easy as:

$ docker run -d -p 5000:5000 registry:2

Of course if you’re running your own registry, now you need to manage that, and think about its uptime and dependability to your deployment pipeline.

If you’re using ECS, you’ll be able to use ECR, which is a private docker registry that comes with your AWS account. I think you can use this even if you’re not on ECS. The login process is a little weird.
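With the older aws CLI (v1) it goes something like this. The account id and repo name are placeholders:

# authenticate docker to your private ECR registry
$ $(aws ecr get-login --no-include-email --region us-east-1)
# tag & push a locally built image
$ docker tag myapp:latest 12345678901.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
$ docker push 12345678901.dkr.ecr.us-east-1.amazonaws.com/myapp:latest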

Once you have those pieces in place, you can do some fun things. Your jenkins deploy pipeline can use docker containers for testing, to spinup a copy of your app just to run some unittests, or it can build your images, and push them to your registry, for later use in ECS tasks or Kubernetes manifests. Awesome sauce!

Related: Is Amazon Web Services too complex for small dev teams?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters