The public cloud is no longer a bleeding edge technology for the trailblazers. It’s mainstream now. As you think about it, you consider your customers and the SLAs they’ve come to expect.
It’s no longer a question of if, but when you’ll move to the cloud, how you’ll get there, and how fast the transition will be.
Here are my thoughts on what to start thinking about.
1. Ramp up team, skills & paradigm thinking
Teams with experience in traditional datacenters have certain ways of architecting solutions and thinking about problems. For example, they may choose NFS servers to host objects, whereas in the cloud you would use object storage such as S3.
S3 brings all sorts of new features, like lifecycle policies and famously redundant eleven 9’s of durability. But your applications may need to be retrofitted to work with it, and your devs may need to learn about new features and functionality.
What about networking? This changes a lot in the cloud, with VPCs, and virtual appliances like NATs and Gateways. And what about security groups?
Interacting with this new world of cloud resources requires new skill sets and new ways of thinking. So priority one will be getting your engineering teams learning and upgrading their skills. I wrote a piece about this: how do I migrate my skills to the cloud?
With the old-style datacenter, you typically have a firewall, and everything gets blocked & controlled. The new world of cloud computing uses security groups. These can be applied at the network level, across your VPC, or at the server level. And of course you can have many security groups with overlapping jurisdictions. Here’s how you set up a VPC with Terraform.
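To make that concrete, a single security group in Terraform might look something like this. It’s a rough sketch: the VPC variable, AMI and port are placeholders to adapt to your own setup.
# a minimal security group allowing HTTPS in from anywhere,
# attached at the server (instance) level
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "allow https in from anywhere"
  vpc_id      = "${var.vpc_id}"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "web" {
  ami                    = "ami-40d28157" # placeholder AMI
  instance_type          = "t2.micro"
  vpc_security_group_ids = ["${aws_security_group.web.id}"]
}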
So understanding how things work in the public cloud is quite new and challenging. There are ingress and egress rules, ways to audit with network flow logs, and more.
However again, it’s one thing to have the features available, it’s quite another to put them to proper use.
While the public cloud collectively is extremely resilient, the individual components such as EC2 instances are decidedly not reliable. It’s expected that they can and will die frequently. It’s your job as the customer to build things in a self-healing way.
That means VPCs with multiple subnets across availability zones (multi-AZ). And that means redundant instances for everything. What’s more, you front your servers with load balancers (classic or application), which are themselves redundant.
Whether you are building a containerized application and deploying on ECS or a traditional auto-scaling webserver with database backend, you’ll need to plan for failure. And that means code that detects, and reacts to such failures without downtime to the end user.
So there’s a learning curve: both for your operations teams, who previously called Rackspace to get a new server provisioned, and for your business, which must learn what causes an outage and the tricky, finicky sides of managing your public cloud through code.
As you automate more and more pieces, you may lose track of the overall scope of your deployments. How many servers am I using right now? How many S3 buckets? What about elastic IPs?
As your automation can itself spin up new temporary environments, those resource counts will change from moment to moment. Even a spike in user engagement or a sudden flash sale can change your cloud footprint in an instant.
That’s where heavy use of logging such as ELK (Elasticsearch, Logstash and Kibana) can really help. Sure, AWS offers CloudWatch and CloudTrail, but again you must put it all to good use.
Here’s the code to create the VPC. I’m using the Terraform community module to do this.
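Roughly, it looks like this. Treat it as a sketch: I’m assuming the terraform-aws-modules/vpc/aws community module, and the CIDR ranges are placeholders.
module "eks-vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "eks-vpc"
  cidr = "10.0.0.0/16"

  azs            = ["${var.eks-region}a", "${var.eks-region}b", "${var.eks-region}c"]
  public_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  # the special tags EKS & kubernetes look for; per the notes below,
  # these propagate down to subnets & security groups
  tags = "${
    map(
      "Name", "eks-vpc",
      "kubernetes.io/cluster/${var.cluster-name}", "shared",
    )
  }"
}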
There are two things to notice here. One is that I reference the eks-region variable. Add this to your vars.tf, set to “us-east-1” or whatever region you like. Also add cluster-name to your vars.tf.
Also notice the special tags. Those are super important. If you don’t tag your resources properly, Kubernetes won’t be able to do its thing. Or rather, EKS won’t. I hit this problem early on and it is very hard to diagnose. The tags in this VPC module will propagate to subnets and security groups, which is also crucial.
First you need to install the kubectl client on your local desktop. I used brew, the macOS package manager. You’ll also need the heptio-authenticator-aws binary. Again, refer to the AWS docs for help on this.
The main piece you will add is a directory (~/.kube) and the config file inside it, ~/.kube/config, which points kubectl at your new EKS cluster.
This is definitely the largest file in your terraform EKS code. Let me walk you through it a bit.
First we attach some policies to our role. These are all essential to EKS. They’re predefined but you need to group them together.
Then you need to create a security group for your worker nodes. Notice this also has the special Kubernetes tag added. Be sure it’s there or you’ll have problems.
Then we add some additional ingress rules, which allow the workers & the Kubernetes control plane to communicate with each other.
Next you’ll see some serious user-data code. This handles all the startup action on the worker node instances. Notice we reference some variables here, so be sure those are defined.
Lastly we create a launch configuration and autoscaling group. Notice we give it the AMI as defined in the AWS docs. These are EKS-optimized images, with all the supporting software. Notice also that they are currently only available in us-east-1 and us-west-2.
Notice also that the autoscaling group has the special Kubernetes tag. As I’ve been saying over and over, that’s super important.
#
# EKS Worker Nodes Resources
# * IAM role allowing Kubernetes actions to access other AWS services
# * EC2 Security Group to allow networking traffic
# * Data source to fetch latest EKS worker AMI
# * AutoScaling Launch Configuration to configure worker instances
# * AutoScaling Group to launch worker instances
#
resource "aws_iam_role" "demo-node" {
name = "terraform-eks-demo-node"
assume_role_policy = --POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKSWorkerNodePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = "${aws_iam_role.demo-node.name}"
}
resource "aws_iam_role_policy_attachment" "demo-node-AmazonEKS_CNI_Policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = "${aws_iam_role.demo-node.name}"
}
resource "aws_iam_role_policy_attachment" "demo-node-AmazonEC2ContainerRegistryReadOnly" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = "${aws_iam_role.demo-node.name}"
}
resource "aws_iam_role_policy_attachment" "demo-node-lb" {
policy_arn = "arn:aws:iam::12345678901:policy/eks-lb-policy"
role = "${aws_iam_role.demo-node.name}"
}
resource "aws_iam_instance_profile" "demo-node" {
name = "terraform-eks-demo"
role = "${aws_iam_role.demo-node.name}"
}
resource "aws_security_group" "demo-node" {
name = "terraform-eks-demo-node"
description = "Security group for all nodes in the cluster"
# vpc_id = "${aws_vpc.demo.id}"
vpc_id = "${module.eks-vpc.vpc_id}"
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = "${
map(
"Name", "terraform-eks-demo-node",
"kubernetes.io/cluster/${var.cluster-name}", "owned",
)
}"
}
resource "aws_security_group_rule" "demo-node-ingress-self" {
description = "Allow node to communicate with each other"
from_port = 0
protocol = "-1"
security_group_id = "${aws_security_group.demo-node.id}"
source_security_group_id = "${aws_security_group.demo-node.id}"
to_port = 65535
type = "ingress"
}
resource "aws_security_group_rule" "demo-node-ingress-cluster" {
description = "Allow worker Kubelets and pods to receive communication from the cluster control plane"
from_port = 1025
protocol = "tcp"
security_group_id = "${aws_security_group.demo-node.id}"
source_security_group_id = "${module.eks-vpc.default_security_group_id}"
to_port = 65535
type = "ingress"
}
data "aws_ami" "eks-worker" {
filter {
name = "name"
values = ["eks-worker-*"]
}
most_recent = true
owners = ["602401143452"] # Amazon
}
# EKS currently documents this required userdata for EKS worker nodes to
# properly configure Kubernetes applications on the EC2 instance.
# We utilize a Terraform local here to simplify Base64 encoding this
# information into the AutoScaling Launch Configuration.
# More information: https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-06-05/amazon-eks-nodegroup.yaml
locals {
  demo-node-userdata = <<USERDATA
#!/bin/bash -xe
CA_CERTIFICATE_DIRECTORY=/etc/kubernetes/pki
CA_CERTIFICATE_FILE_PATH=$CA_CERTIFICATE_DIRECTORY/ca.crt
mkdir -p $CA_CERTIFICATE_DIRECTORY
echo "${aws_eks_cluster.eks-cluster.certificate_authority.0.data}" | base64 -d > $CA_CERTIFICATE_FILE_PATH
INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /var/lib/kubelet/kubeconfig
sed -i s,CLUSTER_NAME,${var.cluster-name},g /var/lib/kubelet/kubeconfig
sed -i s,REGION,${var.eks-region},g /etc/systemd/system/kubelet.service
sed -i s,MAX_PODS,20,g /etc/systemd/system/kubelet.service
sed -i s,MASTER_ENDPOINT,${aws_eks_cluster.eks-cluster.endpoint},g /etc/systemd/system/kubelet.service
sed -i s,INTERNAL_IP,$INTERNAL_IP,g /etc/systemd/system/kubelet.service
DNS_CLUSTER_IP=10.100.0.10
if [[ $INTERNAL_IP == 10.* ]] ; then DNS_CLUSTER_IP=172.20.0.10; fi
sed -i s,DNS_CLUSTER_IP,$DNS_CLUSTER_IP,g /etc/systemd/system/kubelet.service
sed -i s,CERTIFICATE_AUTHORITY_FILE,$CA_CERTIFICATE_FILE_PATH,g /var/lib/kubelet/kubeconfig
sed -i s,CLIENT_CA_FILE,$CA_CERTIFICATE_FILE_PATH,g /etc/systemd/system/kubelet.service
systemctl daemon-reload
systemctl restart kubelet
USERDATA
}
resource "aws_launch_configuration" "demo" {
associate_public_ip_address = true
iam_instance_profile = "${aws_iam_instance_profile.demo-node.name}"
image_id = "${data.aws_ami.eks-worker.id}"
instance_type = "m4.large"
name_prefix = "terraform-eks-demo"
security_groups = ["${aws_security_group.demo-node.id}"]
user_data_base64 = "${base64encode(local.demo-node-userdata)}"
lifecycle {
create_before_destroy = true
}
}
resource "aws_autoscaling_group" "demo" {
desired_capacity = 2
launch_configuration = "${aws_launch_configuration.demo.id}"
max_size = 2
min_size = 1
name = "terraform-eks-demo"
# vpc_zone_identifier = ["${aws_subnet.demo.*.id}"]
vpc_zone_identifier = ["${module.eks-vpc.public_subnets}"]
tag {
key = "Name"
value = "eks-worker-node"
propagate_at_launch = true
}
tag {
key = "kubernetes.io/cluster/${var.cluster-name}"
value = "owned"
propagate_at_launch = true
}
}
$ kubectl get services
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP        PORT(S)          AGE
guestbook      LoadBalancer   172.20.177.126   aaaaa555ee87c...   3000:31710/TCP   4d
kubernetes     ClusterIP      172.20.0.1       <none>             443/TCP          10d
redis-master   ClusterIP      172.20.242.65    <none>             6379/TCP         4d
redis-slave    ClusterIP      172.20.163.1     <none>             6379/TCP         4d
$
Use “kubectl get services -o wide” to see the entire EXTERNAL-IP. If that column is stuck saying “pending”, you likely have an issue with your node IAM role, or you’re missing the special Kubernetes tags. So check on those. It shouldn’t stay pending for more than a minute, really.
Hope you got everything working.
Good luck and if you have questions, post them in the comments & I’ll try to help out!
As I was building infrastructure code, I stumbled quite a few times. You hit a wall and you have to work through those confusing and frustrating moments.
Here are a few of the lessons I learned in the process of building code for AWS. It’s not easy but when you get there you can enjoy the vistas. They’re pretty amazing.
Don’t pass credentials
As you build your applications, there are moments where components need to use AWS in some way. Your webserver needs to use S3 or your ELK box needs to use CloudWatch. Maybe you want to do an RDS backup, or list EC2 instances.
However it’s not safe to pass your access_key and secret_access_key around. Those should be for your desktop only. So how best to handle this in the cloud?
IAM roles to the rescue. These are collections of privileges. The cool thing is they can be assigned at the INSTANCE LEVEL. Meaning your whole server has permissions to use said resources.
Do this by first creating a role with the privileges you want. Create a JSON policy document which outlines the specific rules as you see fit. Then create an instance profile for that role.
When you create your EC2 instance in Terraform, you’ll specify that instance profile, either by ARN or, if Terraform created it, by resource ID.
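In Terraform that wiring looks roughly like this. It’s a sketch: the role name, bucket policy and AMI are placeholders, so adapt them to your own setup.
# a role that EC2 instances are allowed to assume
resource "aws_iam_role" "app" {
  name = "app-instance-role"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

# attach whatever policy documents you need to the role, then
# wrap it in an instance profile
resource "aws_iam_instance_profile" "app" {
  name = "app-instance-profile"
  role = "${aws_iam_role.app.name}"
}

# the instance gets the role via the instance profile; no keys on the box
resource "aws_instance" "app" {
  ami                  = "ami-40d28157" # placeholder AMI
  instance_type        = "t2.micro"
  iam_instance_profile = "${aws_iam_instance_profile.app.name}"
}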
Even though we know it should not happen, sometimes it does. We need to be vigilant to stay on top of this problem. There are projects like Pivotal’s credential scan. This can be used to check your source files for passwords.
What about something like RDS? You’re going to need to specify a password in your Terraform code right? Wrong! You can define a variable with no default as follows:
variable "my_rds_pass" {
description = "password for rds database"
}
When Terraform comes upon this variable in your code but sees there is no “default” value, it will prompt you when you do “$ terraform apply”.
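Then reference that variable wherever the password is needed, for example in an RDS resource. This is just a sketch; the engine, size and names are illustrative.
resource "aws_db_instance" "mydb" {
  identifier        = "mydb"
  engine            = "postgres"
  instance_class    = "db.t2.micro"
  allocated_storage = 20
  username          = "dbadmin"
  password          = "${var.my_rds_pass}" # prompted at apply time, never committed
}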
When you first start building terraform code, chances are you create a directory, and some tf files, then do your “$ terraform apply”. When you watch that infra build for the first time, it’s exciting!
After you add more components, your code gets more complex. Hopefully you’ve created a git repo to house your code. You can check & commit the files, so you have them in a safe place. But of course there’s more to the equation than this.
How do you handle multiple environments, dev, stage & production all using the same code?
That’s where modules come in. Now at the beginning you may well have a module that looks like this:
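Something along these lines, say, where the module name & path are just placeholders:
module "my-app-infra" {
  source = "../modules/my-app-infra"

  environment   = "dev"
  instance_type = "t2.micro"
}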
Etc, and so on. That’s a first step in the right direction. However, if you change your source code, all of your environments will now be using that code. They will get it as soon as you do “$ terraform apply” for each. That’s fine, but it doesn’t scale well.
Ultimately you want to manage your code like other software projects. So as you make changes, you’ll want to tag it.
So go ahead and checkin your latest changes:
# push your latest changes
$ git push origin master
# now tag it
$ git tag -a v0.1 -m "my latest coolest infra"
# now push the tags
$ git push origin v0.1
Great. Now you want to modify your module slightly, as follows:
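The key change is pointing the module source at your git repo, pinned to a tag. The repo URL here is a placeholder.
module "my-app-infra" {
  source = "git::https://github.com/myorg/infra-code.git//modules/my-app-infra?ref=v0.1"

  environment   = "dev"
  instance_type = "t2.micro"
}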
Cool! Now dev, stage and prod can each reference a different version. So you are free to work on the infra without interrupting stage or prod. When you’re ready to promote that code, check in, tag and update stage.
You could go a step further to be more agile, and have a post-commit hook that triggers the stage terraform apply. This though requires you to build solid infra tests. Check out testinfra and terratest.
My recent discovery is even more serious! Terraform wants to build infra. And it wants to be able to later destroy that infra. In the case of databases, obviously the previous state is one you want to keep. You want that to be perpetual, beyond the infra build. Obvious, no?
Apparently not to the folks at Amazon. When you destroy an RDS instance, it will destroy all the old backups you created. I have no idea why anyone would want this. Certainly not as a default behavior. What’s worse, you can’t copy those backups elsewhere. Why not? They’re probably sitting in S3 anyway!
While you can take a final backup when you destroy an RDS instance, that’s wonderful and I recommend it. However, that’s not enough. I highly suggest you take matters into your own hands. Build a script that calls pg_dump yourself, and copy those .sql or .dump files to S3 for safekeeping.
As with RDS, when you create S3 buckets with your infra, you want to be able to cleanup later. But the trouble is that once you create a bucket, you’ll likely fill it with objects and files.
What then happens is when you go to do “$ terraform destroy” it will fail with an error. This makes sense as a default behavior. We don’t want data disappearing without our knowledge.
However you do want to be able to cleanup. So what to do? Two things.
Firstly, create a process, perhaps a Lambda job or bucket replication, to regularly sync your S3 bucket to your permanent archive bucket. Run that every fifteen minutes or as often as you need.
Then add a force_destroy line to your s3 bucket resource. Here’s an example s3 bucket for storing load balancer logs:
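A minimal sketch of that bucket might look like this. The bucket name is a placeholder, and a real logging bucket would also need a policy allowing the load balancer account to write to it.
resource "aws_s3_bucket" "alb-logs" {
  bucket = "my-alb-logs-bucket"
  acl    = "private"

  # allow terraform destroy to remove the bucket even when it still holds objects
  force_destroy = true
}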
ECS is Amazon’s Elastic Container Service. That’s Greek for how you get Docker containers running in the cloud. It’s sort of like Kubernetes without all the bells and whistles.
It takes a bit of getting used to, but this Terraform howto should get you moving. You need an EC2 host to run your containers on, you need a task that defines your container image & resources, and lastly a service which tells ECS which cluster to run on and registers with the ALB if you have one.
For each of these sections, create files: roles.tf, instance.tf, task.tf, service.tf, alb.tf. What I would recommend is to create the first file, roles.tf, then do:
$ terraform init
$ terraform plan
$ terraform apply
Then move on to instance.tf and do the terraform apply. One by one: next the task, then the service, then finally the ALB. This way if you encounter errors, you can troubleshoot minimally, rather than digging through five files for the culprit.
This howto also requires a VPC. Terraform has a very good community VPC module which will get you going in no time.
I recommend deploying in the public subnets for your first run, to avoid the complexity of a jump box, private IPs for the ECS instances, and so on.
Good luck!
May the terraform force be with you!
First setup roles
Roles are a really brilliant part of the AWS stack. Inside of IAM, or Identity & Access Management, you can create roles. These are collections of privileges. I’m allowed to use this S3 bucket, but not others. I can use EC2, but not Athena. And so forth. There are some special policies already created just for ECS, and you’ll need roles to use them.
These roles will be applied at the instance level, so your ECS host doesn’t have to pass credentials around. Clean. Secure. Smart!
Next you need EC2 instances on which to run your docker containers. Turns out AWS has already built AMIs just for this purpose. They call them ECS Optimized Images. There is one unique AMI id for each region. So be sure you’re using the right one for your setup.
The other thing that your instance needs to do is echo the cluster name to /etc/ecs/ecs.config. You can see us doing that in the user_data script section.
Lastly we’re configuring our instance inside of an auto-scaling group. That’s so we can easily add more instances dynamically to scale up or down as necessary.
#
# the ECS optimized AMI's change by region. You can lookup the AMI here:
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
#
# us-east-1 ami-aff65ad2
# us-east-2 ami-64300001
# us-west-1 ami-69677709
# us-west-2 ami-40ddb938
#
#
# need to add security group config
# so that we can ssh into an ecs host from bastion box
#
#
# register the cluster name with ecs-agent which will in turn coord
# with the AWS api about the cluster
#
  # register the instance with the ECS cluster (the cluster resource name is assumed;
  # match it to your own aws_ecs_cluster)
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${aws_ecs_cluster.my-cluster.name} >> /etc/ecs/ecs.config
EOF
}
#
# need an ASG so we can easily add more ecs host nodes as necessary
#
resource "aws_autoscaling_group" "ecs-autoscaling-group" {
name = "ecs-autoscaling-group"
max_size = "2"
min_size = "1"
desired_capacity = "1"
The third thing you need is a task. This one will spin up a generic nginx container. It’s a nice way to demonstrate things. For your real-world usage, you’ll replace the image line with a Docker image that you’ve pushed to ECR. I’ll leave that as an exercise. Once you have the cluster working, you should get the hang of things.
Note the portMappings, memory and CPU. All things you might expect to see in a docker-compose.yml file. So these tasks should look somewhat familiar.
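Here’s a rough sketch of such a task definition. The family, container name and sizes are placeholders; swap the image for your own ECR image when you’re ready.
resource "aws_ecs_task_definition" "nginx" {
  family = "nginx-demo"

  container_definitions = <<DEFINITION
[
  {
    "name": "nginx",
    "image": "nginx:latest",
    "essential": true,
    "cpu": 256,
    "memory": 256,
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 0
      }
    ]
  }
]
DEFINITION
}
Setting hostPort to 0 is what gives you the dynamically assigned high-numbered host ports you’ll see later in docker ps.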
The fourth thing you need to do is setup a service. The task above is a manifest, describing your containers needs. It is now registered, but nothing is running.
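The service itself can be as small as this sketch. The cluster resource name is assumed, so match it to your own aws_ecs_cluster; the load balancer wiring comes in the next step.
resource "aws_ecs_service" "nginx" {
  name            = "nginx-service"
  cluster         = "${aws_ecs_cluster.my-cluster.id}" # assumed cluster resource name
  task_definition = "${aws_ecs_task_definition.nginx.arn}"
  desired_count   = 1
}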
When you apply the service your container will startup. What I like to do is, ssh into the ecs host box. Get comfortable. Then issue $ watch "docker ps". This will repeatedly run "docker ps" every two seconds. Once you have that running, do your terraform apply for this service piece.
As you watch, you'll see ECS start your container, and it will suddenly appear in your watch terminal. It will first show "starting". Once it is started, it should say "healthy".
The above will all work by itself. However for a real-world use case, you'll want to have an ALB. This one has only a simple HTTP port 80 listener. These are much simpler than setting up 443 for SSL, so baby steps first.
Once you have the ALB going, new containers will register with the target group, to let the alb know about them. In "docker ps" you'll notice they are running on a lot of high numbered ports. These are the hostPorts which are dynamically assigned. The container ports are all 80.
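The ALB itself is just a few lines. This is a sketch: it assumes an aws_security_group.alb-sg that allows port 80 in, and the resource name matches the aws_alb.main referenced by the listener and Route53 record below.
resource "aws_alb" "main" {
  name            = "my-ecs-alb"
  internal        = false
  subnets         = ["${module.new-vpc.public_subnets}"]
  security_groups = ["${aws_security_group.alb-sg.id}"] # assumed ALB security group allowing port 80
}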
#
#
resource "aws_alb_target_group" "test" {
name = "my-alb-group"
port = 80
protocol = "HTTP"
vpc_id = "${module.new-vpc.vpc_id}"
}
default_action {
target_group_arn = "${aws_alb_target_group.test.id}"
type = "forward"
}
}
You will also want to add a domain name, so that as your infra changes, and if you rebuild your ALB, the name of your application doesn't vary. Route53 will adjust as terraform changes are applied. Pretty cool.
resource "aws_route53_record" "myapp" {
zone_id = "${aws_route53_zone.primary.zone_id}"
name = "myapp.mydomain.com"
type = "CNAME"
ttl = "60"
records = ["${aws_alb.main.dns_name}"]
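Here’s the minimal main.tf we’ll start from. The AMI, subnet and key name are placeholders; they match the plan output below, so swap in your own values.
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "example" {
  ami           = "ami-40d28157"
  subnet_id     = "subnet-111ddaaa"
  instance_type = "t2.micro"
  key_name      = "seanKey"
}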
Please change the subnet to a valid one for you. In the real world you would definitely *not* hardcode a subnet like this. But I wanted to keep this example very simple. Don’t know what subnet to use? Navigate your aws dashboard over to “VPC” and dig around.
Also of course edit for your key.
Ok, you’re ready to test. Let’s first ask terraform what it will do with the “plan” command:
levanter:terraform sean$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but
will not be persisted to local or remote state storage.
The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.
Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.
+ aws_instance.example
    ami:                      "ami-40d28157"
    availability_zone:        "<computed>"
    ebs_block_device.#:       "<computed>"
    ephemeral_block_device.#: "<computed>"
    instance_state:           "<computed>"
    instance_type:            "t2.micro"
    key_name:                 "seanKey"
    network_interface_id:     "<computed>"
    placement_group:          "<computed>"
    private_dns:              "<computed>"
    private_ip:               "<computed>"
    public_dns:               "<computed>"
    public_ip:                "<computed>"
    root_block_device.#:      "<computed>"
    security_groups.#:        "<computed>"
    source_dest_check:        "true"
    subnet_id:                "subnet-111ddaaa"
    tenancy:                  "<computed>"
    vpc_security_group_ids.#: "<computed>"
Plan: 1 to add, 0 to change, 0 to destroy.
levanter:terraform sean$
Next you want to ask terraform to go ahead and do the work. Because above we only did a dry-run.
levanter:terraform sean$ terraform apply
aws_instance.example: Creating...
  ami:                      "" => "ami-40d28157"
  availability_zone:        "" => "<computed>"
  ebs_block_device.#:       "" => "<computed>"
  ephemeral_block_device.#: "" => "<computed>"
  instance_state:           "" => "<computed>"
  instance_type:            "" => "t2.micro"
  key_name:                 "" => "seanKey"
  network_interface_id:     "" => "<computed>"
  placement_group:          "" => "<computed>"
  private_dns:              "" => "<computed>"
  private_ip:               "" => "<computed>"
  public_dns:               "" => "<computed>"
  public_ip:                "" => "<computed>"
  root_block_device.#:      "" => "<computed>"
  security_groups.#:        "" => "<computed>"
  source_dest_check:        "" => "true"
  subnet_id:                "" => "subnet-111ddaaa"
  tenancy:                  "" => "<computed>"
  vpc_security_group_ids.#: "" => "<computed>"
aws_instance.example: Still creating... (10s elapsed)
aws_instance.example: Still creating... (20s elapsed)
aws_instance.example: Creation complete
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.
State path: terraform.tfstate
levanter:terraform sean$
One thing I like is that Terraform shows us the progress at the command line. CloudFormation isn’t so polished. 🙂
Ok, let’s add a tag name to our server. We’re going to add just three lines to our main.tf file:
provider "aws" {
region = "us-east-1"
}
resource "aws_instance" "example" {
ami = "ami-40d28157"
subnet_id = "subnet-111ddaaa"
instance_type = "t2.micro"
tags {
Name = "terraform-box"
}
}
Now we do terraform apply again. Look how easy that change is to make!
levanter:terraform sean$ terraform apply
aws_instance.example: Refreshing state... (ID: i-0ddd063bbbbce56e2)
aws_instance.example: Modifying...
tags.%: "0" => "1"
tags.Name: "" => "terraform-box"
aws_instance.example: Modifications complete
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.
State path: terraform.tfstate
levanter:terraform sean$
Navigate to the EC2 dashboard and you should see the first column showing your new name.
That was cool!
Chances are you don’t wanna leave these components sitting around. Let’s cleanup. That’s easy too!
levanter:terraform sean$ terraform destroy
Do you really want to destroy?
Terraform will delete all your managed infrastructure.
There is no undo. Only 'yes' will be accepted to confirm.
Enter a value: yes
aws_instance.example: Refreshing state... (ID: i-0ddd063bbbbce56e2)
aws_instance.example: Destroying...
aws_instance.example: Still destroying... (10s elapsed)
aws_instance.example: Still destroying... (20s elapsed)
aws_instance.example: Still destroying... (30s elapsed)
aws_instance.example: Still destroying... (40s elapsed)
aws_instance.example: Still destroying... (50s elapsed)
aws_instance.example: Still destroying... (1m0s elapsed)
aws_instance.example: Destruction complete
Destroy complete! Resources: 1 destroyed.
levanter:terraform sean$
Since I didn’t define “outputs”, to keep the YAML simple, the command above should just return without error.
You can go into the AWS dashboard, navigate to “CloudFormation” and see the stack being created. You can also see under “EC2” that a new instance has been created.
Terraform supports just JSON or its own HCL (HashiCorp Configuration Language). Actually, the latter format is better supported.
On the CloudFormation side you can use yaml or json.
However CloudFormation can be clunky and frustrating to work with. For example, a dry run in Terraform is easy: just use “plan”. And isn’t that something we’re going to do over and over?
In CloudFormation there is a “validate-template” option, but this just checks your JSON or YAML. It doesn’t hit Amazon’s API or test things in any real way. They have added something called Change Sets, but I haven’t tried them much yet.
Also, CloudFormation’s error messages are really lacking. They often give you a syntax error or tell you a resource is incomplete without real details on where or how. It makes debugging slow and tedious. Sometimes I see errors at create-stack calls. Other times that succeeds, only for errors to surface later in the CloudFormation dashboard.
She went on to say how much has changed in the last decade. We talked about how the database administrator, as a career role, wasn’t really being hired for much these days. Things had changed. Evolved a lot.
How do you keep up with all the new technology, she asked?
I went on to talk about Amazon RDS, EC2, Lambda & serverless as really exciting stuff. And let’s not forget Terraform (I wrote a howto on Terraform), Ansible, Jenkins and all the other deployment automation technologies.
We talked about Redshift too. It seems to be everywhere these days and is starting to supplant Hadoop as the warehouse of choice for analytics.
It was a great conversation, and afterward I decided to summarize my thoughts. Here’s how I think automation and the cloud are impacting the dba role.
My career pivots
Over the years I’ve poured all those computer science algorithms, coding & hardware skills into a lot of areas. Tools & popular language change. Frameworks change. But solid deductive reasoning remains priceless.
o C++ Developer
Fresh out of college I was doing Object Oriented Programming on the Macintosh with CodeWarrior & PowerPlant. C++ development is no joke, and daily coding builds strength in a lot of areas. Turns out the application was a database application, so I was already getting my feet wet with databases.
o Jack of all trades developer & Unix admin
One type of job role that I highly recommend early on is as a generalist. At a small startup with less than ten employees, you become the primary technology solutions architect. So any projects that come along you get your hands dirty with. I was able to land one of these roles. I got to work on Windows one day, Mac programming another & Unix administration & Oracle yet another day.
o Oracle DBA
The third pivot was to work primarily on Oracle. I attended Oracle conferences & my peers were Oracle admins. Interestingly, many of the Oracle “experts” came from more of a business background, not computer science. So having a more technical foundation really made you stand out.
For the startups I worked with, I was a performance guru, scalability expert. Managers may know they have Oracle in the mix, but ultimately the end goal is to speed up the website & make the business run. The technical nuts & bolts of Oracle DBA were almost incidental.
o MySQL & Postgres
As Linux matured, so did a lot of other open source projects. In particular the two big Open Source databases, MySQL & Postgres became viable.
Suddenly startups were willing to put their businesses on these technologies. They could avoid huge fees in Oracle licenses. Still there were not a lot of career database experts around, so this proved a good niche to focus on.
o RDS & Redshift on Amazon Cloud
Fast forward a few more years and it’s my fifth career pivot. Amazon Web Services bursts on the scene. Every startup is deploying their applications in the cloud. And they’re using Amazon RDS, their managed database service, to do it. That meant the traditional DBA role was less crucial. Sure, the business still needed data expertise, but usually not as a dedicated role.
Time to shift gears and pour all of that Linux & server building experience into cloud deployments & migrating to the cloud.
o Devops, data, scalability & performance
Now of course the big sysadmin type role is usually called an SRE or Devops role. SRE being site reliability engineer. New name but many of the same responsibilities.
Now though infrastructure as code becomes front & center. Tools like CloudFormation & Terraform, plus Ansible, Chef & Jenkins are all quite mature, and being used everywhere.
Check out your infrastructure code from git, and run terraform apply. And minutes later you have rebuilt your entire stack from bare metal to fully functioning & autoscaling application. Cool!
For DBAs who are looking at the cloud from the old way of doing things, there’s a lot to love about it.
Automation brings repeatability to work & jobs. This is great. It raises the bar & makes us more professional, reducing manual processes & mistakes.
Infrastructure as code is self documenting. It means we have a better idea of day-to-day processes, and can more easily handoff to new folks as we change roles or companies.
However these days cloud, automation & microservices have brought a lot of madness too! Don’t believe me? Check out this piece on microservice madness.
With microservices you have more databases across the enterprise, on more platforms. How do you restore all at the same time? How do you do point-in-time recovery? What if your managed service goes down?
Migration scripts have become popular to make DDL changes in the database. Going forward (adding columns or tables) is great. But should we be letting our deployment automation roll *BACK* DDL changes? Remember that deletes data right? 🙂
What about database drop & rebuild? Or throwing databases in a docker container? No bueno. But we’re seeing this more and more. New performance problems are cropping up because of that.
What about when your database upgrades automatically? Remember, when you use a managed service, it is built for 1000 users, not one. So if your use case is different you may struggle.
In my experience upgrading RDS was a nightmare. Database as a service upgrades lack visibility. You don’t have OS or SSH access so you can’t keep track of things. You just simply wait.
No longer do we have “zero downtime”. With Amazon RDS you have guaranteed downtime upgrades. No seriously.
As the field of databases fragments, we are wearing many more hats. If you like this challenge & enjoy being a generalist, you may feel at home here. But it is a long way from one platform one skill set career path.
Also, fragmented db platforms mean more complex recovery. I can’t stress this enough. It would be practically impossible to restore all microservices, all their underlying databases & all systems to one single point in time, if you needed to.
As the DBA role evolves, it also brings great opportunity. Those with solid database & data skills are sorely needed at startups and many Fortune 500 organizations.
What I’m seeing is that organizations have lost much of the discipline they had as separate dba or operations departments. Schemaless databases have proliferated, and performance has suffered.
All these are more complex now, but strong DBA, performance & troubleshooting skills are needed now more than ever.
A recent NYT piece on our aging American infrastructure got me thinking. It seems that roads, bridges, airports & city sewer systems are all in need of repair. Sadly, as budgets to maintain these systems in good repair are often short, they become larger problems to fix as their status becomes critical.
“Americans have an impoverished and immature conception of technology, one that fetishizes innovation as a kind of art and demeans upkeep as a mere drudgery.”
I’m not sure this is an American-only phenomenon. However I do see it a lot with technology companies & startups.
1. Do we have to manage ops in the cloud?
The cloud has enabled infrastructure automation in some pretty phenomenal ways. Code pipelines can deliver changes to a repo, through automated unit testing, and out to customers all without human intervention. This makes teams more agile, and ultimately businesses faster & more profitable.
We might be distracted enough to stop worrying about operations altogether. After all, Amazon knows how to manage broken servers & alert us, right? I wrote about whether we have to manage operations in the cloud previously, as this sentiment seems to be growing.
Modern applications have a ton of interdependencies. Even with decent integration testing, the full stack is complex, and requires monitoring. Co-tenancy can complicate your performance tuning efforts as neighboring customers may directly affect your application. Third party services may be delivered from smaller or less experienced companies, whose SLA may be limiting besides. And hey if Amazon goes down, I can just tell my customers it was their fault, right?
Chances are you’ll say no. He was part of the original Facebook team alongside Zuckerberg. You don’t know his name? He had the sexy job of, you guessed it, maintenance! He was the operations guy. Did he write the application code? More than likely he knew that code very well, as he had to fix & maintain it. Along with the infrastructure to scale & support Facebook’s massive growth.
Ward Cunningham has an excellent interview about technical debt. Is a little bit ok? Maybe. But each amount is kicking the can down the road. As the NYT article on maintenance makes clear, you can move the responsibility on to the next administration, the next term, or someone else, but eventually you’ll have a critical problem on your hands, which will be much more expensive to fix.
As they explored options, they found an alternative that might save them a boatload of money. They considered switching to an IBM alternative called Cobalt.
And I mean this is Silicon Valley, what could go wrong?
In the case of Palantir, they claim to be an open system. And of course this is good marketing. Essential, in fact, to get the contract. Promise that it’s easy to switch. Don’t dig too deep into the technical details there. According to the article, a Palantir spokesperson claims:
“Palantir is an open platform. As with all our customers, their data & analysis are available to them at all times in an open & nonproprietary format.”
And that does appear to be true. What appears to be troubling the NYPD isn’t that they can’t get the analysis, for that’s available to them in perpetuity, within the Palantir system. But getting access to how the analysis is done, well now that’s the secret sauce. Palantir of course is not going to let go of that.
And that’s the devil in the details when you want to switch to a competing service.
Although the NYPD can get their data into & out of the Palantir system easily, that’s just referring to the raw data. That’s the data they ingested in the first place, arrest records, license plate reads, parking tickets, stuff like that.
“This notion of how portable your data is when you engage in a contract with a platform is really, really complex, and hasn’t really been tested” – Tal Klein
Palantir’s secret sauce, their intellectual property, is finding the needle in the haystack. What pieces of data are relevant & how can I present the detectives with the right information at the right time?
Analysis *is* the algorithms. It’s the big data 64 million dollar question. Or in this case $3.5 million per year, as the contract is reported to be worth!
The web is bringing us great platforms, like google & amazon cloud. It’s bringing a myriad of AI solutions to our fingertips. Palantir is providing a push button solution to those in need of insights like the NYPD.
The Cobalt solution that IBM is offering goes the other way. Build it yourself, manage it, and crucially control it. And that’s the difference.
It remains to be seen how the rush to migrate the universe of computing to Amazon’s own cloud will settle out. Right now they’re in a growth phase, so it’s all about lowering prices. But at some point their market muscle will mean they can go the Oracle route, a la Larry Ellison. That’s when customers start feeling the squeeze.
If the NYPD example is any indication, it could get ugly!
One of the things that is exciting about the cloud is the reduced need for operations staff. There seem to be two drivers of this trend. One is devops, and all the automation that comes with it. As we formalize configurations, things become repeatable, and fewer people can manage greater armies of servers.
The second is by moving to a cloud hosting provider, we essentially outsource the operations to their team.
1. Pretty abstractions? Still hardware buried somewhere
That’s right, beneath all the virtual EC2 instances & VPCs there is physical hardware. Huge datacenters sit in Northern Virginia, Oregon, Ireland, London and many other locations. Within them there are racks upon racks of servers. The hypervisor layer, the abstraction built on top of that, orchestrates everything.
Although we outsource the management of those datacenters to Amazon, there are still responsibilities we carry. Let’s dig into those more.
These days we see demand for a full-stack developer. That is, someone who does not only front-end dev, but also backend. In turn, they are often asked to wear the hat of ops: spin up an EC2 instance, decide on the capacity & size, choose the proper disk I/O, place it within the right subnet & VPC, & then configure the security groups properly.
All of these tasks would previously have been managed by a dedicated ops team, but now those responsibilities are being put on developers’ shoulders. In some cases, such as with microservices, devs also carry the on-call duties of their applications.
Lastly there is likely ops to handle automation. Devops will formalize configurations, into ansible playbooks or chef recipes, so they can be checked into version control. At this point infrastructure can even be unit tested.
In previous eras, ops teams would be heavily involved with design of applications & architecture to support that. Now that may be handed to devs, but it still needs to happen.
Furthermore, resiliency is said to be the customer’s responsibility. In the pre-cloud days, hardware was more reliable. It had a slower failure rate. With virtual machines, they’re expected to fail, and all the components to make your applications resilient are given to you. But it’s your job to architect them together.
That means your applications need to be self-healing. Failures need to be detected, taken out of autoscaling groups, and replaced. All automatically. Code or not, that is certainly operations.
I’ve seen quite a few outages in the past year, from Dropbox to Airbnb, and Dyn themselves. Ultimately these outages could be tied back to a failure with Amazon. But when your business customers are relying on your service, it is *YOUR* business that answers to its own SLA.
In the news we see many of these firms pointing the finger at Amazon, “hey it’s not our fault, our cloud provider went down!”. Ultimately your customers don’t care. They don’t want excuses. If using multiple regions in AWS is not sufficient, you’ll need to build your application to be multi-cloud.
Remember, while you outsource your operations to Amazon, you’re getting very professional management of those systems. However they will optimize for their many customers. Your particular problems are less of a concern.
6. Only you can think holistically about interdependencies
Your application more than likely uses a number of APIs to capture data, perhaps do single sign on or even a third party database like Firebase. It’s your responsibility to do integration testing. All that becomes more complex in the cloud.
SaaS solutions are everywhere now. Auth0, Firebase and an infinite variety of third-party APIs complicate reliability, security, storage, performance, integration testing & debugging.
Security is a traditional responsibility held by the operations hat. Much of that becomes more complex in the cloud. With serverless applications for example you may use a few APIs, plus an authentication broker, and a backend database. As this list of services grows, the code you write may decrease. But testing & securing it all becomes much more complex.
With more services like this, the attack vector or surface area becomes greater. Each of those services, can and will have bugs. What if a zero day is found in the authentication broker, allowing a hacker to break into a broad cross section of applications across the internet? How do you discover this? What if your vendor hasn’t found out yet?
Back to point #1 above, all these virtual servers sit on real physical servers. That affects customers in two ways. One, you may be sharing the same host. That is, if you use a very small VM, it may sit alongside another customer with a small VM. If those eat up CPU cycles or network on that box, neighbors or co-tenants will suffer.
There are many other instance types where you get your own dedicated hardware. With those you have your own NIC as well, so no competition. Except wait, there’s network storage! That’s right, all the machines in the AWS environment use EBS now, which is all co-tenant. So your data is sitting alongside other customers’ data, and you are all fighting for usage of the same disk read heads.
One way to mitigate this is to configure specific provisioned IOPS for your servers. But that costs more. It’s normally reserved for database instances where disk I/O is really crucial.
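In Terraform, provisioned IOPS is just a volume type & an IOPS count, something like this sketch (the size and IOPS values are illustrative):
resource "aws_ebs_volume" "db-data" {
  availability_zone = "us-east-1a"
  size              = 100
  type              = "io1" # provisioned IOPS volume type
  iops              = 2000  # you pay for these whether you use them or not
}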
Granted the NewRelics of the world will certainly help us with this process. But they’re not giving us a hypervisor or global view of those servers, network or storage. So we can’t see how the overall systems performance may be impacted.
When security is done well, you don’t have breakins, you don’t have data stolen, everything just runs smoothly. Operations is like that too. When it is done well it can be invisible.
It can also be invisible in a different way. When you deploy your application on serverless, all the servers & autoscaling is completely abstracted away. So when you get some weird outage because the farm of servers is offline, or because you hit some account limit in the number of functions you can run at once, then it quickly comes into focus.
Beware of invisible operations, because it’s harder to see what to monitor, and know how to stay ahead of outages.
At the end of the day you can’t outsource ownership of your application or your business. The holistic view of your application in totality can only be understood by your engineers.
And that in the end is what operations is all about, no matter who’s wearing the hat!
Amazon is about to launch a product called Glue. As you can see below, this is the last piece in the data warehousing puzzle. With that in place, Amazon will own you! Or at least have push-button products to meet all of the enterprise’s varying needs.
Even if you’re a small startup, you can do big-shot big enterprise data warehousing. That means everyone can use cutting edge data driven techniques for product & business decisions.
Redshift is like the OLAP databases of years past, the Oracles of the world purpose-built for warehousing data. Obviously without the crazy licensing model Oracle was famous for. With Amazon you can get an enterprise-class data warehouse for modest hourly prices.
Spectrum is a very new extension of Redshift allowing you to access & query S3 file data directly. This means you can have petabytes of data that you can query before ever loading it. You will still ETL and load portions of it, but with Spectrum you can access the offline data too.
In the old Oracle days this was called an EXTERNAL TABLE. I mention this only to say that Amazon isn’t doing anything that hasn’t been done before. Rather they’re bringing these advanced features within reach of everyday startups. That’s cool.
Glue is still in beta, but if the re:Invent talk above is any indication, it’s set to disrupt an entire industry. Wow!
Glue first catalogs your data sources. What does this mean? It scans them & models their schemas.
It then generates sample Python ETL code. Modify it, or write your own. Share your code on Git. Or borrow other open source pieces that already address your specific ETL use case!
Lastly it includes a job scheduler which handles dependencies. Job A must be completed before B can run and so forth. Error handling & logging are also all included.
Serverless means deploying functions directly into the cloud. No servers, no configuration. All the systems administration & automation is hidden. No more devops to argue with! Amazon’s own offering is called Lambda.
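In Terraform terms, a Lambda function can be as small as this sketch; the zip file, handler, runtime and role names are all placeholders.
resource "aws_iam_role" "lambda-exec" {
  name = "my-lambda-exec-role"

  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
POLICY
}

resource "aws_lambda_function" "hello" {
  function_name = "hello-world"
  filename      = "hello.zip" # your zipped function code
  handler       = "index.handler"
  runtime       = "nodejs8.10"
  role          = "${aws_iam_role.lambda-exec.arn}"
}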