We’ve all heard the success stories at firms that have grappled with automation. The dividends are legendary.
Take Amazon themselves for example. By decoupling their teams, allowing each to grow independently and at their own pace, they’ve been able to scale massively.
Join 38,000 others and follow Sean Hull on twitter @hullsean.
One look at the AWS dashboard these days, or their Wikipedia page, reveals over 90 services on offer. And each of those is growing and expanding by the day.
I’ve worked with a lot of startups, trying to get there. They’ve heard the gospel, and want to gain the benefits themselves.
Here are the challenges I’ve found.
1. Building ain’t easy
One example story was building an ELK box. ELK is Elasticsearch, Logstash and Kibana. It provides a centralized place to send all your application & service logs, and collects them all together on one dashboard. It’s the business intelligence of devops & software development. Super valuable tool.
In building our solution, we took a marketplace AMI off the shelf, and then customized that. After building the Terraform code to spin up the server, we added Ansible scripts to further customize it. This allowed us to add a cronjob for backups, set a password, add additional Logstash configs, and a few other important housekeeping tasks.
All was great until we hit a snag. We found some CloudWatch logs were not making their way into ELK. Digging through the log messages, we eventually uncovered an error, and that was caused by a conflicting port configuration. So we removed the unused port setting in logstash.conf, and problem solved.
Later, we rebuilt the server, and having all the scripts in place meant that was pretty quick. In this case we just needed to resize the root volume by 25x to make room for future logs. This was 3 lines of Terraform code and then done!
A couple of weeks later, however, we found missing logs again. Digging, digging, digging, and then we finally discovered it was a repeat of our old problem! Turns out the change to logstash.conf never got rolled into the automation scripts. It was done manually! Bad bad!
Moral of the story, with automation, your workflow needs to change. You should *always be working on the scripts* and then reapplying those. Never work on the server directly!
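One way to catch this kind of drift early is to compare the copy of a config file kept in the automation repo against the copy actually on the server. Here is a minimal sketch of that idea, not what we ran. The function names and file paths are hypothetical, and it assumes both copies are readable from wherever the check runs.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def has_drifted(managed: Path, deployed: Path) -> bool:
    """True when the deployed file no longer matches the repo-managed copy."""
    return sha256_of(managed) != sha256_of(deployed)


if __name__ == "__main__":
    # Temporary files stand in for the repo copy and the live server copy.
    with tempfile.TemporaryDirectory() as tmp:
        repo_copy = Path(tmp) / "repo_logstash.conf"
        live_copy = Path(tmp) / "live_logstash.conf"
        repo_copy.write_text("placeholder config\n")
        live_copy.write_text("placeholder config\n")
        print("drift before manual edit:", has_drifted(repo_copy, live_copy))

        # Simulate someone editing the file by hand on the box.
        live_copy.write_text("placeholder config, hand-edited\n")
        print("drift after manual edit:", has_drifted(repo_copy, live_copy))
```

Run from cron or a CI job, a check like this would have flagged the hand-edited logstash.conf before the rebuild wiped it out.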
Time to eat my own dogfood!
2. Troubleshooting is tough
In the automation universe, as I wrote above, you really want to avoid logging into servers and doing things manually. But that may be easier said than done.
Take another example. I had an ssh key distribution script, which I repurposed from the Terraform Community Modules. It works great when it works. It gets injected onto the server at boot time by Terraform, inside the user-data script.
The code gets added to cron, and relies on awscli. As it turns out, awscli is *not* on all of the AWS Linux images. Who knows why?!? But that’s where we are.
Should be easy to install. Use yum to get pip (the Python package manager) installed. Then use pip to install awscli. The script even has *both* yum and apt-get commands to attempt to install pip on either Ubuntu or Amazon Linux. Problem is sometimes it doesn’t work. Sometimes? You ask. Yes indeed.
Digging further, it seems that the new pip package gets installed in /usr/local/bin, while it used to install in /usr/bin/. Seems simple. Add a symlink. Yeah did that. Sometimes the package has a different name, such as python-pip3. Great!
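Rather than hardcoding one location or relying on a symlink, a script can look for pip under several names and directories. This is a rough sketch of that lookup, assuming only the two directories mentioned above. The candidate lists and the `find_pip` helper are my own illustration, not part of the original key distribution script.

```python
import os
import shutil
from typing import Optional, Sequence

# Hypothetical candidates; adjust for the images you actually run.
CANDIDATE_NAMES = ("pip", "pip3")
CANDIDATE_DIRS = ("/usr/local/bin", "/usr/bin")


def find_pip(
    names: Sequence[str] = CANDIDATE_NAMES,
    dirs: Sequence[str] = CANDIDATE_DIRS,
) -> Optional[str]:
    """Return the full path of the first pip executable found, or None."""
    search_path = os.pathsep.join(dirs)
    for name in names:
        # shutil.which only searches the directories we pass in here.
        found = shutil.which(name, path=search_path)
        if found:
            return found
    return None


if __name__ == "__main__":
    print(find_pip() or "pip not found in any candidate directory")
```

The point of the design is that the install location becomes data, so the next time a package moves, the fix is one more entry in a list.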
Now all this is magnified because you can’t just go on the box and go through the steps. Why? Because in the primordial Cro-Magnon universe that is Linux server boot time, sometimes things happen in weird orders, or slower. So you may have something missing during that period that is available later. So after boot you see no errors.
Yes, complicated. Yes, you need to build, destroy, build, destroy the server in endless cycles.
At the next level of automation, we will implement an infrastructure testing pipeline. This will automatically build the server for you. The infrastructure unit testing framework seems pretty darn cool. And there is also Gruntwork’s Terratest.
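One defensive pattern for boot-time ordering problems is to poll for a dependency instead of assuming it is already there. Below is a minimal sketch, assuming the dependency is an executable that eventually shows up on PATH (for awscli that executable is `aws`). The `wait_for_command` helper and its timeout values are hypothetical.

```python
import shutil
import time
from typing import Callable, Optional


def wait_for_command(
    name: str,
    timeout: float = 120.0,
    interval: float = 5.0,
    lookup: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Poll until `name` can be found; return its path, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        found = lookup(name)
        if found:
            return found
        if time.monotonic() >= deadline:
            return None
        # Not there yet; give the rest of the boot sequence time to catch up.
        time.sleep(interval)


if __name__ == "__main__":
    path = wait_for_command("aws", timeout=10.0, interval=1.0)
    print(path or "aws did not appear on PATH before the timeout")
```

This doesn’t remove the need for build and destroy cycles, but it turns a silent boot-time race into an explicit timeout you can log.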
3. The dividend is agility
What have I seen in terms of agility?
Well, moving our application to a new region takes 20 minutes. Crazy as that sounds, everything from the VPC to 3 private subnets, 3 public subnets, bastion boxes, load balancers, RDS & Redis instances, security groups, ingress rules, IAM roles, users, S3 buckets, ECS cluster, various EC2 instances, Route 53 zones & CNAMEs, plus even EIPs, can all be moved with a few simple code changes. Wow!
What else? We can resize our ELK box root volume by deploying a brand new setup, all in about ten minutes.
This kind of speed is so exciting. It brings repeatability to your engineering processes. It brings confidence to all of those components.
And best of all it allows the business to experiment with new product ideas, and accelerate in the marketplace.
And we all know what that means!