I recently ran across this interesting question on a technology forum.
“I’m an engineering team lead at a startup in NYC. Our app is written in Ruby on Rails and hosted on Heroku. We use metrics such as the built-in metrics on Heroku, as well as New Relic for performance monitoring. This summer, we’re expecting a large influx of traffic from a new partnership and would like to have confidence that our system can handle the load.”
“I’ve tried to wrap my head around different types of performance/load testing tools like JMeter, Blazemeter, and others. Additionally, I’ve experimented with scripts which have grown more complex and I’m following rabbit holes of functionality within JMeter (such as loading a CSV file for dynamic user login, and using response data in subsequent requests, etc.). Ultimately, I feel this might be best left to consultants or experts who could be far more experienced and also provide our organization an opportunity to learn from them on key concepts and best practices.”
I’ve been doing performance tuning since the old dot-com days.
It used to be that you'd point a LoadRunner-type tool at your webpage and let it run. Then you'd watch the load, memory & disk on your webserver or database. Before long you'd find some bottlenecks. Shortage of resources (memory, CPU, disk I/O) or slow queries were often the culprits. Optimizing queries, and ripping out those pesky ORMs, usually did the trick.
Today things are quite a bit more complicated. Yes, JMeter & Blazemeter are great tools. You might also get New Relic installed on your web nodes. This will give you instrumentation on where your app spends its time. However, it may still not be easy. With microservices, you have the Docker container & orchestration layer to consider. In the AWS environment you can have bottlenecks on disk I/O, where provisioned IOPS can help. But instance size also impacts network performance in the weird world of multi-tenant. So there's that too!
What's more, a lot of frameworks are starting to steer back towards ORMs again. Sadly this is not a good trend for performance. On the flip side, if you're using RDS, your default MySQL or Postgres settings may be decent. And newer versions of MySQL are getting some damn fancy & performant indexes. So there's lots of improvement there.
There is also the question of simulating real users. What is a real user? What is an ACTIVE user? These questions may seem obvious, yet I've worked at firms where engineering, product, sales & biz-dev all had different answers. But let's say you've answered that. Does our load test simply log in the user? Or does it hit a popular section of the site? How about an unpopular section of the site? Often we are guessing at what "real world" users do and how they use our app.
ECS is Amazon's Elastic Container Service. If you have a dockerized app, this is one way to get it deployed in the cloud. It is basically Amazon's bootleg Kubernetes clone. And not nearly as feature rich! 🙂
That said, ECS does work, and it will allow you to get your application going on Amazon. Soon enough EKS (Amazon's Kubernetes service) will be production-ready, and we'll all happily switch.
Meantime, if you're struggling with weird errors, or with ECS failing silently, I have some help here for you. Hopefully these are error cases you've run into, and this helps you solve them.
Why is my container in a stopped state?
Containers can fail for a lot of different reasons. The causes I ran into most often were:
o port mismatches
o missing links in the task definition
o shortage of resources (see #2 below)
When ECS repeatedly fails, it leaves the stopped containers around. These eat up system resources, without much visible feedback. "df -k" or "df -m" won't show you the volumes filling up. *BUT* there are logical volumes which can fill.
Do this to see the status:
[root@ip-10-111-40-30 ~]# lvdisplay
--- Logical volume ---
LV Name docker-pool
VG Name docker
LV UUID aSSS-fEEE-d333-V999-e999-a000-t11111
LV Write Access read/write
LV Creation host, time ip-10-111-40-30, 2018-04-21 18:16:19 +0000
LV Pool metadata docker-pool_tmeta
LV Pool data docker-pool_tdata
LV Status available
# open 3
LV Size 21.73 GiB
Allocated pool data 18.81%
Allocated metadata 6.10%
Current LE 5562
Read ahead sectors auto
- currently set to 256
Block device 253:2
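To reclaim that space, one simple approach (a sketch, assuming a standard Docker CLI on the instance) is to clear out the stopped containers ECS left behind:

```shell
# List all stopped (exited) containers quietly eating the thin pool
docker ps -aq -f status=exited

# Remove them; their layers are returned to the docker-pool volume
docker rm $(docker ps -aq -f status=exited)
```

Re-run lvdisplay afterwards and you should see "Allocated pool data" drop.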
When a service is run, ECS wants to have *all* of the containers in the task running together, just like when you use docker-compose. If one container fails, ecs-agent may decide to kill the entire service and restart it. So you may see weird things happening in "docker logs" for one container, simply because another failed. What to do?
First, look at your task definition and set "essential": false on the suspect container. That way if one fails, the others will still run, and you can eliminate the working container as a cause.
Next, remember that some containers may start up almost instantly. nginx, for example, has a very small footprint and can start in a second or two. So if *it* depends on another container that is slow to start, nginx will fail. That's because in the strange world of docker discovery, the other container doesn't exist yet. When nginx references it, it complains that it can't find the upstream server you are pointing to.
Solution? Be sure you have a “links” section in your task definition. This tells ecs-agent, that one container depends on another (think of the depends_on flag in docker-compose).
As you are building your ECS manifest, aka task definition, run through your docker-compose file carefully. Review the links, essential flags and depends_on settings. Then be sure to mirror those in your ECS task.
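Here's a sketch of how those settings map onto a task definition (the container names, image and memory values are made-up placeholders):

```json
{
  "family": "myapp",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "myorg/myapp:latest",
      "essential": true,
      "memory": 512
    },
    {
      "name": "nginx",
      "image": "nginx:alpine",
      "essential": false,
      "memory": 128,
      "links": ["app"],
      "portMappings": [{ "containerPort": 80, "hostPort": 80 }]
    }
  ]
}
```

The "links" entry plays the role of docker-compose's depends_on, and "essential": false keeps one flaky container from taking down the whole task.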
When in doubt, reduce the scope of your problem. That is define *only one* container, then start the service. Once that container works, add a second. When you get that working as well, add a third or other container.
This approach allows you to eliminate interconnecting dependencies, and related problems.
I guess I enjoy posting and sharing knowledge there, because many of the questions seem so familiar, as ones that I pondered at some point or other along the way.
Here are a few questions, and the answers I shared.
1. I’m lacking direction
“Paths will change. Just finish. Which degree you end up with does not matter. It will be how you apply it. Gates, Dell, Zuckerberg & Jobs all quit school right where you are now to pursue real world business.”
2. Overwhelmed with amount of work, I want to quit my job. Advice?
“Long story short. My current manager is completely unrealistic about the amount of work to be done in a short amount of time. I keep working overtime every day and weekends to fix issues and I’m tired of it. I want to go to some training and take vacation soon and while my company approved my training, I’m afraid to ask for vacation while this project needs to be done ASAP.
I’m just feeling so overworked that I really need vacation ASAP.
I’m so tempted to quit my job or start another job hunt”
“I found myself in a similar situation a few years ago. And a colleague advised me: ‘Sean, sometimes you have to let things break a little.’ This seemed like incredibly odd advice. However, when I tried it I was very surprised. Management didn’t “blame it all on me” as I expected they would. In fact they didn’t blame anything on me, merely adjusted their timelines.
Lesson learned: we cannot carry the entire org’s problems on our own shoulders. And no one is expecting us to.”
3. Possible red flags in startup? How can I know for sure?
“Basically my question is this: what questions should I be asking to know what I need to know? The main thing I’m afraid of is the engineering manager treating the engineers like dogshit, where we work insane hours and don’t really have control over what we work on. How can I coax that information out of them?”
“I would trust your gut. I have worked in companies that were all over the place organizationally, but there were no weird tells on Glassdoor.
Also, as far as the hours you are expected to work, keep the emails for documentation. Remember there are also labor laws protecting W2 employees, so you’re fine. Just leave at 5 :)”
4. I accepted a job, then got a better offer from another company
“So…I accepted and started a new job…but 1.5 weeks later I hear back with an even better offer from a larger company I applied to 4 months ago.”
“It is a tough position to be in, but also a “good problem to have”.
Don’t burn bridges. But business is business, as they say. You could ask if they want to counteroffer, but there may be bad blood now.”
“I landed a job with a big name company (non-Big 4). The offered pay is a solid 25% jump from my old job, the team is what I’ve been looking for, 10% annual bonus, etc. Should ask for more money? Every dime will obviously make my life easier and I certainly don’t want to fall behind on my career pay as a whole.”
One reader’s response:
“If it makes you uncomfortable to ask for more money, you might also have a think about what you really value and negotiate for that instead. Maybe that’s two extra weeks of vacation?”
“Two extra weeks of vacation is money. Negotiate the best you can. Only you can advocate for yourself.”
The basics aren’t tough. You need to know the anatomy of a Dockerfile, and how to set up a docker-compose.yml to ease the headache of docker run. You also should know how to manage docker images, and use docker ps to find out what’s currently running. And get an interactive shell (docker exec -it containerid bash). You’ll also make friends with inspect. But what else?
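In practice that everyday toolkit looks something like this (the container name is illustrative):

```shell
docker ps                          # what's running right now?
docker images                      # what images are on disk?
docker exec -it mycontainer bash   # interactive shell inside a running container
docker inspect mycontainer         # full JSON config: mounts, network, env vars
docker logs mycontainer            # stdout/stderr of the container's main process
```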
1. Manage image bloat
Docker images can get quite large. Even as you try to pare them down, they can grow. Why is this?
Turns out the architecture of docker means as you add more stuff, it creates more “layers”. So even as you delete files, the lower or earlier layers still contain your files.
One option, during a package install you can do this:
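For apt-based images it looks like this (the curl package is just an illustration):

```dockerfile
# Install, then clean up apt's caches in the SAME layer, chained with &&
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*
```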
This will immediately clean up the crap that apt-get leaves behind, without it ever becoming permanent in that layer. Cool! As long as you use “&&”, it is part of that same RUN command, and thus part of that same layer.
Another option is you can flatten a big image. Something like this should work:
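A sketch, assuming the container you want to flatten is the most recently created one (the image tag is a placeholder):

```shell
# Export the container's filesystem (all layers collapsed into one),
# then import it back as a fresh single-layer image
docker export $(docker ps -lq) | docker import - myapp:flattened
```

Note you lose the image metadata (ENTRYPOINT, ENV, etc.) this way, so you may need to re-specify those at import or run time.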
Running docker containers on dev is great, and it can be a fast and easy way to get things running. Plus it can work across dev environments well, so it solves a lot of problems.
But what about when you want to get those containers up into the cloud? That’s where orchestration comes in. At the moment you can use docker’s own swarm or choose fleet or mesos.
But the biggest players seem to be Kubernetes & ECS. The former of course is what all the cool kids in town are using, and coupled with the Helm package manager, it becomes a very manageable system. Get your pods, services, volumes, replicasets & deployments ready to go!
On the other hand, Amazon is pushing ahead with its Elastic Container Service, which is native to AWS, and not open source. It works well, allowing you to apply a JSON manifest to create a task. Then just as with Kubernetes, you create a “service” to run one or more copies of that task. Think of the task as a docker-compose file. It’s in JSON, but it basically specifies the same types of things: entrypoint, ports, base image, environment etc.
For those wanting to go multi-cloud, Kubernetes certainly has an appeal. But Amazon is on the attack. They have announced a service to further ease container deployments, dubbed Amazon Fargate. Remember how Lambda allowed you to just deploy your *code* into the cloud, and let Amazon worry about the rest? Imagine you can do that with containers, and that’s what Fargate is.
There are a few different options for where to store those docker images.
One choice is Docker Hub. It’s not feature rich, but it does the job. There is also Quay.io. Alternatively you can run your own registry. It’s as easy as:
$ docker run -d -p 5000:5000 registry:2
Of course if you’re running your own registry, now you need to manage that, and think about its uptime and its dependability to your deployment pipeline.
If you’re using ECS, you’ll be able to use ECR, which is a private docker registry that comes with your AWS account. I believe you can use it even if you’re not on ECS. The login process is a little weird.
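The dance looks roughly like this (account ID, region and repo name are all placeholders):

```shell
# Ask AWS for a short-lived docker login command, then execute it
$(aws ecr get-login --no-include-email --region us-east-1)

# Tag and push as you would to any registry
docker tag myapp:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
```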
Once you have those pieces in place, you can do some fun things. Your jenkins deploy pipeline can use docker containers for testing, to spinup a copy of your app just to run some unittests, or it can build your images, and push them to your registry, for later use in ECS tasks or Kubernetes manifests. Awesome sauce!
I put together some of the most common ones I’ve heard, and my thoughts on the right solution.
1. How would you load test?
Here’s an interesting question. Do you talk about tools? How do you approach the problem?
The first thing I talk about is simulating real users. If your site normally has 1000 active users, how will it behave when it has 5000, 10,000, 100,000 or 1million? We can simulate this by using a load testing tool, and monitoring the infrastructure and database during that test.
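With JMeter in non-GUI mode, for instance, you can step the concurrency up and watch your dashboards at each level. A sketch, assuming a test plan file that reads a "users" property via __P (both the file name and property are hypothetical):

```shell
# Run the same test plan at increasing load levels
for users in 1000 5000 10000; do
  jmeter -n -t load_plan.jmx -Jusers=$users -l results_$users.jtl
done
```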
But how accurate are those tests? What do active users do? Log in to the site? Edit and change some data? Where do active users spend most of their time? Are there some areas of the site that are busier than others? What about some dark corner of the site or product that gets less use, but is also less tuned? Suddenly a few users decide they want that feature, and performance slides!
Real world usage patterns are unpredictable. There is as much art as science to this type of load testing.
2. Why is Amazon S3’s 99.999999999% promise *not* enough??
I’ve heard people say before that S3 is extremely reliable. But is it?
According to their SLA, the durability guarantee is eleven nines. What does this mean? Durability is confidence that a file, once saved, will not be lost. It’s on storage, and that storage has redundant copies. Great. You can be confident you will never lose a file.
What about uptime? The availability SLA is only 99.9%, which allows roughly 45 minutes of DOWNTIME per month. Surprise! And if your product fails when S3 files are unreachable, guess what, your business is down for the better part of an hour each month.
That’s actually quite a *lot* of downtime.
Solution: You better be doing cross-region replication. You have the tools and the cloud, now get to work!
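With the CLI, a minimal replication config looks something like this (bucket names and the role ARN are placeholders, and versioning must be enabled on both buckets):

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-everything",
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::myapp-assets-replica",
        "StorageClass": "STANDARD"
      }
    }
  ]
}
```

You'd apply it with something like aws s3api put-bucket-replication --bucket myapp-assets --replication-configuration file://replication.json, with the replica bucket sitting in a different region.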
I hear a lot of talk about continuous integration. I’ve even seen it as a line item on a todo list I was handed. Hmmm…
I asked the manager, “So it says here: set up CI/CD. Are there already unit tests written? What about integration tests?” Turns out the team is not yet on board with writing tests. I gently explain that automated builds are not going to get very far without tests to run. 🙂
CI/CD requires the team to be on-board. It’s first a cultural change in how you develop code.
It means more regular code checkins. It means every engineer promises not to break the build.
It means writing enough tests for good code coverage.
Amazon has made VPC peering a lot easier. But what the heck is it?
In the world of cloud networking, everything is virtual. You have VPCs and inside those you have public and private subnets. What if you have multiple AWS accounts? VPC peering can allow you to connect backend private subnets, without going across the public internet at all.
As security becomes front & center for more businesses, this can be a huge win.
What’s more, it’s easier now because it is semi-managed by AWS.
As more and more small startups put together teams to build their MVP, the offshore market has never been hotter. And there are very talented engineers in faraway places, from Eastern Europe, to India and China. And South America too.
At ⅓ to ¼ the price, why hire US-based people? Well, one reason might be compliance: if you have sensitive data that must be handled by US nationals, offshore won’t work.
Why New York based? Well, there is the value of being face-to-face and working side by side with teams. It also eases the language barrier & communication. And it avoids the timezone challenges that sometimes make communication difficult.
And lastly ownership. With resources that are focused solely on you, and for which you are a big customer, you’re likely to get more personalized focused attention.
Antipatterns are interesting because you see them regularly, and yet they are the *wrong* way to solve a problem: either they’re slower, or there is a better, more reliable way to do it.
o Using EFS, Amazon’s NFS solution, instead of putting assets in S3.
It might help you avoid rewriting code, but in the end S3 is definitely the way to go.
o Hardcoded IPs in security group rules instead of naming a group.
Yes I’ve seen this. If you specify each webserver individually, what happens when you autoscale? Answer, the new nodes break! The solution is to put all the webservers in a group, and then add a security group rule allowing access from that group. Voila!
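With the AWS CLI the group-to-group rule looks roughly like this (all IDs here are placeholders):

```shell
# Create one group to hold all the webservers
aws ec2 create-security-group --group-name web \
  --description "webservers" --vpc-id vpc-0abc1234

# Let the db group accept MySQL traffic FROM the web group,
# instead of from individually hardcoded IPs
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db000000000 \
  --protocol tcp --port 3306 \
  --source-group sg-0web00000000
```

New autoscaled webservers just get launched into the web group, and the rule covers them automatically.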
o Passing credentials around instead of using AWS instance level roles
Credentials are the bane of applications. You hardcode them and things break later. Or they create a vulnerability that you forget about. That’s why AWS invented roles. But did you know a server *itself* can have a role? That means that server, and any software running on it, has permissions to certain APIs within the Amazon universe. You can change a server’s role or its underlying policies while the server is still running. No restart required!
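A sketch of attaching a role to a live instance (instance ID, profile and bucket names are hypothetical):

```shell
# Attach a role, via its instance profile, to a RUNNING instance.
# No restart needed.
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=myapp-s3-reader

# On the instance itself, the CLI and SDKs pick up temporary
# credentials from instance metadata automatically
aws s3 ls s3://myapp-assets/
```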
Implement CI/CD as a task item
Don’t forget culture & process are the big hurdles. Installing a tool is easy. Getting everyone using it everyday is the challenge!
Reducing and managing docker image bloat
As you change your docker images, you add layers. Even as you delete things, the total image size only grows! Seems counterintuitive. What’s more, when you do all that work with yum or apt-get, those packages stay lying around. One way is to install packages and then clean up, all in one command. Another way is to export and import a finished image.
ssh-ing into servers or giving devs kubectl
Old habits die hard! I was watching Kelsey Hightower’s keynote at KubeCon. He made some great points about Kubernetes. If you give all the devs kubectl, then it’s kind of like allowing everybody to SSH into the boxes. It’s not the way to do it!