
How I resolved some tough docker problems on Amazon ECS


ECS is Amazon’s Elastic Container Service. If you have a dockerized app, this is one way to get it deployed in the cloud. It is basically Amazon’s bootleg Kubernetes clone. And not nearly as feature rich! πŸ™‚


That said, ECS does work, and it will allow you to get your application running on Amazon. Soon enough EKS (Amazon’s managed Kubernetes service) will be production ready, and we’ll all happily switch.

In the meantime, if you’re struggling with weird errors, or with deployments that fail silently, I have some help here for you. Hopefully these are error cases you’ve run into, and this helps you solve them.

1. Why is my container in a stopped state?

Containers can fail for a lot of different reasons. The causes I found include:

o port mismatches
o missing links in the task definition
o shortage of resources (see #2 below)

When ECS repeatedly fails, it leaves stopped containers lying around. These eat up system resources without much visible feedback. β€œdf -k” or β€œdf -m” won’t show you a filesystem filling up, *BUT* the logical volumes that back docker storage can fill.

Do this to see the status:


[root@ip-10-111-40-30 ~]# lvdisplay
  --- Logical volume ---
  LV Name                docker-pool
  VG Name                docker
  LV UUID                aSSS-fEEE-d333-V999-e999-a000-t11111
  LV Write Access        read/write
  LV Creation host, time ip-10-111-40-30, 2018-04-21 18:16:19 +0000
  LV Pool metadata       docker-pool_tmeta
  LV Pool data           docker-pool_tdata
  LV Status              available
  # open                 3
  LV Size                21.73 GiB
  Allocated pool data    18.81%
  Allocated metadata     6.10%
  Current LE             5562
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

[root@ip-10-111-40-30 ~]#

Related: 30 questions to ask a serverless fanboy

2. Why am I getting this error: β€œCouldn’t run containers – reason=RESOURCE:PORTS”?

I was seeing errors like the ones below. Your first thought might be that I had multiple containers on the same port. But no, there was no port conflict in the task definition.

What was happening was that containers were failing, but in inconsistent ways, so docker had old copies still sitting around.

On the ECS host, use β€œdocker ps -a” to list *ALL* containers, including stopped ones. Then use β€œdocker system prune” to clean up old resources.


INFO[0000] Using ECS task definition TaskDefinition="docker:5"
INFO[0000] Couldn't run containers reason="RESOURCE:PORTS"
INFO[0000] Couldn't run containers reason="RESOURCE:PORTS"
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"

INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
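
Here’s a minimal cleanup sketch, run on the ECS container instance itself:

# list everything, including stopped containers
docker ps -a

# or list just the stopped ones
docker ps -a --filter "status=exited"

# remove stopped containers, dangling images, unused networks and build cache
docker system prune

# on newer docker versions you can reclaim unused volumes too
docker system prune --volumes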

Related: What’s the luckiest thing that’s happened in your career?

3. My container gets killed before it fully starts

When a service runs, ECS wants to have *all* of the containers in the task running together, just like when you use docker-compose. If one container fails, ecs-agent may decide to kill the entire task and restart it. So you may see weird things happening in β€œdocker logs” for one container, simply because another one failed. What to do?

First look at your task definition, and set β€œessential”: false. That way if one container fails, the others will still run, so you can eliminate the working containers as a cause.

Next, remember that some containers start up almost instantly, nginx for example. Because it has a very small footprint, it can start in a second or two. So if *it* depends on another container that is slow to start, nginx will fail. That’s because in the strange world of docker discovery, that other container doesn’t exist yet. When nginx references it, it says hey, I don’t see the upstream server you’re pointing to.

Solution? Be sure you have a β€œlinks” section in your task definition. This tells ecs-agent that one container depends on another (think of the depends_on setting in docker-compose).
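
Here’s a rough sketch of what those two settings look like in a task definition, registered with the AWS CLI. The family, images and ports are hypothetical placeholders (container names borrowed from the example output above), and it assumes the EC2 launch type with bridge networking, where links are supported:

# sketch only, names and images are placeholders, adjust to your app
cat > sean-task.json <<'EOF'
{
  "family": "sean-app",
  "containerDefinitions": [
    {
      "name": "sean-postgres",
      "image": "postgres:9.6",
      "memory": 512,
      "essential": false
    },
    {
      "name": "sean-main",
      "image": "mycompany/sean-main:latest",
      "memory": 512,
      "essential": true,
      "links": ["sean-postgres"],
      "portMappings": [{"containerPort": 80, "hostPort": 80}]
    }
  ]
}
EOF

aws ecs register-task-definition --cli-input-json file://sean-task.json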

Related: Curve ball interview questions and answers

4. Understanding container ordering

As you build your ECS manifest, aka task definition, run through your docker-compose file carefully. Review the links, essential flags and depends_on settings, then be sure to mirror those in your ECS task definition.

When in doubt, reduce the scope of your problem. That is, define *only one* container, then start the service. Once that container works, add a second. When you get that working as well, add a third, and so on.

This approach allows you to eliminate interconnecting dependencies, and related problems.
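
Here’s a sketch of that incremental approach with the AWS CLI (the cluster, service and task definition names are hypothetical):

# start with a task definition revision containing only one container
aws ecs create-service --cluster sean-cluster --service-name sean-svc \
    --task-definition sean-app:1 --desired-count 1

# once that is healthy, register a new revision with the second container
# added, then point the service at it
aws ecs update-service --cluster sean-cluster --service sean-svc \
    --task-definition sean-app:2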

Related: Are generalists better at scaling the web?



Curve ball technology questions and solutions


I’ve been on the phone with a lot of companies lately. You might be surprised that some of the challenges firms struggle with in the cloud are repeated over and over.


I put together some of the most common ones I’ve heard, and my thoughts on the right solution.

1. How would you load test?

Here’s an interesting question. Do you talk about tools? How do you approach the problem?

The first thing I talk about is simulating real users. If your site normally has 1,000 active users, how will it behave when it has 5,000, 10,000, 100,000 or 1 million? We can simulate this by using a load testing tool, and monitoring the infrastructure and database during that test.

But how accurate are those tests? What do active users do? Log in to the site? Edit and change some data? Where do active users spend most of their time? Are there some areas of the site that are busier than others? What about some dark corner of the site or product that gets less use, but is also less tuned? Suddenly a few users decide they want that feature, and performance slides!

Real world usage patterns are unpredictable. There is as much art as science to this type of load testing.
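
As a starting point, even a simple command line tool like Apache Bench can simulate concurrent users while you watch your infrastructure and database dashboards. A sketch (URLs and numbers are made up):

# 50,000 requests, 200 at a time, against the login page
ab -n 50000 -c 200 https://www.example.com/login

# same idea against a heavier, less-tuned corner of the site
ab -n 5000 -c 50 https://www.example.com/reports/export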

Related: 30 questions to ask a serverless fanboy

2. Why is Amazon S3’s 99.999999999% promise *not* enough??

I’ve heard people say before that S3 is extremely reliable. But is it?

Amazon designs S3 for eleven nines of durability. What does this mean? Durability is confidence that a file, once saved, will not be lost. It’s on storage, and that storage has redundant copies. Great. You can be confident you will never lose a file.

What about uptime? The availability SLA is only 99.9%, which works out to roughly 45 minutes of DOWNTIME per month. Surprise! And if your product fails when S3 files are unreachable, guess what, your business can be down for the better part of an hour every month.

That’s actually quite a *lot* of downtime.

Solution: You better be doing cross-region replication. You have the tools and the cloud, now get to work!
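
A rough sketch of setting that up with the AWS CLI. The bucket names and IAM role are hypothetical, and both buckets need versioning enabled before replication will work:

# versioning is a prerequisite on both source and destination buckets
aws s3api put-bucket-versioning --bucket my-assets \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket my-assets-replica \
    --versioning-configuration Status=Enabled

# replication config; the role must allow S3 to replicate between the buckets
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
  "Rules": [
    {
      "Status": "Enabled",
      "Prefix": "",
      "Destination": { "Bucket": "arn:aws:s3:::my-assets-replica" }
    }
  ]
}
EOF

aws s3api put-bucket-replication --bucket my-assets \
    --replication-configuration file://replication.json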

Related: What’s the luckiest thing that’s happened in your career?

3. Why is continuous integration not about tools?

I hear a lot of talk about continuous integration. I’ve even seen it as a line item on a todo list I was handed. Hmmm…

I asked the manager, β€œso it says here set up CI/CD. Are there already unit tests written? What about integration tests?” Turns out the team was not yet on board with writing tests. I gently explained that automated builds are not going to get very far without tests to run. πŸ™‚

CI/CD requires the team to be on board. It’s first a cultural change in how you develop code.

It means more regular code check-ins. It means every engineer promises not to break the build.

It means writing enough tests for good code coverage.

Related: How I use progress reports to achieve consulting success

4. What can VPC peering do for you?

Amazon has made VPC peering a lot easier. But what the heck is it?

In the world of cloud networking, everything is virtual. You have VPCs and inside those you have public and private subnets. What if you have multiple AWS accounts? VPC peering can allow you to connect backend private subnets, without going across the public internet at all.

As security becomes front & center for more businesses, this can be a huge win.

What’s more, it’s easier now because it is semi-managed by AWS.
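
The basic flow with the AWS CLI looks something like this (the IDs and CIDR blocks are placeholders). One account requests the peering connection, the other accepts it, and then each side adds a route for the other’s network:

# request a peering connection to a VPC in another account
aws ec2 create-vpc-peering-connection --vpc-id vpc-11111111 \
    --peer-vpc-id vpc-22222222 --peer-owner-id 222222222222

# accept it from the other account
aws ec2 accept-vpc-peering-connection \
    --vpc-peering-connection-id pcx-33333333

# route the peer VPC's CIDR block over the peering connection
aws ec2 create-route --route-table-id rtb-44444444 \
    --destination-cidr-block 10.1.0.0/16 \
    --vpc-peering-connection-id pcx-33333333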

Related: Is upgrading Amazon RDS like a sh*t storm that will not end?

5. Why go with a New York based resource?

As more and more small startups put together teams to build their MVP, the offshore market has never been hotter. And there are very talented engineers in faraway places, from Eastern Europe to India and China. And South America too.

At β…“ to ΒΌ the price, why hire US based people? Well, one reason might be compliance: perhaps you have sensitive data that must be handled by US nationals.

Why New York based? Well, there is the value of being face-to-face and working side by side with teams. It also eases the language barrier and communication, and avoids the timezone challenges that can make remote collaboration difficult.

And lastly ownership. With resources that are focused solely on you, and for which you are a big customer, you’re likely to get more personalized focused attention.

Related: Is Amazon Web Services too complex for small dev teams?

6. What are some common antipatterns in the cloud?

Antipatterns are interesting because you see them regularly, and yet they are the *wrong* way to solve a problem: either they’re slower, or there is a better, more reliable way to solve it.

o Using EFS, Amazon’s NFS solution, instead of putting assets in S3.

It might help you avoid rewriting code, but in the end S3 is definitely the way to go.
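
Moving the assets themselves is usually straightforward; a one-time sync plus an application change to read and write S3 is most of the work. A sketch (the bucket and paths are hypothetical):

# copy the shared asset directory up to S3
aws s3 sync /var/www/shared-assets s3://my-assets-bucket/assets/

# spot check what landed
aws s3 ls s3://my-assets-bucket/assets/ --recursive | head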

o Hardcoded IPs in security group rules instead of naming a group.

Yes, I’ve seen this. If you specify each webserver individually, what happens when you autoscale? Answer: the new nodes break! The solution is to put all the webservers in a security group, and then add a rule allowing access from that group. Voila!
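
For example, with a hypothetical database security group and webserver security group, the rule references the group, so newly autoscaled nodes get access automatically:

# allow postgres traffic from anything in the webserver security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-11111111 \
    --protocol tcp --port 5432 \
    --source-group sg-22222222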

o Passing credentials around instead of using AWS instance level roles

Credentials are the bane of applications. You hardcode them and things break later, or they create a vulnerability that you forget about. That’s why AWS invented roles. But did you know a server *itself* can have a role? That means the server, and any software running on it, has permissions to certain APIs within the Amazon universe. You can change a server’s role or its underlying policies while the server is still running. No restart required!
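
Here’s a sketch of attaching a role to a running instance (the role, profile and instance IDs are hypothetical, and the role’s trust policy must allow ec2.amazonaws.com to assume it):

# wrap the role in an instance profile
aws iam create-instance-profile --instance-profile-name app-server-profile
aws iam add-role-to-instance-profile \
    --instance-profile-name app-server-profile --role-name app-server-role

# attach it to a running instance, no restart required
aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 \
    --iam-instance-profile Name=app-server-profile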

Implement CI/CD as a task item

Don’t forget, culture and process are the big hurdles. Installing a tool is easy. Getting everyone using it every day is the challenge!

Reducing and managing docker image bloat

As you change your docker images, you add layers. Even as you delete things, the total image size only grows! Seems counterintuitive. What’s more, when you do all that work with yum or apt-get, those package caches stay lying around in the image. One fix is to install packages and then clean up, all in one command, so it all lands in a single layer. Another is to export and re-import a finished image to flatten it.
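
Two sketches of those approaches (image and container names are hypothetical). The first keeps install and cleanup in a single Dockerfile RUN so the package cache never lands in a layer; the second flattens an existing image by exporting a container and importing it back:

# in the Dockerfile, install and clean up in one RUN:
#   RUN yum install -y gcc make && yum clean all && rm -rf /var/cache/yum

# or flatten a finished image
docker run -d --name flatten-me mycompany/sean-main:latest
docker export flatten-me | docker import - mycompany/sean-main:flat
docker rm -f flatten-me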

ssh-ing into servers or giving devs kubectl

Old habits die hard! I was watching Kelsey Hightower’s keynote at KubeCon. He made some great points about kubernetes. If you give all the devs kubectl, then it’s kind of like allowing everybody to SSH into the boxes. It’s not the way to do it!

Related: Which tech do startups use most?
