How I resolved some tough docker problems on Amazon ECS


ECS is Amazon’s elastic container service. If you have a dockerized app, this is one way to get it deployed in the cloud. It is basically an Amazon bootleg Kubernetes clone. And not nearly as feature rich! ๐Ÿ™‚

Join 38,000 others and follow Sean Hull on twitter @hullsean.

That said, ECS does work, and it will allow you to get your application going on Amazon. Soon enough EKS (Amazon’s Kubernetes service) will be production, and we’ll all happily switch.

Meantime, if you’re struggling with the weird errors, and when it is silently failing, I have some help here for you. Hopefully these various error cases are ones you’ve run into, and this helps you solve them.

Why is my container in a stopped state?

Containers can fail for a lot of different reasons. The litany of causes I found were:

o port mismatches
o missing links in the task definition
o shortage of resources (see #2 below)

When ecs repeatedly fails, it leaves around stopped containers. These eat up system resources, without much visible feedback. “df -k” or “df -m” doesn’t show you volumes filled up. *BUT* there are logical volumes which can fill.

Do this to see the status:

[[email protected] ~]# lvdisplay
--- Logical volume ---
LV Name docker-pool
VG Name docker
LV UUID aSSS-fEEE-d333-V999-e999-a000-t11111
LV Write Access read/write
LV Creation host, time ip-10-111-40-30, 2018-04-21 18:16:19 +0000
LV Pool metadata docker-pool_tmeta
LV Pool data docker-pool_tdata
LV Status available
# open 3
LV Size 21.73 GiB
Allocated pool data 18.81%
Allocated metadata 6.10%
Current LE 5562
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:2

[[email protected] ~]#

Related: 30 questions to ask a serverless fanboy

Why am I getting this error “Couldn’t run containers – reason=RESOURCE:PORTS”?

I was seeing errors like this. Your first thought might be that I have multiple containers on the same port. But no I didn’t have a port conflict.

What was happening was containers were failing, but in inconsistent ways. So docker had old copies still sitting around.

On the ecs host, use “docker ps -a” to list *ALL* containers. Then use “docker system prune” to cleanup old resources.

INFO[0000] Using ECS task definition TaskDefinition="docker:5"
INFO[0000] Couldn't run containers reason="RESOURCE:PORTS"
INFO[0000] Couldn't run containers reason="RESOURCE:PORTS"
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main
INFO[0000] Starting container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"
INFO[0000] Describe ECS container status container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main desiredStatus=RUNNING lastStatus=PENDING taskDefinition="docker:5"

INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-postgres desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-redis desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"
INFO[0007] Stopped container... container=750f3d42-a0ce-454b-ac38-f42791462b76/sean-main desiredStatus=STOPPED lastStatus=STOPPED taskDefinition="docker:5"

Related: What’s the luckiest thing that’s happened in your career?

3. My container gets killed before fully started

When a service is run, ECS wants to have *all* of the containers running together. Just like when you use docker-compose. If one container fails, ecs-agent may decide to kill the entire service, and restart. So you may see weird things happening in “docker logs” for one container, simply because another failed. What to do?

First look at your task definition, and set “essential = false”. That way if one fails, the other will still run. So you can eliminate the working container as a cause.

Next thing is remember some containers may startup almost instantly, like nginx for example. Because it is a very small footprint, it can start in a second or two. So if *it* depends on another container that is slow, nginx will fail. That’s because in the strange world of docker discovery, that other container doesn’t even exist yet. While nginx references it, it says hey, I don’t see the upstream server you are pointing to.

Solution? Be sure you have a “links” section in your task definition. This tells ecs-agent, that one container depends on another (think of the depends_on flag in docker-compose).

Related: Curve ball interview questions and answers

4. Understanding container ordering

As you are building your ecs manifest aka task definition, you want to run through your docker-compose file carefully. Review the links, essential flags and depends_on settings. Then be sure to mirror those in your ECS task.

When in doubt, reduce the scope of your problem. That is define *only one* container, then start the service. Once that container works, add a second. When you get that working as well, add a third or other container.

This approach allows you to eliminate interconnecting dependencies, and related problems.

Related: Are generalists better at scaling the web?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Also published on Medium.