Last week, Amazon made their Elastic Container Service generally available and also released a web UI to make it easier to create tasks. This service was exciting for our personalization team as we wanted to leverage ECS to simplify deployment of our jobs and have better control over failures.
Getting started with ECS is easy but requires your full attention because there are a lot of important details (did you remember to set the correct IAM roles for your container instances?). If you haven’t done so already, make sure you go through the Getting Started documentation.
Once you’re ready to run your task, what do you do if something fails or doesn’t go as planned? Here are some of our experiences debugging ECS tasks.
Start with a simple application
In our case we created a Scala main method with very few dependencies. A sleep method is sufficient to simulate a job running and should provide enough time to see our container running.
Make sure your cluster has at least one container instance
Your ECS cluster requires at least one container instance to run your container. You can check the web console to make sure there is at least one running.
If you don’t see your instances in the console, make sure you’ve followed the Container Instance documentation. In particular, check that you launched an instance with the correct ECS AMI, attached the correct IAM role, and pointed the instance at your cluster.
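One quick way to check the last of these points is sketched below. It assumes the instance runs the ECS-optimized AMI, where the agent reads its settings from /etc/ecs/ecs.config; the helper name ecs_config_cluster is made up for this example. If no ECS_CLUSTER line is present, the agent registers with the cluster named "default".

```shell
# Print the cluster name an instance is configured to join.
# Assumes a KEY=value agent config file (by default /etc/ecs/ecs.config
# on the ECS-optimized AMI); the function name is hypothetical.
ecs_config_cluster() {
  config_file="${1:-/etc/ecs/ecs.config}"
  # Keep only ECS_CLUSTER= lines, strip the key, and use the last one
  sed -n 's/^ECS_CLUSTER=//p' "$config_file" | tail -n 1
}

# Usage on the container instance:
#   ecs_config_cluster
# From a workstation with the AWS CLI configured, the instances registered
# to a cluster can be listed with:
#   aws ecs list-container-instances --cluster <my-cluster>
```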
Inspect the ECS logs
Once your container instance is attached to your cluster, an ecs-agent will be running on the machine. (Note: the ecs-agent is open source and hosted on GitHub, so you may be able to find help on its ‘issues’ page if needed.) The ecs-agent emits logs that can help you debug. SSH to the container instance and inspect the ecs-agent log.
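A minimal sketch of that step, assuming the agent writes its logs under /var/log/ecs/ on the container instance (the usual location on the ECS-optimized AMI; adjust the path if your setup differs). agent_log_grep is a hypothetical helper that keeps only the lines from one module, which cuts down the noise in a debug-level log.

```shell
# Filter ecs-agent log lines (read from stdin) down to a single module.
# Assumes the key=value line format the agent emits, with a module=<Name> field.
agent_log_grep() {
  module="$1"
  # Match module=<Name> followed by a space or the end of the line
  grep -E "module=${module}( |\$)"
}

# Usage on the container instance (log file names may carry a date suffix):
#   cat /var/log/ecs/ecs-agent.log* | agent_log_grep TaskEngine
```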
For example, when inspecting this log it can be useful to know that the agent has pulled your docker image:
t=2015-04-14T15:46:53+0000 lvl=dbug msg="Pulling image" module=TaskEngine image=<my-registry.my-domain>.com/<my-image>:latest status="Pulling dependent layers"
Make sure the container instance installed your docker image
You can check that the ecs-agent successfully downloaded your docker image. SSH to your container instance and run docker images. Your docker image should appear in the list.
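When the list is long, the check can be scripted. The sketch below assumes the default docker images column layout (repository first, tag second); has_image is a hypothetical helper name.

```shell
# Succeed if `docker images` output (read from stdin) lists repository:tag.
# Relies on the default column layout: REPOSITORY first, TAG second.
has_image() {
  repo="$1"
  tag="${2:-latest}"
  awk -v repo="$repo" -v tag="$tag" \
    '$1 == repo && $2 == tag { found = 1 } END { exit !found }'
}

# Usage on the container instance:
#   docker images | has_image <my-registry>/<my-image> latest && echo "image present"
```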
If you don’t see your image, check the ecs-agent logs and confirm that your container instance can reach your docker registry by curling the registry and listing the available tags.
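A sketch of the registry check, assuming a private registry reachable over HTTPS that speaks the Docker Registry HTTP API. The helper name registry_tags_url is hypothetical; it defaults to the v2 tag-listing endpoint and falls back to the older v1 path when asked.

```shell
# Build the "list tags" URL for an image on a private Docker registry.
# Assumptions: HTTPS, and the Registry HTTP API (v2 by default; pass "v1"
# as the third argument for an older v1 registry).
registry_tags_url() {
  registry="$1"
  image="$2"
  api="${3:-v2}"
  if [ "$api" = "v1" ]; then
    echo "https://${registry}/v1/repositories/${image}/tags"
  else
    echo "https://${registry}/v2/${image}/tags/list"
  fi
}

# Usage on the container instance (add credentials if your registry needs them):
#   curl -s "$(registry_tags_url <my-registry> <my-image>)"
```

If the curl call hangs or is refused from the container instance but works from your own machine, the problem is more likely network access than the image itself.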
Manually start your docker image
Checking to see if your image can be run from the command line is a great way to make sure the image can be run on the container instance. This also gives you an opportunity to inspect the application logs. For example,
docker run -it docker-registry.gilt.com/cerebro/svc-cerebro-job:latest bin/cerebro_ngram_job.sh production
Check for failed container processes
If your container runs from the command line but is still not started automatically through ECS Run Task, it’s likely that something is wrong in your ECS Task Definition. Check for failed container processes and inspect their container configuration.
To check for failed containers run docker ps -a.
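To narrow that listing down to the containers that actually failed, something like the following sketch can help. It assumes the STATUS column reads like "Exited (127) 2 minutes ago"; failed_containers is a hypothetical helper name.

```shell
# Read `docker ps -a` output on stdin and print "<containerId> <exitCode>"
# for every container that exited with a non-zero code.
# Relies on the STATUS column reading like "Exited (127) 2 minutes ago".
failed_containers() {
  awk 'match($0, /Exited \([0-9]+\)/) {
    # "Exited (" is 8 characters long; the trailing ")" is dropped as well
    code = substr($0, RSTART + 8, RLENGTH - 9)
    if (code + 0 != 0) print $1, code
  }'
}

# Usage on the container instance:
#   docker ps -a | failed_containers
# Docker can also pre-filter stopped containers itself:
#   docker ps -a --filter status=exited
```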
Once you have the Container ID you can inspect the docker configuration. If you see an exit code other than 0 there may be an issue with your container configuration.
docker inspect <containerId>
For example, I noticed an exit code of 127, which indicates a bad command. This led me to fix a bug in our task definition: the CMD field requires commands to be passed in JSON array format.
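As a rough guide to reading exit codes, the sketch below maps a code to its usual shell-convention meaning. These are conventions rather than guarantees, since an application is free to return any code it likes, and describe_exit_code is a hypothetical helper name.

```shell
# Map a container exit code to its usual shell-convention meaning.
# A rough guide only: applications may use these codes differently.
describe_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -eq 126 ]; then
    echo "command found but not executable"
  elif [ "$code" -eq 127 ]; then
    echo "command not found"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 conventionally mean "terminated by signal (code - 128)"
    echo "killed by signal $((code - 128))"
  else
    echo "application error"
  fi
}

# Usage on the container instance:
#   describe_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <containerId>)"
```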
Finally, you can inspect the application log of the container.
docker logs <containerId>
For reference, here is a working task definition:
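The sketch below illustrates the general shape of a task definition; it is not our production configuration. The family name, container name, image, script path, and resource numbers are all placeholders. The detail to note is that command is a JSON array with one element per argument.

```shell
# Write a minimal, hypothetical task definition to a file.
# Every name and number here is a placeholder; "command" is a JSON array
# with one element per argument.
cat > taskdef.json <<'EOF'
{
  "family": "my-job",
  "containerDefinitions": [
    {
      "name": "my-job",
      "image": "<my-registry>/<my-image>:latest",
      "cpu": 256,
      "memory": 512,
      "essential": true,
      "command": ["bin/my_job.sh", "production"]
    }
  ]
}
EOF

# Register and run it with the AWS CLI (needs configured credentials):
#   aws ecs register-task-definition --cli-input-json file://taskdef.json
#   aws ecs run-task --cluster <my-cluster> --task-definition my-job
```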