Kubernetes: Unexpected 502 errors when rolling out a new ReplicaSet

The other day I was migrating a client's platform from virtual-machine-based deployments managed with Ansible to container-based deployments on Kubernetes.

The whole migration went smoothly until we started noticing Bad Gateway (HTTP status 502) errors during load testing. At first we thought the readiness probes were misconfigured and that the FPM and NGINX pods were being considered ready before they actually were.

Once we reconfigured the readiness probes, we saw a drop in these 502 errors, but they were not eliminated entirely. It took a colleague of mine quite some time and effort to trace the problem: NGINX performs a ‘fast shutdown’ when it receives the SIGTERM signal, and SIGTERM is the signal Docker sends by default to stop a container gracefully.

So what happens? When a pod is terminated during a rollout, NGINX doesn’t shut down gracefully at all: it aborts any in-flight requests, and the proxies calling it return HTTP status code 502.

To avoid this issue, we want NGINX to finish its in-flight requests and shut down gracefully instead of exiting immediately. We accomplished that by adding the following line to the NGINX Dockerfile:

STOPSIGNAL SIGQUIT

This instructs Docker to stop the container with the SIGQUIT signal, which NGINX treats as a request for a graceful shutdown. More information on which signals NGINX handles, and how it handles them, can be found in the NGINX documentation.
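
For context, here is a minimal sketch of where that line might sit in a complete NGINX Dockerfile. The base image tag and the nginx.conf path are assumptions for illustration, not the ones from our actual setup:

```dockerfile
# Assumed base image; use whatever tag your platform standardises on.
FROM nginx:stable

# Hypothetical config file kept next to the Dockerfile.
COPY nginx.conf /etc/nginx/nginx.conf

# Stop the container with SIGQUIT instead of the default SIGTERM,
# so NGINX finishes in-flight requests before exiting.
STOPSIGNAL SIGQUIT
```

Keep in mind that a graceful shutdown still has to finish within the stop timeout. Docker and Kubernetes both send SIGKILL once their grace period expires (terminationGracePeriodSeconds on the pod, 30 seconds by default), so very long-running requests can still be cut off.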