The value of staging environments

Today I learned why software development needs a staging environment that is as close to the production environment as possible. The fact that it took me over 20 years to learn this may be an argument that the need is not always super high, but still.

The system here is EMG, an SMS gateway. One of its (many) features is the ability to pass messages to an external program or script. For both performance and security reasons, this is done by first starting a new child process, and then having the main program send program paths and messages to this child process, which in turn uses fork() and exec() as usual. With copy-on-write and whatnot, this may not be as needed as it was 20 years ago, but still. It works and it's fast, so there has been little need to change it.
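For readers who have not seen the pattern, here is a minimal sketch of the fork() and exec() step the helper process performs. It is not EMG's actual code: the function name run_program() is made up for illustration, and passing the message as a single command-line argument is an assumption.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the program at `path` with `message` as its
 * only argument. Returns the program's exit status, or -1 if it could
 * not be started or did not exit normally. */
int run_program(const char *path, const char *message)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* In the new process: replace it with the external program. */
        execl(path, path, message, (char *)NULL);
        /* Only reached if execl() failed. */
        perror("execl");
        _exit(127);
    }
    /* In the helper: wait for the external program to finish. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The point of keeping this in a small, long-lived helper is that the fork() happens in a process with very little memory and state, not in the main gateway process.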

A few years ago I started using Docker for development, tests, and production. Being able to work locally on a Mac, while still running production on Linux, has been fantastic. However, those "run a program" things have never worked in production. In the development environment and in all tests, everything works fine. Valgrind has no complaints about anything. But in production, it just doesn't work. This morning I got tired of not understanding why, and went bug hunting.

The main process has been logging that it cannot connect to the child process. This connection uses Unix domain sockets, which were configured to live in the /tmp directory. There is really no other universally good place to put them without having to discuss permission bits, so I first made a small change, making it use internet sockets instead. This worked just fine in both development and the tests, but the logged error in production was still the same. Ok, maybe there is something special that needs to be done in Docker environments? Nope, I found nothing relevant. Maybe binding to 0.0.0.0 works better than 127.0.0.1? Nope. Endian errors? Nope.

I tried running some commands within the Docker container where EMG was running. Then I suddenly saw that the child process, which is started at the very beginning, was now a zombie. As Valgrind was happy, it must have exited voluntarily for some reason. So, I started looking for places to add more logging, in order to track down where. That's when I found it.

In its main loop, the child process checks if the parent is still alive. If not, it exits too. The reason for this behavior is that child processes whose parent dies get adopted by the init process, which has process id 1. Again, 20 years later, and no longer having to care about other Unix variants than Linux, there may be better ways to do this. Maybe fork() has a new "kill the child if the parent dies" flag? I dunno. Still, it works. So, this code is basically "if (getppid() == 1) exit(0);".
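Written out as a function, the check looks roughly like this. The function name is made up for illustration, and the parent pid is passed in as a parameter so the logic can be tried with different values.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical name. The original heuristic: a parent pid of 1 is taken
 * to mean "my real parent died and init adopted me". */
int parent_looks_dead_naive(pid_t current_ppid)
{
    return current_ppid == 1;
}

/* Intended use in the child's main loop (sketch):
 *
 *     if (parent_looks_dead_naive(getppid()))
 *         exit(0);
 */
```

The check encodes exactly one assumption: that a parent pid of 1 can only mean adoption by init.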

What makes the production environment special is that there, the CMD in the Dockerfile runs EMG directly. Docker containers live in their own space, so the process ids start over from 1. In other words, the parent EMG process had process id 1. In the development environment, pid 1 is the command shell, and in the test environment it is a shell script which starts a database server and does some other things before finally starting one EMG process after another for the tests. There, the EMG process ids were therefore always higher than 1, so the if statement above worked as intended. But in production, getppid() returned 1 from the very start, which caused the child to exit right away.

A better check is to let the main process get its process id before starting the child. The child then checks if its current parent process id differs from this initial value. If so, it exits. In the production container, this test is of course not needed at all, but at least it does not give any false positives any longer. When the process started by CMD exits, Docker kills the entire container, including the child process. Does this matter? Well, not really. The full test would most correctly be "(original_ppid != 1) && (current_ppid != original_ppid)". If the original parent process id is not 1, as in development and test, the first part is always true, so the only thing that remains is the second part. In production, where it is 1 and the first part is false, the exit never needs to happen anyway, as the child gets automatically killed together with the container. So, "(current_ppid != original_ppid)" behaves as it should in all environments, and there are no hard coded values.
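As a minimal sketch, the new check can be written like this. The function name is made up for illustration. The names original_ppid and current_ppid follow the text above, and how the original value reaches the child (a variable set before fork(), an argument, or something else) is left open.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical name. True if the child's current parent is no longer
 * the process that originally started it. No hard coded pid values. */
int parent_changed(pid_t original_ppid, pid_t current_ppid)
{
    return current_ppid != original_ppid;
}

/* Intended use (sketch):
 *
 *     pid_t original_ppid = getpid();   // in the main process, before fork()
 *     ...
 *     // in the child's main loop:
 *     if (parent_changed(original_ppid, getppid()))
 *         exit(0);
 */
```

With an original parent pid of 1 and a current parent pid of 1, as in the production container, this returns false and the child keeps running. With an original parent pid of, say, 4242 and a current parent pid of 1, it returns true, just like the old check did.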

Hence the need for a staging environment, which would use the same Docker configuration as production. This would have saved me from having to restart the production system while trying to figure out what was happening.

/Daniel