Kelvandor on Docker: Two Problems, and Fixes

November 21, 2021

While I was discussing running Kelvandor on Docker with Glen, he wondered what sort of performance penalty the setup incurs.

This should be a straightforward question to answer: running the code’s benchmark suite inside and outside the container should give a good idea of the performance hit. There are additional questions that could be answered—such as how different hardware might respond, or if particular flags could improve performance—but that should give us a basic idea.

…unfortunately, Kelvandor doesn’t have a benchmark suite. Why haven’t I created a benchmark suite? It would be pretty useful, not just here but for all sorts of optimization or performance tasks. I really should make one. (This is an example of the iterative improvement in these projects. Every year the engineering behind the AI improves. I may not get around to writing one for Kelvandor before the next game AI project starts in January, but I’m going to try to include one from the start in the new one.)

In the meantime, taking a few manual samples from both containerized and native Kelvandor should give us a ballpark idea.
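The sampling itself is nothing fancy. Something like the following is enough for a ballpark comparison, where <container> and <serialized board> are placeholders, and timing docker exec from the host adds a small amount of overhead from docker exec itself:

time ./kelvandor <serialized board>
time docker exec <container> /opt/kelvandor/kelvandor <serialized board>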

However, in doing so, I encountered a couple issues.

Problem 1: Kelvandor, But No Kelvandor

When taking manual samples of Kelvandor in Docker, the API server that runs the kelvandor binary was complaining that it couldn’t find the binary. Sure enough, when I started a shell in the container itself, I ran into a strange situation:

docker exec -it 1c6 /bin/sh
/opt/kelvandor # ls -l kelvandor
-rwxr-xr-x    1 root     root         43944 Nov 22 00:05 kelvandor
/opt/kelvandor # ./kelvandor
/bin/sh: ./kelvandor: not found

The kelvandor executable clearly exists, but when I try to run it, it is not found.

Turns out, this happens in Alpine Linux when there’s a dynamic linking error. But why would there be a linking error? I build the executable directly in my Dockerfile! See?

[...]
WORKDIR /opt/kelvandor
COPY ["src/", "."]
RUN ["make"]
[...]

…oh. I see. This also explains why it was working when I wrote my last post, but failing now.

The Dockerfile is copying all of src/ into the image. And, like many C projects, Kelvandor’s build artifacts, including the kelvandor binary itself, end up in src/ when you run make with src/ as your current working directory (a typical arrangement; that’s where the Makefile is). So when building the image, Docker may copy an existing kelvandor binary, built against the host system’s libraries, into the image. That Stack Overflow link explains that instead of using glibc (the standard C library used by most Linux distros), “Alpine Linux is based on the musl libc library, which is a minimal implementation and strictly POSIX compliant”.

So when I build the image, Docker copies the kelvandor binary built by my machine, linked against glibc, into the image. It then runs make, but make sees the existing binary and concludes it doesn’t need to be rebuilt. When the API tries to run the binary, it fails, because the image is based on Alpine Linux, and has musl instead of glibc. This situation is unfortunately communicated with a “not found” error, which is a somewhat obtuse description of the actual problem, and may cause the user to question the verifiable nature of reality.
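If you want to see the mismatch directly, one way (assuming you install the file package with apk) is to check which dynamic loader the binary expects and compare it to what Alpine actually ships:

apk add file
file kelvandor        # a glibc build reports an interpreter like /lib64/ld-linux-x86-64.so.2
ls /lib/ld-musl-*     # the loader Alpine provides instead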

The Fix

The first solution that occurred to me was to add a make clean command to the Dockerfile, ahead of the make command. This would delete the build artifacts, including kelvandor, so make would build them fresh. However, this still includes the build artifacts in the layer resulting from the COPY command, and adds another layer for the RUN make clean command.
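In the Dockerfile, that version would look something like this:

[...]
WORKDIR /opt/kelvandor
COPY ["src/", "."]
RUN ["make", "clean"]
RUN ["make"]
[...]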

I could say “make sure to run make clean (on the host) before building a Docker image”, to ensure there are no build artifacts to copy. But that is error-prone, and it also has the undesirable property of requiring seemingly arbitrary changes to the host’s files solely for the purpose of building the image.

It might be tempting to switch the base image from Alpine Linux to something that uses glibc. However, that would go against the intention of the Dockerfile: the plan is to build from source, not to rely on a host-compiled binary. Changing the image would be applying a short-term fix with the potential for more errors down the road. We either need to fix things so the binary is actually built in the container, or change our image-building strategy to be explicitly designed around a host-compiled binary (say, by statically linking glibc when building kelvandor). But the whole point of using Docker with Kelvandor was to make using Kelvandor easier, and requiring the user to have a properly set up build environment in order to build the correct binary is not in line with that goal.

The fix I ended up using was creating a .dockerignore file that tells Docker to not include certain files in the build context:

src/kelvandor
src/*.o
src/*.log
**/__pycache__

This means the kelvandor binary, as well as any object files, log files, and Python caches, won’t be copied into the image (these are all things that might end up in src/). So when Docker runs make, the binary will be built fresh from the source files in the container.

Problem 2: You’re Ignoring My Argument

Secondly, I noticed that the containerized API was ignoring the Iterations request header. This feature lets you set the number of Monte Carlo tree search iterations to run when calculating the optimal move; more iterations result in stronger play.

The UI has a setting for the user to set a specific number of iterations, which will determine the value of the Iterations header in the API request, which in turn is passed as the -i argument to the kelvandor binary. However, when using the containerized API, none of this seemed to have any effect. The API always used the default number of iterations as compiled into the binary.
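Conceptually, the translation from header to argument is simple; a minimal sketch (with made-up names, not the actual Kelvandor API code) looks like this:

def iteration_args(headers):
    # Illustrative only: build the extra arguments for the kelvandor
    # binary from the Iterations request header, if it was supplied.
    iterations = headers.get("Iterations")
    return ["-i", iterations] if iterations else []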

In tracing the problem, I determined everything was functioning properly up to the kelvandor -i <iterations> call. The configured number was being properly interpreted by the API layer and passed to the binary.

Moreover, the binary used the proper number of iterations when I called it manually. So running ./kelvandor -i 20000 <serialized board> resulted in 20,000 iterations, but doing the equivalent from the UI did not. But, like I said, I verified the iterations setting was being passed correctly through the UI and API layers.

So why was it behaving differently?

The Fix

Here’s how I was manually testing the binary:

./kelvandor -i 20000 <serialized board>

Here’s the Python code where the API calls the binary:

p = Popen([KELVANDOR, board] + args, stdout=PIPE, stderr=PIPE)

And here’s an example of how an invocation from that might look:

./kelvandor <serialized board> -i 20000

See the difference?

Remember how above we said that Alpine Linux uses a “strictly POSIX compliant” implementation of libc? In strict POSIX, “the options end when the first non-option argument is encountered”. During my testing, I was putting the -i argument before the (required, non-option) board argument, but the API was putting it after. The difference wasn’t a problem when running the API on my glibc machine, since glibc goes beyond strict POSIX compliance and will return options after non-options. However, when running in the container on Alpine Linux, the strictly POSIX musl library ignores the trailing -i option.
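Python’s getopt module happens to expose both behaviors, which makes for a quick illustration of the difference (this is just a demonstration, not any of the code involved here):

import getopt

argv = ["<serialized board>", "-i", "20000"]

# Strict POSIX behavior (what musl does): parsing stops at the first
# non-option argument, so the trailing -i is never seen.
print(getopt.getopt(argv, "i:"))      # ([], ['<serialized board>', '-i', '20000'])

# GNU behavior (what glibc does by default): arguments are permuted,
# so the trailing -i is still picked up.
print(getopt.gnu_getopt(argv, "i:"))  # ([('-i', '20000')], ['<serialized board>'])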

In this case, the fix is easy. We’ll just change the API invocation to conform to POSIX:

p = Popen([KELVANDOR] + args + [board], stdout=PIPE, stderr=PIPE)

There wasn’t any real benefit to doing it the other way. I probably only wrote it that way because it was marginally more natural, and the possible problem hadn’t occurred to me.

Go Deeper

It is easy to see how both these problems could occur. They emerged from simple operations and seemingly insignificant minutiae.

However, despite their simple sources, both problems led to confusing situations where things that were clearly present were reported as missing or ignored. The fixes for both problems required a degree of understanding of below-the-surface technical details.

I would guess the technical level required to trigger these errors is below the technical level required to understand them. I imagine this is common in our field; it’s easy to get yourself into problems that you don’t understand (at least at first). This shows that it is always important to improve your knowledge beyond the bare requirements for getting things done. You never know when deeper understanding will suddenly be applicable to a current problem (or prevent a problem from emerging).

And, of course, given the complexity of our field, no one is ever done with that!