Brian Goff

April 16, 2021

Running non-native platform images with Docker, what could go wrong?

With many developers getting M1 Macs, more and more day-to-day work is happening on arm64-based machines. This means `docker run ubuntu` on an M1 Mac actually pulls an arm64 image even though you didn't explicitly ask for one. That's because, while pulling the `ubuntu` image, Docker resolves the image name to a manifest list (aka an OCI index), which lists the available images along with the platform each one was built for. Here's an example:
{
  "manifests": [
    {
      "digest": "sha256:5403064f94b617f7975a19ba4d1a1299fd584397f6ee4393d0e16744ed11aab1",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      },
      "size": 943
    },
    {
      "digest": "sha256:582958962eb61a775e17f3ea12dfade3431b9f657cda62a52d17b398284b5d13",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "arm",
        "os": "linux",
        "variant": "v7"
      },
      "size": 943
    },
    {
      "digest": "sha256:5aaaf1c47579707885e18385e3fa920c525c31cc2c32a0885c7c8f10f6d66009",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "arm64",
        "os": "linux",
        "variant": "v8"
      },
      "size": 943
    },
    {
      "digest": "sha256:b30065ff935c7761707eab66d3edc367e5fc1f3cc82c2e4addd69cee3b9e7c1c",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "ppc64le",
        "os": "linux"
      },
      "size": 943
    },
    {
      "digest": "sha256:fb02c202fa8f6097ab0f5ff52179c920a9df0f7d73ee4c82e41407640e1284c1",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "s390x",
        "os": "linux"
      },
      "size": 943
    }
  ],
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "schemaVersion": 2
}

Docker fetches the right digest by filtering the list down to the preferred platform. On M1 Macs that is `platform.os == linux && platform.architecture == arm64`. If you want Docker to select a different platform, you can do that with the `--platform` flag, e.g. `--platform=linux/arm/v7`.
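
The filtering step can be sketched in a few lines of Python. This is only an illustration, not Docker's actual code: `select_manifest` is a made-up helper, it does exact matching only, and the data is a trimmed copy of three entries from the manifest list above.

```python
# Three entries copied from the manifest list above, trimmed to the fields
# that matter for platform selection.
MANIFEST_LIST = {
    "manifests": [
        {
            "digest": "sha256:5403064f94b617f7975a19ba4d1a1299fd584397f6ee4393d0e16744ed11aab1",
            "platform": {"architecture": "amd64", "os": "linux"},
        },
        {
            "digest": "sha256:582958962eb61a775e17f3ea12dfade3431b9f657cda62a52d17b398284b5d13",
            "platform": {"architecture": "arm", "os": "linux", "variant": "v7"},
        },
        {
            "digest": "sha256:5aaaf1c47579707885e18385e3fa920c525c31cc2c32a0885c7c8f10f6d66009",
            "platform": {"architecture": "arm64", "os": "linux", "variant": "v8"},
        },
    ]
}


def select_manifest(manifest_list, os_name, architecture, variant=None):
    """Return the digest of the first entry matching the platform, or None.

    Hypothetical helper: exact match on os and architecture. The variant is
    only compared when the caller asks for a specific one.
    """
    for entry in manifest_list["manifests"]:
        platform = entry["platform"]
        if platform["os"] != os_name:
            continue
        if platform["architecture"] != architecture:
            continue
        if variant is not None and platform.get("variant") != variant:
            continue
        return entry["digest"]
    return None


# The default request on an M1 Mac: linux/arm64.
print(select_manifest(MANIFEST_LIST, "linux", "arm64"))
# The equivalent of --platform=linux/arm/v7.
print(select_manifest(MANIFEST_LIST, "linux", "arm", "v7"))
```

Returning `None` for "nothing matched" keeps the sketch short; a real client has to decide whether to fall back to a compatible platform or fail.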

Not all manifest lists are guaranteed to list the platform that Docker detects. For that case Docker actually does some fuzzy matching for certain things (e.g. many arm64 machines can run armv7 images natively, so Docker will try to match that).
Not all images are guaranteed to have a manifest list. For instance, an image made from `docker push foo` will not have a manifest list. Manifest lists are created and pushed separately, often with a separate tool, by the image maintainer. In this case Docker does exactly what you asked it to do: it runs the image, or errors out if you passed `--platform` and the image does not match that platform.
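
Both cases can be sketched together. This is a rough illustration under stated assumptions: `resolve` is a hypothetical helper, and the ordered preference list is supplied by the caller rather than derived the way Docker and containerd actually derive it, which is more involved than this.

```python
# Media types that identify a manifest list / OCI index.
MANIFEST_LIST_TYPES = {
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.index.v1+json",
}


def resolve(document, acceptable):
    """Pick a manifest digest, or return None to mean "use the document itself".

    `acceptable` is an ordered list of (os, architecture, variant) tuples,
    most preferred first. A variant of None matches any variant.
    Raises LookupError when a manifest list has no acceptable entry.
    """
    if document.get("mediaType") not in MANIFEST_LIST_TYPES:
        # A plain image manifest: there is nothing to choose from.
        return None
    for os_name, arch, variant in acceptable:
        for entry in document["manifests"]:
            platform = entry["platform"]
            if (platform["os"], platform["architecture"]) != (os_name, arch):
                continue
            if variant is None or platform.get("variant") == variant:
                return entry["digest"]
    raise LookupError("no matching manifest for the requested platforms")


# A manifest list that only offers a 32-bit arm image.
index = {
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "digest": "sha256:582958962eb61a775e17f3ea12dfade3431b9f657cda62a52d17b398284b5d13",
            "platform": {"architecture": "arm", "os": "linux", "variant": "v7"},
        }
    ],
}

# An arm64 host that also accepts armv7 images, in order of preference.
preferences = [("linux", "arm64", None), ("linux", "arm", "v7")]
print(resolve(index, preferences))
```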

But... on these M1 Macs, most images "just work". You don't have to do anything special, push arm64 images, etc. So what is going on? How can it possibly run an image that only has an amd64 variant, for instance?

First, there is nothing special about the M1 or anything macOS provides here. There is also technically nothing special about what Docker is doing. Docker isn't doing anything except running the image.
What's happening is that Docker for Mac enables a Linux kernel feature called "binfmt_misc", which lets you register a userspace application as the handler for a particular binary format. In the Linux VM that Docker for Mac uses, qemu is registered as the handler for a wide array of platforms.
So when Docker tries to run that amd64-only image on an M1, which is arm64, qemu actually picks it up and runs it.
It's a pretty cool feature that works really well with containers.
You don't need Docker for Mac to take advantage of it. Docker even has a container image you can run (needed once per boot) that registers everything for you:
docker run --rm --privileged docker/binfmt:a7996909642ee92942dcd6cff44b9b95f08dad64
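
To get a feel for what "a particular binary format" means here: registrations of this kind typically match on the first bytes of the file, which for Linux binaries is the ELF header, and that header records the target architecture. The sketch below reads that field. It is a minimal illustration, not what the kernel or qemu runs; `elf_machine` and `elf_machine_of` are made-up names, and the table covers only a handful of `e_machine` values from the ELF specification.

```python
import struct

# A few e_machine values from the ELF specification.
ELF_MACHINES = {
    3: "386",
    21: "ppc64",
    22: "s390",
    40: "arm",
    62: "amd64",
    183: "arm64",
    243: "riscv",
}


def elf_machine(header):
    """Return an architecture name given the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # header[5] is EI_DATA: 1 means little endian, 2 means big endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, "unknown (%d)" % machine)


def elf_machine_of(path):
    """Convenience wrapper: inspect a binary on disk."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))


# A hand-built header for a 64-bit little-endian amd64 executable:
# magic, class/data/version/ABI, padding, then e_type and e_machine.
fake_header = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 62)
print(elf_machine(fake_header))
```

Running `elf_machine_of` against a binary copied out of an image is a quick way to check which architecture that image was really built for.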


I'm writing this mainly so that people are aware of what's actually happening on their system and understand that:
1. Yes, this is really cool and really useful. I use it every day for dealing with arm containers on my amd64 system.
2. Beware of the dragons.

Not to scare anyone off, of course. Again, this is very useful, but running something in qemu is not the same as running it for real. Some things aren't even implemented in qemu (such as ptrace), which can break your workflow if you aren't aware of it.
Also, because it is running in qemu, you can set some qemu-specific debug flags if you run into trouble.
Any of the environment variables listed below can be passed to your container (e.g. `docker run -e QEMU_STRACE=1 ubuntu echo hello`). Even though qemu is not included in the container image, the kernel executes qemu, via binfmt_misc, in the context of the container, so qemu sees the container's environment.
-g port              QEMU_GDB          wait gdb connection to 'port'
-L path              QEMU_LD_PREFIX    set the elf interpreter prefix to 'path'
-s size              QEMU_STACK_SIZE   set the stack size to 'size' bytes
-cpu model           QEMU_CPU          select CPU (-cpu help for list)
-E var=value         QEMU_SET_ENV      sets targets environment variable (see below)
-U var               QEMU_UNSET_ENV    unsets targets environment variable (see below)
-0 argv0             QEMU_ARGV0        forces target process argv[0] to be 'argv0'
-r uname             QEMU_UNAME        set qemu uname release string to 'uname'
-B address           QEMU_GUEST_BASE   set guest_base address to 'address'
-R size              QEMU_RESERVED_VA  reserve 'size' bytes for guest virtual address space
-d item[,...]        QEMU_LOG          enable logging of specified items (use '-d help' for a list of items)
-dfilter range[,...] QEMU_DFILTER      filter logging based on address range
-D logfile           QEMU_LOG_FILENAME write logs to 'logfile' (default stderr)
-p pagesize          QEMU_PAGESIZE     set the host page size to 'pagesize'
-singlestep          QEMU_SINGLESTEP   run in singlestep mode
-strace              QEMU_STRACE       log system calls
-seed                QEMU_RAND_SEED    Seed for pseudo-random number generator
-trace               QEMU_TRACE        [[enable=]<pattern>][,events=<file>][,file=<file>]
-version             QEMU_VERSION      display version information and exit
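
If you flip these flags often, a small wrapper that assembles the command line can save some typing. The sketch below is just one way to do it: `docker_run_with_qemu_env` is a hypothetical helper that only builds the argument list and does not run anything.

```python
import shlex


def docker_run_with_qemu_env(image, command, qemu_env, platform=None):
    """Build a `docker run` argument list that passes QEMU_* variables.

    Hypothetical helper: it only assembles the arguments. Hand the result
    to subprocess.run() to actually start the container.
    """
    argv = ["docker", "run", "--rm"]
    if platform:
        argv.append("--platform=" + platform)
    for name, value in sorted(qemu_env.items()):
        if not name.startswith("QEMU_"):
            raise ValueError("not a qemu variable: " + name)
        argv += ["-e", "%s=%s" % (name, value)]
    argv.append(image)
    argv += list(command)
    return argv


argv = docker_run_with_qemu_env(
    "ubuntu", ["echo", "hello"], {"QEMU_STRACE": "1"}, platform="linux/amd64"
)
print(shlex.join(argv))
# prints: docker run --rm --platform=linux/amd64 -e QEMU_STRACE=1 ubuntu echo hello
```

The variables only have an effect when the container's binaries are actually being run by qemu, which is why the example asks for a platform explicitly.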

The normal flow of a container startup is docker -> containerd -> containerd-shim -> runc -> yourProcess.
When runc executes "yourProcess" the kernel decides, due to binfmt_misc, "oh I need to use qemu to run this".

This stuff gets really fun when you mix it with `docker build --platform=<desired target>` and `FROM --platform=$TARGETPLATFORM myFromImage`.
You should totally play around with it.

Hope this helps!