Container images are meant to be small and lightweight, since they contain only the required runtime dependencies of an application.
Unfortunately, in the real world this often looks different: it is not uncommon to see container images that are hundreds of megabytes (or even gigabytes!) in size.
One of the issues these large images cause is long delays when starting a container (or pod, in the Kubernetes world) based on such an image.
Traditionally, the image first needs to be fully downloaded and unpacked before the container runtime can start the container.
The Stargz Snapshotter project aims to change this: the special eStargz format for container images (which is fully compliant with the OCI image format) allows lazy pulling of images: on start-up, only the files strictly necessary for running the container’s entrypoint are downloaded and extracted; the rest is fetched on the fly.
For all the details about eStargz, check the introductory blog post.
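If you want to experiment with your own images, the Stargz Snapshotter ecosystem ships converters; here is a minimal sketch using nerdctl (the image names are placeholders) — the examples later in this post simply use pre-converted images:
# convert an existing image to the eStargz format and push it to a registry
nerdctl image convert --estargz --oci ghcr.io/my-org/myapp:1.0 ghcr.io/my-org/myapp:1.0-esgz
nerdctl push ghcr.io/my-org/myapp:1.0-esgz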
From my personal experience, large container images are especially common amongst beginners in the container ecosystem (due to being unfamiliar with all the technical implications) and corporate users (for example because they are required to use specific container images for licensing reasons).
Both of these groups are the target audience for Red Hat’s OpenShift Container Platform.
In this post I want to share my investigations into enabling the Stargz Snapshotter on OpenShift - specifically OKD, the community version of Red Hat’s commercial OpenShift Container Platform (OCP).
In the first part of this post, I will showcase a manual prototype that modifies one of the cluster nodes in place.
The second part of this post covers how to package these manual steps so they can be automatically applied by OpenShift’s node configuration facilities.
Prototype
Let’s start with a quick local prototype on one of the cluster nodes.
Red Hat’s OpenShift uses RHCOS (Red Hat CoreOS) images for all nodes in the cluster.
OKD uses the open-source equivalent Fedora CoreOS (FCOS).
Both of these Linux distributions are special because they come with a read-only root filesystem (except for /etc and /var): all (permanent) changes must be made through CoreOS layering.
Since that is a bit too much effort for a prototype, we’ll just apply the changes locally on the node.
All the following commands will be run on the cluster node, so grab the SSH key for your cluster and run ssh -i cluster-ssh-key core@<NODE-IP>, or grab your admin kubeconfig file and run oc debug node/<NODE-NAME> followed by chroot /host.
I’m using OKD 4.11.0-0.okd-2022-11-05-030711 with Fedora CoreOS 36 and CRI-O 1.24.3 in this example.
The README in the Stargz Snapshotter repository has some setup instructions for containerd, but OpenShift uses CRI-O instead.
This means we cannot use the Stargz Snapshotter plugin, but need to use the Stargz Store instead (despite the different name, both plugins come from the same Git repository):
Stargz Snapshotter is a plugin for containerd, which enables it to perform lazy pulling of eStargz. This is an implementation of remote snapshotter plugin and provides remotely-mounted eStargz layers to containerd. Communication between containerd and Stargz Snapshotter is done with gRPC over unix socket. For more details about Stargz Snapshotter and the relationship with containerd, please refer to the doc.
If you are using CRI-O/Podman, you can’t use Stargz Snapshotter for enabling lazy pulling of eStargz. Instead, use Stargz Store plugin. This is an implementation of additional layer store plugin of CRI-O/Podman. Stargz Store provides remotely-mounted eStargz layers to CRI-O/Podman.
The following instructions are loosely based on “Install Stargz Store for CRI-O/Podman with systemd”.
First, let’s grab the latest release and download the binary onto the node:
VERSION=v0.13.0
# download archive
curl -O -L https://github.com/containerd/stargz-snapshotter/releases/download/${VERSION}/stargz-snapshotter-${VERSION}-linux-amd64.tar.gz
# verify checksum
SHA256=4f3133a225c424a3dd075029a50efc44d28033099aa27ddf22e48fd2764b5301
echo "${SHA256} stargz-snapshotter-${VERSION}-linux-amd64.tar.gz" | sha256sum -c
# extract `stargz-store` binary from the archive
tar -C /usr/local/bin -xvf stargz-snapshotter-${VERSION}-linux-amd64.tar.gz stargz-store
# validate the installation
/usr/local/bin/stargz-store -h
# Usage of /usr/local/bin/stargz-store:
#   -config string
#         path to the configuration file (default "/etc/stargz-store/config.toml")
#   -log-level string
#         set the logging level [trace, debug, info, warn, error, fatal, panic] (default "info")
#   -root string
#         path to the root directory for this snapshotter (default "/var/lib/stargz-store")
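The Stargz Store serves its layer store through a FUSE mount, so it is worth making sure the fuse kernel module is loaded before starting the daemon (on Fedora CoreOS it usually is; the MachineConfig in the second part runs modprobe fuse for exactly this reason):
# load the fuse kernel module (a no-op if it is already loaded or built in)
modprobe fuse
lsmod | grep fuse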
Next, we need to modify the CRI-O configuration such that it uses the Stargz Store plugin for fetching and unpacking images:
# make a backup of the original version (deployed by the machine-config-operator)
cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
# config based on https://github.com/containerd/stargz-snapshotter/blob/aaa46a75dd97e401025f82630c9d3d4e41c9f670/script/config-cri-o/etc/containers/storage.conf
cat > /etc/containers/storage.conf <<EOF
[storage]
driver = "overlay"
graphroot = "/var/lib/containers/storage"
runroot = "/run/containers/storage"
[storage.options]
additionallayerstores = ["/var/lib/stargz-store/store:ref"]
EOF
The final step is setting up a systemd unit that will run the “Stargz Store” daemon.
# systemd unit based on https://github.com/containerd/stargz-snapshotter/blob/main/script/config-cri-o/etc/systemd/system/stargz-store.service
cat > /etc/systemd/system/stargz-store.service <<EOF
[Unit]
Description=Stargz Store plugin for CRI-O
After=network.target
Before=crio.service
[Service]
Type=notify
Environment=HOME=/root
ExecStart=/usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
ExecStopPost=umount /var/lib/stargz-store/store
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
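Note that the unit points at /etc/stargz-store/config.toml, which we never created; judging from the status output below, the daemon starts fine without it and falls back to its defaults. If you prefer the referenced path to exist, an empty file is enough:
# optional: create the (empty) configuration file referenced by the systemd unit
mkdir -p /etc/stargz-store
touch /etc/stargz-store/config.toml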
Now all that’s left to do is restarting the involved services and checking the log output:
systemctl daemon-reload
systemctl restart stargz-store crio
systemctl status stargz-store crio --output=cat
● stargz-store.service - Stargz Store plugin for CRI-O
Loaded: loaded (/etc/systemd/system/stargz-store.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2022-11-28 12:35:41 UTC; 2s ago
Main PID: 194953 (stargz-store)
Tasks: 9 (limit: 8601)
Memory: 7.6M
CPU: 79ms
CGroup: /system.slice/stargz-store.service
└─ 194953 /usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
Starting stargz-store.service - Stargz Store plugin for CRI-O...
{"level":"warning","msg":"content verification is not supported; switching to non-verification mode","time":"2022-11-28T12:35:41.543247929Z"}
{"level":"debug","msg":"SdNotifyReady notified=true, err=\u003cnil\u003e","time":"2022-11-28T12:35:41.580849701Z"}
Started stargz-store.service - Stargz Store plugin for CRI-O.
● crio.service - Container Runtime Interface for OCI (CRI-O)
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/crio.service.d
└─10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
Active: active (running) since Mon 2022-11-28 12:35:43 UTC; 52ms ago
Docs: https://github.com/cri-o/cri-o
Main PID: 195096 (crio)
Tasks: 10
Memory: 22.7M
CPU: 821ms
CGroup: /system.slice/crio.service
└─ 195096 /usr/bin/crio
Got pod network &{Name:ingress-canary-fl7g5 Namespace:openshift-ingress-canary ID:41c9ecbda2cca10a7b18a35c26bce>
Checking pod openshift-ingress-canary_ingress-canary-fl7g5 for CNI network multus-cni-network (type=multus)
Got pod network &{Name:fluentd-k8k2d Namespace:openshift-logging ID:9d1a30acf6abdfc05d0a98edbb66771795ea8c2b722>
Checking pod openshift-logging_fluentd-k8k2d for CNI network multus-cni-network (type=multus)
Got pod network &{Name:dns-default-96bsf Namespace:openshift-dns ID:a18d86d30eb4f23d71b0a7dc5988c5edfcaf40780fa>
Checking pod openshift-dns_dns-default-96bsf for CNI network multus-cni-network (type=multus)
Got pod network &{Name:network-metrics-daemon-bk7s2 Namespace:openshift-multus ID:8cbf5cf6f092a6b58901f5d500105>
Checking pod openshift-multus_network-metrics-daemon-bk7s2 for CNI network multus-cni-network (type=multus)
Serving metrics on :9537 via HTTP
Started crio.service - Container Runtime Interface for OCI (CRI-O).
Both services should display active (running).
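As an additional sanity check, the layer store should now be backed by a FUSE mount on the path we passed to stargz-store:
# the store directory should show up as a FUSE mount
mount | grep /var/lib/stargz-store/store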
Now it is time for our first unscientific test: pulling an eStargz-optimized image and comparing the pull time to the “regular” version.
In the following examples I’ll be using images from the list of pre-converted images.
$ crictl rmi --prune # make sure we don't already have any blobs locally
$ time crictl pull ghcr.io/stargz-containers/python:3.10-org
Image is up to date for ghcr.io/stargz-containers/python@sha256:b1c16e981e9d711ed60f56ab6227687b92e8671744d542dbdca80be9be7a875c
real 0m22.805s
user 0m0.039s
sys 0m0.029s
$ crictl rmi --prune
Deleted: ghcr.io/stargz-containers/python:3.10-org
Deleted: docker.io/library/busybox:latest
$ time crictl pull ghcr.io/stargz-containers/python:3.10-esgz
Image is up to date for ghcr.io/stargz-containers/python@sha256:167721f6ae9e2609293f122e3fd14df35e39960ac0cf530b43d4aded77d08783
real 0m8.364s
user 0m0.032s
sys 0m0.028s
Less than half the time for pulling the image!
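If you want to double-check that the faster pull really went through the Stargz Store, its debug logs should show the eStargz layers being resolved and prefetched:
# inspect the most recent stargz-store log messages
journalctl -u stargz-store -n 20 --output=cat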
Let’s see if we can also observe a difference when creating a Kubernetes pod.
Note that we need to specify spec.nodeName to ensure the pod gets scheduled on the node we just prepared.
NODE_NAME=<NODE-NAME>
oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-estargz
spec:
  containers:
    - image: ghcr.io/stargz-containers/wordpress:5.9.2-esgz
      name: test
  nodeName: ${NODE_NAME}
---
apiVersion: v1
kind: Pod
metadata:
  name: test-regular
spec:
  containers:
    - image: ghcr.io/stargz-containers/wordpress:5.9.2-org
      name: test
  nodeName: ${NODE_NAME}
EOF
oc get pods -o wide --watch
NAME           READY   STATUS              RESTARTS   AGE   IP            NODE
test-estargz   0/1     ContainerCreating   0          1s    <none>        standard-zjdmw
test-regular   0/1     ContainerCreating   0          1s    <none>        standard-zjdmw
test-estargz   0/1     ContainerCreating   0          3s    <none>        standard-zjdmw
test-regular   0/1     ContainerCreating   0          2s    <none>        standard-zjdmw
test-estargz   1/1     Running             0          8s    10.76.14.21   standard-zjdmw
test-regular   1/1     Running             0          19s   10.76.14.22   standard-zjdmw
From this output we can see that both containers were scheduled on the prepared node at the same time (1s after creation).
The container using the eStargz-formatted image started after just 8 seconds, whereas the regular image took 19 seconds.
We can therefore conclude that the prototype is working!
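The kubelet also records the pull duration in the pod events, which gives a second data point without having to stopwatch the oc get pods output:
# the "Successfully pulled image ... in <duration>" events contain the pull times
oc describe pod test-estargz | grep -i pulled
oc describe pod test-regular | grep -i pulled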
The Zen of OpenShift
OpenShift is an operator-based platform: every configuration and deployment change should be described declaratively in the Kubernetes API and implemented by operators (a.k.a. controllers).
This is completely at odds with us going onto a node and monkey-patching config files and system services!
Let’s fix that now by putting those modifications into OpenShift’s node configuration facilities.
In principle, we should use the ContainerRuntimeConfig API for managing the configuration of CRI-O and its subcomponents.
Unfortunately, at the moment this API supports just a handful of fields, whereas /etc/containers/storage.conf is completely hardcoded in the machine-config-operator.
Similarly, the underlying library currently does not support drop-in configurations such as /etc/containers/conf.d/my-config-override.conf.
Therefore, we’ll need to overwrite this file with a custom MachineConfig:
# stargz-worker-machineconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-stargz
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    # https://coreos.github.io/ignition/examples/#create-files-on-the-root-filesystem
    storage:
      files:
        - path: "/etc/containers/storage.conf"
          mode: 420 # corresponds to 644 in octal
          overwrite: true
          # Base64 encoded version of storage.conf in data URL scheme
          # https://www.rfc-editor.org/rfc/rfc2397
          contents:
            source: data:text/plain;charset=utf-8;base64,W3N0b3JhZ2VdCmRyaXZlciA9ICJvdmVybGF5IgpncmFwaHJvb3QgPSAiL3Zhci9saWIvY29udGFpbmVycy9zdG9yYWdlIgpydW5yb290ID0gIi9ydW4vY29udGFpbmVycy9zdG9yYWdlIgoKW3N0b3JhZ2Uub3B0aW9uc10KYWRkaXRpb25hbGxheWVyc3RvcmVzID0gWyIvdmFyL2xpYi9zdGFyZ3otc3RvcmUvc3RvcmU6cmVmIl0=
    systemd:
      units:
        - name: "stargz.service"
          enabled: true
          contents: |
            [Unit]
            Description=Stargz Store plugin for CRI-O
            Before=crio.service
            After=network.target
            [Service]
            Type=notify
            Environment=HOME=/root
            Environment=STARGZ_VERSION=v0.13.0
            Environment=STARGZ_SHA256=4f3133a225c424a3dd075029a50efc44d28033099aa27ddf22e48fd2764b5301
            # 1. Ensure fuse kernel module is loaded
            # 2. Download stargz archive and verify checksum
            # 3. Unpack the binary
            ExecStartPre=/bin/sh -xec 'modprobe fuse && \
              curl -o /tmp/stargz.tar.gz -sL https://github.com/containerd/stargz-snapshotter/releases/download/${STARGZ_VERSION}/stargz-snapshotter-${STARGZ_VERSION}-linux-amd64.tar.gz && \
              echo "${STARGZ_SHA256} /tmp/stargz.tar.gz" | sha256sum -c && \
              tar -C /usr/local/bin -xvf /tmp/stargz.tar.gz stargz-store && \
              rm /tmp/stargz.tar.gz'
            # Start stargz-store daemon
            ExecStart=/usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
            ExecStopPost=umount /var/lib/stargz-store/store
            Restart=always
            RestartSec=1
            [Install]
            WantedBy=multi-user.target
This MachineConfig will be rolled out to all worker nodes (selected via the machineconfiguration.openshift.io/role: worker label), i.e. nodes that have the label node-role.kubernetes.io/worker="".
Effectively, it puts two files onto the filesystem of the node: /etc/containers/storage.conf and /etc/systemd/system/stargz.service.
I won’t describe in detail how these Ignition configuration files work (refer to the CoreOS documentation), but I do want to point out that the content of the storage.conf file (the contents.source field) must be data URL encoded.
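If you ever need to regenerate that payload (for example after changing the configuration), it can be produced directly from the plain-text file we used in the prototype:
# turn storage.conf into the data URL expected by Ignition
echo "data:text/plain;charset=utf-8;base64,$(base64 -w0 /etc/containers/storage.conf)"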
The systemd unit has been extended to not only start the stargz-store daemon, but also to download the release archive, verify its checksum and install the executable on the system.
Depending on how often the cluster nodes reboot, how frequently new ones are added and how many nodes the cluster has in total, it might be worth downloading the archive from a local cache or mirror instead.
Once the MachineConfig is injected into the cluster (oc create -f stargz-worker-machineconfig.yaml), the machine-config-operator will update all worker nodes and reboot them (the reboot ensures the stargz and crio services are started in the right order).
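The rollout can be followed via the worker MachineConfigPool, which reports once all machines have been updated:
# watch the machine-config-operator roll the change out to the worker pool
oc get machineconfigpool worker -w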
Afterwards, we can confirm that it has been correctly applied by creating another pod and examining the status of the stargz service on one of the nodes:
$ oc run jdk --image=ghcr.io/stargz-containers/tomcat:10.1.0-jdk17-openjdk-bullseye-esgz
pod/jdk created
$ oc get pods -w
NAME   READY   STATUS              RESTARTS   AGE   IP            NODE
jdk    0/1     ContainerCreating   0          4s    10.76.14.16   standard-zjdmw
jdk    1/1     Running             0          12s   10.76.14.16   standard-zjdmw
$ oc -n default debug node/standard-zjdmw -- chroot /host systemctl status stargz.service --output=cat
● stargz.service - Stargz Store plugin for CRI-O
Loaded: loaded (/etc/systemd/system/stargz.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2022-12-07 09:11:10 UTC; 2min 45s ago
Main PID: 1819824 (stargz-store)
Tasks: 14 (limit: 8601)
Memory: 596.8M
CPU: 9.959s
CGroup: /system.slice/stargz.service
└─ 1819824 /usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
{"level":"debug","msg":"completed to prefetch","time":"2022-12-07T09:13:11.442500820Z"}
{"level":"debug","msg":"completed to prefetch","time":"2022-12-07T09:13:11.442527339Z"}
{"layer_sha":"sha256:9de62bcddd24077a0438f955112816f6f64e01341b7bb862869f37611d338fdc","level":"debug","metrics":"latency","msg":"value=9983.725378 milliseconds","operation":"background_fetch_decompress","time":"2022-12-07T09:13:16.461004747Z"}
{"layer_sha":"sha256:9de62bcddd24077a0438f955112816f6f64e01341b7bb862869f37611d338fdc","level":"debug","metrics":"latency","msg":"value=9983.815997 milliseconds","operation":"background_fetch_total","time":"2022-12-07T09:13:16.461090344Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.472434433Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.472476962Z"}
{"layer_sha":"sha256:89400f0cd35f146443d5592c18622391509d8df109f7e3b68e7e41bf1fa6bf42","level":"debug","metrics":"latency","msg":"value=10442.093411 milliseconds","operation":"background_fetch_decompress","time":"2022-12-07T09:13:16.943490865Z"}
{"layer_sha":"sha256:89400f0cd35f146443d5592c18622391509d8df109f7e3b68e7e41bf1fa6bf42","level":"debug","metrics":"latency","msg":"value=10442.154206 milliseconds","operation":"background_fetch_total","time":"2022-12-07T09:13:16.943546689Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.943564897Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.943576759Z"}
Happy pulling!