Container images are meant to be small and lightweight, since they contain only the required runtime dependencies of an application.
Unfortunately, in the real world this often looks different: it is not uncommon to see container images that are hundreds of megabytes (or even gigabytes!) in size.
One of the issues these large images cause is long delays when starting a container (or pod, in the Kubernetes world) based on such an image.
Traditionally, the image first needs to be fully downloaded and unpacked before the container runtime can start the container.
The Stargz Snapshotter project aims to change this: the special eStargz format for container images (which is fully compliant with the OCI image format) allows lazy pulling of images: on start-up, only the files strictly necessary for running the container’s entrypoint are downloaded and extracted; the rest is fetched on the fly.
For all the details about eStargz, check the introductory blog post.
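If you want to experiment with your own images, the Stargz Snapshotter ecosystem ships converters; here is a minimal sketch using nerdctl (the image names are placeholders) — the examples later in this post simply use pre-converted images:
# convert an existing image to the eStargz format and push it to a registry
nerdctl image convert --estargz --oci ghcr.io/my-org/myapp:1.0 ghcr.io/my-org/myapp:1.0-esgz
nerdctl push ghcr.io/my-org/myapp:1.0-esgz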
From my personal experience, large container images are especially common amongst beginners in the container ecosystem (due to being unfamiliar with all the technical implications) and corporate users (for example because they are required to use specific container images for licensing reasons).
Both of these groups are the target audience for Red Hat’s OpenShift Container Platform.
In this post I want to share my investigations into enabling the Stargz Snapshotter on OpenShift - specifically OKD, the community version of Red Hat’s commercial OpenShift Container Platform (OCP).
In the first part of this post, I will showcase a manual prototype that modifies one of the cluster nodes in place.
The second part of this post covers how to package these manual steps so they can be automatically applied by OpenShift’s node configuration facilities.
Prototype
Let’s start with a quick local prototype on one of the cluster nodes.
Red Hat’s OpenShift uses RHCOS (Red Hat CoreOS) images for all nodes in the cluster.
OKD uses the open-source equivalent Fedora CoreOS (FCOS).
Both of these Linux distributions are special because they come with a read-only root filesystem (except for /etc and /var): all (permanent) changes must be made through CoreOS layering.
Since that is a bit too much effort for a prototype, we’ll just apply the changes locally on the node.
All the following commands will be run on the cluster node, so grab the SSH key for your cluster and run ssh -i cluster-ssh-key core@<NODE-IP>, or grab your admin kubeconfig file and run oc debug node/<NODE-NAME> followed by chroot /host.
I’m using OKD 4.11.0-0.okd-2022-11-05-030711 with Fedora CoreOS 36 and CRI-O 1.24.3 in this example.
The README in the Stargz Snapshotter repository has some setup instructions for containerd, but OpenShift uses CRI-O instead.
This means we cannot use the Stargz Snapshotter plugin, but need to use the Stargz Store instead (despite the different name, both plugins come from the same Git repository):
Stargz Snapshotter is a plugin for containerd, which enables it to perform lazy pulling of eStargz. This is an implementation of remote snapshotter plugin and provides remotely-mounted eStargz layers to containerd. Communication between containerd and Stargz Snapshotter is done with gRPC over unix socket. For more details about Stargz Snapshotter and the relationship with containerd, please refer to the doc.
If you are using CRI-O/Podman, you can’t use Stargz Snapshotter for enabling lazy pulling of eStargz. Instead, use Stargz Store plugin. This is an implementation of additional layer store plugin of CRI-O/Podman. Stargz Store provides remotely-mounted eStargz layers to CRI-O/Podman.
The following instructions are loosely based on “Install Stargz Store for CRI-O/Podman with systemd”.
First, let’s grab the latest release and download the binary onto the node:
VERSION=v0.13.0
# download archive
curl -O -L https://github.com/containerd/stargz-snapshotter/releases/download/${VERSION}/stargz-snapshotter-${VERSION}-linux-amd64.tar.gz
# verify checksum
SHA256=4f3133a225c424a3dd075029a50efc44d28033099aa27ddf22e48fd2764b5301
echo "${SHA256} stargz-snapshotter-${VERSION}-linux-amd64.tar.gz" | sha256sum -c
# extract `stargz-store` binary from the archive
tar -C /usr/local/bin -xvf stargz-snapshotter-${VERSION}-linux-amd64.tar.gz stargz-store
# validate the installation
/usr/local/bin/stargz-store -h
# Usage of /usr/local/bin/stargz-store:
#   -config string
#         path to the configuration file (default "/etc/stargz-store/config.toml")
#   -log-level string
#         set the logging level [trace, debug, info, warn, error, fatal, panic] (default "info")
#   -root string
#         path to the root directory for this snapshotter (default "/var/lib/stargz-store")
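The Stargz Store serves its layer store through a FUSE mount, so it is worth making sure the fuse kernel module is loaded before starting the daemon (on Fedora CoreOS it usually is; the MachineConfig in the second part runs modprobe fuse for exactly this reason):
# load the fuse kernel module (a no-op if it is already loaded or built in)
modprobe fuse
lsmod | grep fuse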
Next, we need to modify the CRI-O configuration such that it uses the Stargz Store plugin for fetching and unpacking images:
# make a backup of the original version (deployed by the machine-config-operator)
cp /etc/containers/storage.conf /etc/containers/storage.conf.bak
# config based on https://github.com/containerd/stargz-snapshotter/blob/aaa46a75dd97e401025f82630c9d3d4e41c9f670/script/config-cri-o/etc/containers/storage.conf
cat > /etc/containers/storage.conf <<EOF
[storage]
driver = "overlay"
graphroot = "/var/lib/containers/storage"
runroot = "/run/containers/storage"
[storage.options]
additionallayerstores = ["/var/lib/stargz-store/store:ref"]
EOF
The final step is setting up a systemd unit that will run the “Stargz Store” daemon.
# systemd unit based on https://github.com/containerd/stargz-snapshotter/blob/main/script/config-cri-o/etc/systemd/system/stargz-store.service
cat > /etc/systemd/system/stargz-store.service <<EOF
[Unit]
Description=Stargz Store plugin for CRI-O
After=network.target
Before=crio.service
[Service]
Type=notify
Environment=HOME=/root
ExecStart=/usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
ExecStopPost=umount /var/lib/stargz-store/store
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
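Note that the unit points at /etc/stargz-store/config.toml, which we never created; judging from the status output below, the daemon starts fine without it and falls back to its defaults. If you prefer the referenced path to exist, an empty file is enough:
# optional: create the (empty) configuration file referenced by the systemd unit
mkdir -p /etc/stargz-store
touch /etc/stargz-store/config.toml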
Now all that’s left to do is restarting the involved services and checking the log output:
systemctl daemon-reload
systemctl restart stargz-store crio
systemctl status stargz-store crio --output=cat
● stargz-store.service - Stargz Store plugin for CRI-O
Loaded: loaded (/etc/systemd/system/stargz-store.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2022-11-28 12:35:41 UTC; 2s ago
Main PID: 194953 (stargz-store)
Tasks: 9 (limit: 8601)
Memory: 7.6M
CPU: 79ms
CGroup: /system.slice/stargz-store.service
└─ 194953 /usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
Starting stargz-store.service - Stargz Store plugin for CRI-O...
{"level":"warning","msg":"content verification is not supported; switching to non-verification mode","time":"2022-11-28T12:35:41.543247929Z"}
{"level":"debug","msg":"SdNotifyReady notified=true, err=\u003cnil\u003e","time":"2022-11-28T12:35:41.580849701Z"}
Started stargz-store.service - Stargz Store plugin for CRI-O.
● crio.service - Container Runtime Interface for OCI (CRI-O)
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/crio.service.d
└─10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
Active: active (running) since Mon 2022-11-28 12:35:43 UTC; 52ms ago
Docs: https://github.com/cri-o/cri-o
Main PID: 195096 (crio)
Tasks: 10
Memory: 22.7M
CPU: 821ms
CGroup: /system.slice/crio.service
└─ 195096 /usr/bin/crio
Got pod network &{Name:ingress-canary-fl7g5 Namespace:openshift-ingress-canary ID:41c9ecbda2cca10a7b18a35c26bce>
Checking pod openshift-ingress-canary_ingress-canary-fl7g5 for CNI network multus-cni-network (type=multus)
Got pod network &{Name:fluentd-k8k2d Namespace:openshift-logging ID:9d1a30acf6abdfc05d0a98edbb66771795ea8c2b722>
Checking pod openshift-logging_fluentd-k8k2d for CNI network multus-cni-network (type=multus)
Got pod network &{Name:dns-default-96bsf Namespace:openshift-dns ID:a18d86d30eb4f23d71b0a7dc5988c5edfcaf40780fa>
Checking pod openshift-dns_dns-default-96bsf for CNI network multus-cni-network (type=multus)
Got pod network &{Name:network-metrics-daemon-bk7s2 Namespace:openshift-multus ID:8cbf5cf6f092a6b58901f5d500105>
Checking pod openshift-multus_network-metrics-daemon-bk7s2 for CNI network multus-cni-network (type=multus)
Serving metrics on :9537 via HTTP
Started crio.service - Container Runtime Interface for OCI (CRI-O).
Both services should display active (running).
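As an additional sanity check, the layer store should now be backed by a FUSE mount on the path we passed to stargz-store:
# the store directory should show up as a FUSE mount
mount | grep /var/lib/stargz-store/store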
Now it is time for our first unscientific test: pulling an eStargz-optimized image and comparing the pull time to the “regular” version.
In the following examples I’ll be using images from the list of pre-converted images.
$ crictl rmi --prune # make sure we don't already have any blobs locally
$ time crictl pull ghcr.io/stargz-containers/python:3.10-org
Image is up to date for ghcr.io/stargz-containers/python@sha256:b1c16e981e9d711ed60f56ab6227687b92e8671744d542dbdca80be9be7a875c
real 0m22.805s
user 0m0.039s
sys 0m0.029s
$ crictl rmi --prune
Deleted: ghcr.io/stargz-containers/python:3.10-org
Deleted: docker.io/library/busybox:latest
$ time crictl pull ghcr.io/stargz-containers/python:3.10-esgz
Image is up to date for ghcr.io/stargz-containers/python@sha256:167721f6ae9e2609293f122e3fd14df35e39960ac0cf530b43d4aded77d08783
real 0m8.364s
user 0m0.032s
sys 0m0.028s
Less than half the time for pulling the image!
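If you want to double-check that the faster pull really went through the Stargz Store, its debug logs should show the eStargz layers being resolved and prefetched:
# inspect the most recent stargz-store log messages
journalctl -u stargz-store -n 20 --output=cat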
Let’s see if we can also observe a difference when creating a Kubernetes pod.
Note that we need to specify spec.nodeName to ensure the pod gets scheduled on the node we just prepared.
NODE_NAME=<NODE-NAME>
oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-estargz
spec:
  containers:
    - image: ghcr.io/stargz-containers/wordpress:5.9.2-esgz
      name: test
  nodeName: ${NODE_NAME}
---
apiVersion: v1
kind: Pod
metadata:
  name: test-regular
spec:
  containers:
    - image: ghcr.io/stargz-containers/wordpress:5.9.2-org
      name: test
  nodeName: ${NODE_NAME}
EOF
oc get pods -o wide --watch
NAME           READY   STATUS              RESTARTS   AGE   IP            NODE
test-estargz   0/1     ContainerCreating   0          1s    <none>        standard-zjdmw
test-regular   0/1     ContainerCreating   0          1s    <none>        standard-zjdmw
test-estargz   0/1     ContainerCreating   0          3s    <none>        standard-zjdmw
test-regular   0/1     ContainerCreating   0          2s    <none>        standard-zjdmw
test-estargz   1/1     Running             0          8s    10.76.14.21   standard-zjdmw
test-regular   1/1     Running             0          19s   10.76.14.22   standard-zjdmw
From this output we can see that both containers were scheduled on the prepared node at the same time (1s after creation).
The container using the eStargz-formatted image started after just 8 seconds, whereas the regular image took 19 seconds.
We can therefore conclude that the prototype is working!
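The kubelet also records the pull duration in the pod events, which gives a second data point without having to stopwatch the oc get pods output:
# the "Successfully pulled image ... in <duration>" events contain the pull times
oc describe pod test-estargz | grep -i pulled
oc describe pod test-regular | grep -i pulled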
The Zen of OpenShift
OpenShift is an operator-based platform: every configuration and deployment change should be described declaratively in the Kubernetes API and implemented by operators (a.k.a. controllers).
This is completely at odds with us going onto a node and monkey-patching config files and system services!
Let’s fix that now by putting those modifications into OpenShift’s node configuration facilities.
In principle, we should use the ContainerRuntimeConfig API for managing the configuration of CRI-O and its subcomponents.
Unfortunately, at the moment this API supports just a handful of fields, whereas /etc/containers/storage.conf is completely hardcoded in the machine-config-operator.
Similarly, the underlying library currently does not support drop-in configurations such as /etc/containers/conf.d/my-config-override.conf.
Therefore, we’ll need to overwrite this file with a custom MachineConfig:
# stargz-worker-machineconfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-stargz
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    # https://coreos.github.io/ignition/examples/#create-files-on-the-root-filesystem
    storage:
      files:
        - path: "/etc/containers/storage.conf"
          mode: 420 # corresponds to 644 in octal
          overwrite: true
          # Base64 encoded version of storage.conf in data URL scheme
          # https://www.rfc-editor.org/rfc/rfc2397
          contents:
            source: data:text/plain;charset=utf-8;base64,W3N0b3JhZ2VdCmRyaXZlciA9ICJvdmVybGF5IgpncmFwaHJvb3QgPSAiL3Zhci9saWIvY29udGFpbmVycy9zdG9yYWdlIgpydW5yb290ID0gIi9ydW4vY29udGFpbmVycy9zdG9yYWdlIgoKW3N0b3JhZ2Uub3B0aW9uc10KYWRkaXRpb25hbGxheWVyc3RvcmVzID0gWyIvdmFyL2xpYi9zdGFyZ3otc3RvcmUvc3RvcmU6cmVmIl0=
    systemd:
      units:
        - name: "stargz.service"
          enabled: true
          contents: |
            [Unit]
            Description=Stargz Store plugin for CRI-O
            Before=crio.service
            After=network.target
            [Service]
            Type=notify
            Environment=HOME=/root
            Environment=STARGZ_VERSION=v0.13.0
            Environment=STARGZ_SHA256=4f3133a225c424a3dd075029a50efc44d28033099aa27ddf22e48fd2764b5301
            # 1. Ensure fuse kernel module is loaded
            # 2. Download stargz archive and verify checksum
            # 3. Unpack the binary
            ExecStartPre=/bin/sh -xec 'modprobe fuse && \
              curl -o /tmp/stargz.tar.gz -sL https://github.com/containerd/stargz-snapshotter/releases/download/${STARGZ_VERSION}/stargz-snapshotter-${STARGZ_VERSION}-linux-amd64.tar.gz && \
              echo "${STARGZ_SHA256} /tmp/stargz.tar.gz" | sha256sum -c && \
              tar -C /usr/local/bin -xvf /tmp/stargz.tar.gz stargz-store && \
              rm /tmp/stargz.tar.gz'
            # Start stargz-store daemon
            ExecStart=/usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
            ExecStopPost=umount /var/lib/stargz-store/store
            Restart=always
            RestartSec=1
            [Install]
            WantedBy=multi-user.target
This MachineConfig will be rolled out to all worker nodes (selected via the machineconfiguration.openshift.io/role: worker label), i.e. nodes that have the label node-role.kubernetes.io/worker="".
Effectively, it puts two files onto the filesystem of the node: /etc/containers/storage.conf and /etc/systemd/system/stargz.service.
I won’t describe in detail how these Ignition configuration files work (refer to the CoreOS documentation), but I do want to point out that the content of the storage.conf file (the contents.source field) must be data URL encoded.
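If you ever need to regenerate that payload (for example after changing the configuration), it can be produced directly from the plain-text file we used in the prototype:
# turn storage.conf into the data URL expected by Ignition
echo "data:text/plain;charset=utf-8;base64,$(base64 -w0 /etc/containers/storage.conf)"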
The systemd unit has been extended to not only start the stargz-store daemon, but also to download the release archive, verify its checksum and install the executable on the system.
Depending on how often the cluster nodes reboot, how frequently new ones are added and how many nodes the cluster has in total, it might be worth downloading the archive from a local cache or mirror instead.
Once the MachineConfig is injected into the cluster (oc create -f stargz-worker-machineconfig.yaml), the machine-config-operator will update all worker nodes and reboot them (the reboot ensures the stargz and crio services are started in the right order).
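The rollout can be followed via the worker MachineConfigPool, which reports once all machines have been updated:
# watch the machine-config-operator roll the change out to the worker pool
oc get machineconfigpool worker -w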
Afterwards, we can confirm that it has been correctly applied by creating another pod and examining the status of the stargz service on one of the nodes:
$ oc run jdk --image=ghcr.io/stargz-containers/tomcat:10.1.0-jdk17-openjdk-bullseye-esgz
pod/jdk created
$ oc get pods -w
NAME   READY   STATUS              RESTARTS   AGE   IP            NODE
jdk    0/1     ContainerCreating   0          4s    10.76.14.16   standard-zjdmw
jdk    1/1     Running             0          12s   10.76.14.16   standard-zjdmw
$ oc -n default debug node/standard-zjdmw -- chroot /host systemctl status stargz.service --output=cat
● stargz.service - Stargz Store plugin for CRI-O
Loaded: loaded (/etc/systemd/system/stargz.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2022-12-07 09:11:10 UTC; 2min 45s ago
Main PID: 1819824 (stargz-store)
Tasks: 14 (limit: 8601)
Memory: 596.8M
CPU: 9.959s
CGroup: /system.slice/stargz.service
└─ 1819824 /usr/local/bin/stargz-store --log-level=debug --config=/etc/stargz-store/config.toml /var/lib/stargz-store/store
{"level":"debug","msg":"completed to prefetch","time":"2022-12-07T09:13:11.442500820Z"}
{"level":"debug","msg":"completed to prefetch","time":"2022-12-07T09:13:11.442527339Z"}
{"layer_sha":"sha256:9de62bcddd24077a0438f955112816f6f64e01341b7bb862869f37611d338fdc","level":"debug","metrics":"latency","msg":"value=9983.725378 milliseconds","operation":"background_fetch_decompress","time":"2022-12-07T09:13:16.461004747Z"}
{"layer_sha":"sha256:9de62bcddd24077a0438f955112816f6f64e01341b7bb862869f37611d338fdc","level":"debug","metrics":"latency","msg":"value=9983.815997 milliseconds","operation":"background_fetch_total","time":"2022-12-07T09:13:16.461090344Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.472434433Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.472476962Z"}
{"layer_sha":"sha256:89400f0cd35f146443d5592c18622391509d8df109f7e3b68e7e41bf1fa6bf42","level":"debug","metrics":"latency","msg":"value=10442.093411 milliseconds","operation":"background_fetch_decompress","time":"2022-12-07T09:13:16.943490865Z"}
{"layer_sha":"sha256:89400f0cd35f146443d5592c18622391509d8df109f7e3b68e7e41bf1fa6bf42","level":"debug","metrics":"latency","msg":"value=10442.154206 milliseconds","operation":"background_fetch_total","time":"2022-12-07T09:13:16.943546689Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.943564897Z"}
{"level":"debug","msg":"completed to fetch all layer data in background","time":"2022-12-07T09:13:16.943576759Z"}
Happy pulling!