Degraded Operator-Lifecycle-Manager-Packageserver ClusterOperator on OpenShift

Last week all of our OpenShift (OKD) clusters started alerting us about the same degraded condition:

alertname = ClusterOperatorDown
name = operator-lifecycle-manager-packageserver
namespace = openshift-cluster-version
openshift_io_alert_source = platform
prometheus = openshift-monitoring/k8s
reason = ClusterServiceVersionNotSucceeded
severity = critical
description = The operator-lifecycle-manager-packageserver operator may be down or disabled because $ClusterServiceVersionNotSucceeded, and the components it manages may be unavailable or degraded.
summary = Cluster operator has not been available for 10 minutes.

The clusters were running fine and user workloads were unaffected, but one cluster operator, operator-lifecycle-manager-packageserver, reported a degraded status:

$ oc get -o yaml clusteroperator operator-lifecycle-manager-packageserver

ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver
      observed in phase Failed with reason: APIServiceResourceIssue, message: found
      the serving cert not active'

Searching the web revealed that we are not the only ones encountering this issue (see references at the bottom). To allow secure and encrypted communication between the OLM package server and the rest of the control plane components, a certificate is generated. Usually OpenShift is very good about automatically rotating certificates before they expire, but that was not the case here (upstream bug: OCPBUGS-25341). Checking the validity dates of the certificate shows when it expired:

$ oc get secret packageserver-service-cert -n openshift-operator-lifecycle-manager  -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -dates

notBefore=Feb 24 10:07:06 2022 GMT
notAfter=Feb 23 10:07:05 2024 GMT
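To catch this before the alert fires, the `notAfter` line can be turned into a number of remaining days. The following is a minimal sketch, assuming bash and GNU `date` (the `-d` option); the function name `days_until_expiry` and the fixed reference time are illustrative and not part of OpenShift or OLM.

```shell
#!/usr/bin/env bash
# Print the whole days left until a certificate expires (negative once expired).
#   $1: a "notAfter=..." line as printed by `openssl x509 -noout -enddate`
#   $2: optional reference time, anything GNU `date -d` accepts (default: now)
# Hypothetical helper; assumes GNU date.
days_until_expiry() {
    local not_after="${1#notAfter=}"   # strip the "notAfter=" prefix
    local now="${2:-now}"
    local end_s now_s
    end_s=$(date -u -d "$not_after" +%s) || return 1
    now_s=$(date -u -d "$now" +%s) || return 1
    echo $(( (end_s - now_s) / 86400 ))
}

# Example with the date from the output above and a fixed reference time:
days_until_expiry "notAfter=Feb 23 10:07:05 2024 GMT" "2024-02-20T00:00:00Z"   # prints 3
```

On a live cluster the first argument could come from the same pipeline as above, using `openssl x509 -noout -enddate` instead of `-dates`.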

Generating a fresh certificate is easy enough by deleting the existing secret:

oc delete secret packageserver-service-cert -n openshift-operator-lifecycle-manager

After a couple of seconds a new certificate is generated, the Operator Lifecycle Manager picks it up automatically, and the control plane is happy again:

$ oc get co

operator-lifecycle-manager-packageserver   4.13.0-0.okd-2023-09-30-084937   True        False         False      2s
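With many clusters to check, it can help to filter the `oc get co` table for operators that are not healthy. The sketch below assumes the usual column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, ...) and that the VERSION column is populated; the function name `unhealthy_operators` is illustrative, and the second sample row is made-up input for demonstration.

```shell
#!/usr/bin/env bash
# Read `oc get co` output on stdin and print the names of operators that are
# not Available or are Degraded. Assumes columns:
#   NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE ...
# Hypothetical helper, not an oc feature.
unhealthy_operators() {
    awk '$1 != "NAME" && ($3 != "True" || $5 != "False") { print $1 }'
}

# Demonstration with sample rows (the second row is invented for illustration):
printf '%s\n' \
    'operator-lifecycle-manager                 4.13.0-0.okd-2023-09-30-084937   True    False   False   2d' \
    'operator-lifecycle-manager-packageserver   4.13.0-0.okd-2023-09-30-084937   False   True    False   12m' \
    | unhealthy_operators
```

Against a real cluster this would be used as `oc get co --no-headers | unhealthy_operators`; an empty result means every operator is available and not degraded.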

Why did this happen now? And why on all of the clusters at the same time? Some research revealed that the Operator-Lifecycle-Manager package-server-manager component was first introduced with OpenShift 4.9. This can be confirmed by looking at the creation timestamp of the related deployment:

$ oc -n openshift-operator-lifecycle-manager get deploy package-server-manager -o jsonpath='{.metadata.creationTimestamp}'
2022-02-24T09:46:24Z

and comparing to the date we upgraded our clusters to release 4.9:

$ oc get clusterversion version -o yaml

status:
  history:
  ...
  - completionTime: "2022-02-24T10:47:28Z"
    startedTime: "2022-02-24T09:14:52Z"
    state: Completed
    verified: false
    version: 4.9.0-0.okd-2022-02-12-140851

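The timestamps match! The package-server-manager deployment was created while the upgrade to 4.9 was running, and the certificate's notBefore (Feb 24 10:07:06 2022 GMT) falls inside the same upgrade window, with its notAfter almost exactly two years later. That explains why all our clusters (which are deployed in completely isolated environments) encountered this condition at the same time: they were upgraded to 4.9 on the same day, so their certificates expired on the same day as well.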
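The comparison can also be done mechanically rather than by eye. This is a small sketch, assuming bash and GNU `date`; `within_window` is an illustrative helper name, and the timestamps are the ones shown in the outputs above.

```shell
#!/usr/bin/env bash
# Succeed if timestamp $1 lies within the closed interval [$2, $3].
# All three arguments may be in any format GNU `date -d` understands.
# Hypothetical helper; returns 2 if a timestamp cannot be parsed.
within_window() {
    local t s e
    t=$(date -u -d "$1" +%s) || return 2
    s=$(date -u -d "$2" +%s) || return 2
    e=$(date -u -d "$3" +%s) || return 2
    [ "$t" -ge "$s" ] && [ "$t" -le "$e" ]
}

# Upgrade window taken from the clusterversion history
upgrade_start="2022-02-24T09:14:52Z"
upgrade_end="2022-02-24T10:47:28Z"

# Deployment creationTimestamp and certificate notBefore
for ts in "2022-02-24T09:46:24Z" "Feb 24 10:07:06 2022 GMT"; do
    if within_window "$ts" "$upgrade_start" "$upgrade_end"; then
        echo "$ts: inside the upgrade window"
    else
        echo "$ts: outside the upgrade window"
    fi
done
```

Both timestamps fall inside the window, which matches the manual comparison.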

# References