ArgoCD Health Checks for OPA rules

ArgoCD is commonly used to deploy and manage resources in Kubernetes clusters. One nice feature of ArgoCD is that it continuously monitors the status of your resources (unlike, for example, Helm, which just creates the resources). For example, when your Deployment fails to scale up to the desired number of replicas, ArgoCD will mark the Deployment as “Unhealthy”.

ArgoCD has many resource health checks already built in, but of course it cannot possibly cover the breadth of all resources available for Kubernetes. Furthermore, even if ArgoCD knows how to check the status of a resource, there might be additional constraints or conditions that ArgoCD does not take into account. One such example is policy rules for Open Policy Agent (OPA), which are stored in ConfigMaps when using OPA with kube-mgmt. In this case, the ConfigMap is always “Healthy” as long as it exists. However, OPA adds an annotation to the ConfigMap which indicates whether it was able to successfully parse the policy stored in the ConfigMap. Thus, I wanted to reflect this information in ArgoCD.
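To make this concrete, here is a rough sketch of what such a ConfigMap might look like once OPA has processed it. The ConfigMap name and the data key are hypothetical placeholders; only the annotation key and its success value are the ones used later in this post.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-policy          # hypothetical name
  annotations:
    # Written after the policy was loaded; holds an error report instead on failure
    openpolicyagent.org/policy-status: '{"status":"ok"}'
data:
  example.rego: |               # hypothetical key
    package kubernetes.admission
    # ...
```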

Why? Because it makes troubleshooting simpler and provides quicker feedback: it is much easier to spot an ArgoCD application in state “Unhealthy” (with a big red heart if you’re using the UI) than to go through all its associated ConfigMaps and check the value of a particular annotation.

My first step was adding an invalid snippet into one of our OPA policies and deploying it to the cluster:

package kubernetes.admission
import data.kubernetes.storageclasses

=this{}is[]invalid
# ...

As this picture shows, ArgoCD still believes the resource is “Healthy” and therefore marks the entire application as “Healthy”.

Next, I set out to write a custom health check for ArgoCD. ArgoCD health checks (and resource actions) are written in Lua, a lightweight scripting language. If you are new to Lua (like me!), I recommend reading these short introductions from “Programming in Lua”: Types and Values as well as Tables. The health checks already defined in the ArgoCD repo serve as good examples as well.

After a bit of trial-and-error, I came up with the following snippet:

hs = {}
hs.status = "Healthy" -- default, so non-OPA ConfigMaps are unaffected
local opa_annotation = "openpolicyagent.org/policy-status"
if obj.metadata.annotations ~= nil then
  if obj.metadata.annotations[opa_annotation] ~= nil then -- annotation present: this is an OPA policy
    if obj.metadata.annotations[opa_annotation] == '{"status":"ok"}' then
      hs.status = "Healthy"
      hs.message = "Policy loaded successfully"
    else
      hs.status = "Degraded"
      hs.message = obj.metadata.annotations[opa_annotation] -- surface OPA's error report
    end
  end
end
return hs

hs is the health status object we will return to ArgoCD. It must contain a status attribute which indicates whether the resource is Healthy, Progressing, Degraded or Suspended. By default we set it to Healthy, since we don’t want to mess with the status of other, non-OPA ConfigMaps. Optionally, the health status object may also contain a message.

The two outer if statements determine whether the ConfigMap is indeed an OPA policy or another kind of ConfigMap: only ConfigMaps processed by OPA carry the openpolicyagent.org/policy-status annotation. If it is an OPA policy, we inspect the value of that annotation. It is set to {"status":"ok"} if the policy was loaded successfully. If errors occurred during loading (e.g., because the policy contained a syntax error), the cause is reported in the annotation instead. Depending on the value of the annotation, we set the status and message attributes appropriately.

At the end, we return the hs object to ArgoCD.
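If you want to try the logic outside of ArgoCD, one option is to wrap it in a function and call it with hand-built tables in a plain Lua interpreter. The sketch below is only that: check_health and the sample objects are hypothetical names of mine, and the error annotation value is a made-up placeholder rather than real OPA output. ArgoCD itself passes the resource in as the global obj and expects the script to return hs.

```lua
local OPA_ANNOTATION = "openpolicyagent.org/policy-status"

-- Same decision logic as the health check above, wrapped in a
-- (hypothetical) function so it can be called with sample objects.
local function check_health(obj)
  local hs = { status = "Healthy" }
  local annotations = obj.metadata.annotations
  if annotations ~= nil and annotations[OPA_ANNOTATION] ~= nil then
    if annotations[OPA_ANNOTATION] == '{"status":"ok"}' then
      hs.message = "Policy loaded successfully"
    else
      hs.status = "Degraded"
      hs.message = annotations[OPA_ANNOTATION]
    end
  end
  return hs
end

-- A plain ConfigMap without annotations: stays Healthy, no message.
local plain = check_health({ metadata = { name = "app-config" } })

-- A policy that was loaded successfully.
local ok = check_health({
  metadata = { annotations = { [OPA_ANNOTATION] = '{"status":"ok"}' } },
})

-- A policy that failed to load (placeholder annotation value).
local broken = check_health({
  metadata = { annotations = { [OPA_ANNOTATION] = '{"status":"error"}' } },
})

print(plain.status, plain.message)
print(ok.status, ok.message)
print(broken.status, broken.message)
```

The first call shows why the default matters: a ConfigMap that has nothing to do with OPA falls through both checks and keeps the Healthy status.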

The only step left is telling ArgoCD about our custom health check. This is done by adding the snippet above to the ArgoCD configuration in the argocd-cm ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  resource.customizations.health.ConfigMap: |
    <our-snippet>

This health check will apply to all resources of type ConfigMap. Note that the key usually follows the format GROUP_KIND (like argoproj.io_Application); however, since core resources don’t have a group, only the kind is used as an identifier.

If you are using the ArgoCD Helm chart you can directly inject the snippet into the argo-cd.server.config value.
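As a rough sketch, the corresponding values file could look like the following. This assumes the chart is pulled in as a dependency named argo-cd and that your chart version exposes server.config; other chart versions may use a different key for the argocd-cm contents, so check the chart's values before copying this.

```yaml
argo-cd:
  server:
    config:
      # Rendered into the argocd-cm ConfigMap by the chart (assumption, see above)
      resource.customizations.health.ConfigMap: |
        <our-snippet>
```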

After ArgoCD reloads its configuration, it shows our OPA policy as “Unhealthy”!

In addition, we get a nice error message in ArgoCD which immediately tells us what’s wrong with the resource:

Happy (and healthy!) deploying!