Today I Learned: kubectl output varies based on kubeconfig
While working on extending a Kubernetes operator built with the Operator SDK, I came across some inexplicable behavior in kubectl (the Kubernetes CLI) and oc (the OpenShift CLI), which I want to share in this post.
The details of the operator are not relevant here, but my core goal was retrieving a list of all LoadBalancer services used in the cluster.
Ideally, I would use something along these lines:
$ kubectl get services --field-selector=spec.type=LoadBalancer
Error from server (BadRequest): Unable to find "/v1, Resource=services" that match label selector "", field selector "spec.type=LoadBalancer": "spec.type" is not a known field selector: only "metadata.name", "metadata.namespace"
Unfortunately, that is not possible because field selectors are implemented per resource (Pod, Deployment, Service, etc.), and the Service resource only implements field selectors for the resource name and namespace (see Kubernetes issue #77662).
Instead, I needed to use a little shell pipe to extract the desired information:
kubectl get services -A | grep 'LoadBalancer' | awk '{print $5}' | grep -iv 'pending' | sort | uniq
This will print a list of unique load balancer IP addresses used across all namespaces, line-by-line.
The kubectl output should have the following format:
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT AGE
default kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 35d
ingress router-lb-ingress-1 LoadBalancer 172.30.216.160 123.123.123.123 443:32274/TCP 3d23h
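Since the pipeline relies on whitespace-separated columns, it can be sanity-checked offline against a captured sample of this output. The sketch below reuses the rows above plus a made-up row with a pending external IP:

```shell
# Feed a captured sample of `kubectl get services -A` output through the
# same pipeline; only the non-pending LoadBalancer IPs should survive.
sample='NAMESPACE   NAME                 TYPE           CLUSTER-IP       EXTERNAL-IP       PORT(S)         AGE
default     kubernetes           ClusterIP      172.30.0.1       <none>            443/TCP         35d
ingress     router-lb-ingress-1  LoadBalancer   172.30.216.160   123.123.123.123   443:32274/TCP   3d23h
ingress     router-lb-ingress-2  LoadBalancer   172.30.216.161   <pending>         443:32275/TCP   3d23h'
printf '%s\n' "$sample" \
  | grep 'LoadBalancer' | awk '{print $5}' | grep -iv 'pending' | sort | uniq
# → 123.123.123.123
```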
After building the container image for the operator and pushing it into the cluster, my beautiful little shell script no longer returned any output – at least not in the environment the Operator SDK executed it in. When I exec'd into the container and ran the script manually, it worked just fine. I was banging my head against the table, trying different inputs and making sure I was correctly capturing the output, but none of it helped.
Eventually, I printed the environment variables available while running the script and found two interesting entries:
KUBECONFIG=/tmp/kubeconfig848223182
K8S_AUTH_KUBECONFIG=/tmp/kubeconfig848223182
It seems that the Operator SDK framework creates a temporary kubeconfig every time the (Ansible) operator is invoked. Unfortunately, this temporary file is deleted again immediately, so I had to steal it at runtime:
kubectl exec -it pod/landb-operator -- sh
while true; do cp /tmp/kubeconfig* /tmp/stolen-kubeconfig 2> /dev/null && break; done
This runs the cp command endlessly in a fast loop. Most of the time it fails because there is no kubeconfig file under /tmp (the redirect discards cp's error messages). However, once the kubeconfig file appears, cp succeeds (exit code 0) and break exits the loop.
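The same pattern can be tried out locally without a cluster. Here is a self-contained sketch in a scratch directory, where a background job stands in for the Operator SDK dropping the temporary kubeconfig (the file name and contents are made up):

```shell
# A background job "drops" a kubeconfig-like file after a moment;
# the loop copies it the instant it appears, then exits.
dir=$(mktemp -d)
( sleep 0.3; echo 'fake kubeconfig' > "$dir/kubeconfig-demo" ) &
while true; do
  # fails (quietly) until the glob matches an actual file
  cp "$dir"/kubeconfig* "$dir/stolen-kubeconfig" 2> /dev/null && break
  sleep 0.1   # small sleep so the loop doesn't peg a CPU core
done
cat "$dir/stolen-kubeconfig"
# → fake kubeconfig
```

The short sleep is the only real difference from the loop above; without it the busy loop burns CPU for no benefit.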
Now, I can use this kubeconfig file (inside the container) and look at the output myself:
KUBECONFIG=/tmp/stolen-kubeconfig oc get services -A
NAMESPACE NAME AGE
openshift-cern-node-problem-detector node-problem-detector 35d
openshift-authentication oauth-openshift 35d
The output is quite different from the expected format (see above): all the detailed information about the services is missing! I looked into several possibilities that could cause this difference, but I couldn't find an explanation.
When no kubeconfig is found or given, kubectl defaults to the so-called “in-cluster configuration”.
The in-cluster configuration checks for a service account token at /var/run/secrets/kubernetes.io/serviceaccount/token as well as the two environment variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT. When all three are present, kubectl knows it is running inside a Kubernetes cluster and uses the injected credentials to talk to it.
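That detection boils down to three checks, which can be sketched directly in shell (a simplification of what the client-go library actually does; the echoed messages are mine):

```shell
# In-cluster detection, roughly as kubectl/client-go performs it:
# a mounted service account token plus two injected env variables.
token=/var/run/secrets/kubernetes.io/serviceaccount/token
if [ -f "$token" ] && [ -n "$KUBERNETES_SERVICE_HOST" ] && [ -n "$KUBERNETES_SERVICE_PORT" ]; then
  echo 'in-cluster configuration available'
else
  echo 'no in-cluster configuration'
fi
```

Run on a workstation outside any cluster, this prints "no in-cluster configuration"; inside a pod with a mounted service account, all three checks pass.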
Both configurations access the cluster with the same account, so it does not seem to be a permissions issue:
$ KUBECONFIG= oc whoami
system:serviceaccount:openshift-cern-landb:landb-operator
$ KUBECONFIG=/tmp/stolen-kubeconfig oc whoami
system:serviceaccount:openshift-cern-landb:landb-operator
$ KUBECONFIG= oc config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
$ KUBECONFIG=/tmp/stolen-kubeconfig oc config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
* openshift-cern-landb/proxy-server proxy-server admin/proxy-server
However, I got a slight hint when running the command with increased log verbosity:
$ KUBECONFIG=/tmp/stolen-kubeconfig oc get services -v=4
I1108 13:39:58.822831 58307 merged_client_builder.go:163] Using in-cluster namespace
I1108 13:39:58.845723 58307 table_printer.go:45] Unable to decode server response into a Table. Falling back to hardcoded types: attempt to decode non-Table object
NAME AGE
landb-operator-metrics 153m
$ KUBECONFIG= oc get services -v=4
I1108 13:40:08.533324 58315 merged_client_builder.go:163] Using in-cluster namespace
I1108 13:40:08.533521 58315 merged_client_builder.go:121] Using in-cluster configuration
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
landb-operator-metrics ClusterIP 172.30.123.157 <none> 8383/TCP,8686/TCP 153m
$ oc version
Client Version: v4.2.0-alpha.0-1007-g25914b8
This line only appears when explicitly setting the kubeconfig file, not when using the in-cluster configuration:
Unable to decode server response into a Table. Falling back to hardcoded types: attempt to decode non-Table object
This seems like a bug, but I haven't found any other references to it.
In any case, no matter the issue, there is an easy way to fix it:
$ oc get services -A -o custom-columns=EXTERNAL-IP:.status.loadBalancer.ingress[*].ip | grep -v '<none>'
EXTERNAL-IP
123.123.123.123
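One caveat: when a service has several ingress IPs, custom-columns with [*] joins them into one comma-separated cell. A captured sample (made-up addresses) shows how to split those apart and also drop the header line:

```shell
# Hypothetical captured output of the custom-columns command above:
# drop <none> rows, drop the header, split comma-joined IPs, dedupe.
printf '%s\n' 'EXTERNAL-IP' '<none>' '123.123.123.123' '123.123.123.123,124.124.124.124' \
  | grep -v '<none>' | tail -n +2 | tr ',' '\n' | sort -u
# → 123.123.123.123
# → 124.124.124.124
```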
Lesson learned: never trust CLI output unless explicitly defined!