Effective End-to-End Testing with BATS

In this post I want to share with you how you can use BATS - the Bash Automated Testing System - to create an end-to-end test suite for Kubernetes components. BATS can be used for many different purposes: testing command line tools, the behaviour of entire systems, and even APIs. To illustrate the capabilities of BATS, this post will show some practical examples for the kinds of tests that can be implemented using this framework.

I like BATS because it reduces a lot of boilerplate compared to writing your own test suite from scratch (with regular shell scripts) or using language-specific testing frameworks such as JUnit or Pytest. At the same time, it gives us a lot of flexibility and power because all we are doing is writing shell commands: this means we can use all our favorite tools such as grep, jq, awk and friends. This comes in extremely handy especially when you’re already familiar with shell scripting.

In this post I will walk you through how I created the test suite for my project restic-k8s: a project that brings the capabilities of the restic backup tool to Kubernetes. In a nutshell, restic-k8s offers cronjobs that run the common tasks of creating backups from persistent storage (PVCs), deleting old backups from remote storage, keeping an eye on everything and alerting the administrator when the system encounters an error.

I’d like to ensure this workflow keeps working by creating an end-to-end test suite that runs these actions in a real Kubernetes environment. BATS can help us to mirror the actions of the end user: installing a Helm chart, creating PVCs with data, creating backups etc.

#  Installing BATS

Because BATS and the various helper libraries are written purely in Bash, it’s very easy to install BATS on any system. The only dependency is Bash version 3.2+ (see the Support Matrix). The BATS installation page documents various approaches for installing BATS; here is the one I prefer:

# install BATS via package manager
brew install bats-core # macOS
apt install bats # Debian/Ubuntu
dnf install bats-core # Fedora

# add helper libraries as submodules
mkdir e2e-tests && cd e2e-tests
git submodule add https://github.com/bats-core/bats-support
git submodule add https://github.com/bats-core/bats-assert
git submodule add https://github.com/bats-core/bats-detik
git commit -m "Add BATS helper libraries for E2E tests"

Of course, since all the BATS executables are actually just shell scripts (not binary files), it is also possible to add these files directly to the source code repository (a.k.a. “vendoring”).

#  Getting started

BATS test suites look very similar to regular Bash scripts: a shebang at the top, some load directives to include definitions from other files, and then the test cases themselves. In BATS, test cases are declared with the @test keyword (though it should be mentioned that regular Bash functions can still be used alongside them!).

Here is a trivial example:

#!/usr/bin/env bats

@test "Test 1" {
    echo "This test succeeds"
    sleep 2
    true
}

@test "Test 2" {
    echo "This test fails"
    false
}

Let’s save this in a file called simple.bats and run it:

$ bats simple.bats
simple.bats
 ✓ Test 1
 ✗ Test 2
   (in test file simple.bats, line 11)
     `false' failed
   This test fails

2 tests, 1 failure

Running a BATS test case is pretty much like running a shell script with the errexit (set -e) option: it continues running until a command returns a non-zero exit code, then aborts. One neat feature of BATS is that we immediately get a clear indication of which test case and on which line the error occurred, as well as the log output of that particular test case. By default, the log output of passing test cases is omitted to reduce visual clutter (see --show-output-of-passing-tests).

BATS has a TAP-compliant output mode so its output can easily be parsed by other tools:

$ bats --tap simple.bats
1..2
ok 1 Test 1
not ok 2 Test 2
# (in test file simple.bats, line 11)
#   `false' failed
# This test fails

Two other useful options are --timing (which shows how long each test case took) and --trace (which prints the commands that ran, like set -x):

$ bats --trace --timing simple.bats
simple.bats
 ✓ Test 1 [2000]
 ✗ Test 2 [0]
   (in test file simple.bats, line 11)
     `false' failed
   $ [simple.bats, line 10]
   $ echo "This test fails"
   This test fails
   $ false

2 tests, 1 failure in 3 seconds

#  Writing the test suite

Enough of the playground examples, let’s start developing a real-world test suite. As I mentioned in the introduction, the test suite for restic-k8s should simulate the end-user workflow for installing and using the component in a Kubernetes cluster. In particular, this includes installing the Helm chart, creating backups and deleting backups.

We can translate this into a BATS test suite as follows:

#!/usr/bin/env bats

NAMESPACE=restic-k8s-tests

@test "Deploy Helm chart" {

}

@test "Create PVCs and data" {

}

@test "Run backup job" {

}

Let’s fill in some details for the test cases. The “Create PVCs and data” test case focuses on setting up the prerequisite Kubernetes resources that we will use later on during the test suite (of course, this requires the tests to be executed sequentially; I’ll demonstrate a more reliable method later). Note that I’m not checking any outputs or exit codes here, but rather rely on the fact that BATS aborts the test when any of the commands fails (like a shell script with set -e).

@test "Create PVCs and data" {
    APP_NAME="app-1"
    # create namespace, PVC and deployment
    cat "fixtures/app.yaml" | APP_NAME="${APP_NAME}" envsubst | kubectl create -f -
    # wait for the pod of the deployment to be running so we can execute commands
    # when the deployment is available, the PVC has also been provisioned
    kubectl wait --for=condition=Available=true -n "${APP_NAME}" "deployment/${APP_NAME}" --timeout=60s
    # write some data to the PVC
    cat "fixtures/index.html" | APP_NAME="${APP_NAME}" envsubst | kubectl exec -i -n "${APP_NAME}" "deployment/${APP_NAME}" -- dd of=/usr/local/apache2/htdocs/index.html
    kubectl exec -n "${APP_NAME}" "deployment/${APP_NAME}" -- dd if=/dev/random of=/usr/local/apache2/htdocs/data1.dat bs=1M count=100
}

Next, we’ll deploy the Helm chart of restic-k8s with a bit of configuration. Thanks to Bash here documents (heredocs), all of the commands and config can be kept within the same file, which I’m a fan of because it improves readability (this concept is called code locality).

@test "Deploy Helm chart" {
    kubectl create namespace "$NAMESPACE"
    helm upgrade --install -n "$NAMESPACE" -f - restic-k8s ../chart <<EOF
image:
  tag: ${IMAGE_TAG}
restic:
  config:
    RESTIC_REPOSITORY: "s3:minio.minio.svc.cluster.local:9000/my-backup-bucket"
    RESTIC_PASSWORD: "foo.bar.baz"
    AWS_ACCESS_KEY_ID: "minio-admin"
    AWS_SECRET_ACCESS_KEY: "minio-hunter2"
debugToolbox:
  enabled: true
EOF

    # smoke tests
    run kubectl -n "$NAMESPACE" get cronjob/restic-k8s-backup -o yaml
    assert_success
    assert_output --partial "image: ghcr.io/jacksgt/restic-k8s:${IMAGE_TAG}"

    run kubectl -n "$NAMESPACE" get pod
    assert_success
    assert_output --regexp "restic-k8s-toolbox-(.+)([[:space:]]+)(1\/1)([[:space:]]+)Running"
}

At the end I have added some smoke tests that check a few basic properties of the Helm release. This also showcases one of the helper libraries that make BATS tests easier to read and write: instead of having to capture the log output and exit code of the command (in this case: kubectl) ourselves, we use the built-in run function, which takes care of that. Then we can use assert_success and assert_output (alongside many other functions) from the bats-assert library to run checks against it (as opposed to checking $?, grepping through the output, etc.).
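One thing to keep in mind: run ships with BATS itself, but assert_success and assert_output come from the helper libraries, which must be loaded at the top of the test file. A minimal preamble could look like this (the paths assume the submodule layout from the installation section, relative to the .bats file; load automatically appends the .bash extension):

```shell
#!/usr/bin/env bats

# bats-assert builds on bats-support, so load that first
load "bats-support/load"
load "bats-assert/load"
# bats-detik keeps its entry point under lib/
load "bats-detik/lib/detik"
```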

Furthermore, for this particular case there is an even simpler alternative: bats-detik. It comes with a mini DSL that allows us to check the properties of Kubernetes resources directly:

    # smoke tests
    DETIK_CLIENT_NAME="kubectl" # tell bats-detik which CLI to invoke
    DETIK_CLIENT_NAMESPACE="${NAMESPACE}"
    verify "there is 1 cronjob named 'restic-k8s-backup'"
    verify "'spec.jobTemplate.spec.template.spec.containers[0].image' is 'ghcr.io/jacksgt/restic-k8s:${IMAGE_TAG}' for cronjob named 'restic-k8s-backup'"

    try "at most 10 times every 5s to find 1 pods named 'restic-k8s-toolbox-.+' with 'status.phase' being 'Running'"

Unfortunately, at times it can be a bit difficult to figure out the correct syntax for the DETIK verify and try commands, hence I sometimes prefer to stick to kubectl, jq and assert.

I should mention here that the regular expressions used in helper libraries such as bats-assert and bats-detik are POSIX Extended Regular Expressions (ERE), which are subtly different from the PCRE-style regexes you’re probably used to, so it’s a good idea to always double-check the regexes to keep your sanity (see also: regular expression matching).
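To illustrate one of the pitfalls: PCRE shortcuts such as \d are not part of ERE, while bracket expressions and POSIX character classes are. A quick demo in plain Bash, whose [[ =~ ]] operator also uses ERE:

```shell
#!/usr/bin/env bash

# POSIX ERE has no \d, \s or lazy quantifiers;
# use bracket expressions or [[:digit:]]/[[:space:]] classes instead.
s="run 2024 ok"

# PCRE-style \d is not an ERE digit class (it matches a literal 'd' at best):
if [[ "$s" =~ \d+ ]]; then
    echo "matched: ${BASH_REMATCH[0]}"
else
    echo "no match for \\d+"
fi

# the ERE way works as expected:
if [[ "$s" =~ [0-9]+ ]]; then
    echo "digits found: ${BASH_REMATCH[0]}"
fi
```

The same translation applies to the patterns passed to assert_output --regexp and to the DETIK pod-name patterns.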

#  Advanced features

At this point the new test suite has some basic actions and checks. Most prominently, it deploys applications into various Kubernetes namespaces. But something should clean up these resources after we are done with our tests, to avoid lingering resources which could interfere with the next test run. For this purpose, BATS offers the teardown and teardown_file functions, which always run, even if one or more test cases fail. The teardown function runs after each test case, whereas the teardown_file function runs after all tests in the file have executed.

Correspondingly, BATS also offers the setup and setup_file functions which run before test cases are executed.

Whether you use the test-specific setup/teardown or the suite-wide setup_file/teardown_file functions depends on whether all test cases are isolated from each other or have dependencies between them. In this particular example, I don’t want to install the Helm chart and example applications before each test case and then uninstall them again afterwards (that would make the suite quite slow), so instead I’m relying on the fact that the test cases will be executed in order from top to bottom. In case you’re interested, BATS also offers the possibility to randomize the order of tests and even parallelize them.
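For completeness, parallel execution is driven by the --jobs flag, which requires GNU parallel to be installed. Something like the following would run our test files concurrently while keeping the order-dependent test cases within each file sequential (a hypothetical invocation; adjust the path to your layout):

```shell
# requires GNU parallel; runs whole files concurrently but keeps the
# test cases inside each file in their original top-to-bottom order
bats --jobs 4 --no-parallelize-within-files e2e-tests/
```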

In any case, it’s best practice to keep these declarations at the top of the file so they are easy to find.

NAMESPACE=restic-k8s-tests

# will be run before each test case
setup() {
    true
}

# will be run at the beginning of the file (before the first test)
setup_file() {
    kubectl create namespace "${NAMESPACE}"
}

# will be run after each test case
teardown() {
    true
}

# will be run after all test cases in the file (after the last test)
teardown_file() {
    kubectl delete --wait=true --ignore-not-found=true namespace "${NAMESPACE}" "app-1"
}

While we are talking about advanced functionality, I should also mention that BATS can automatically retry failed tests. Of course - in theory - tests should always be fully reliable and repeatable, but we all know that in practice this sometimes looks different, especially for end-to-end integration tests that utilize real resources (be it hardware or software). For this purpose, the BATS_TEST_RETRIES variable can be set: it allows each test in a file to be retried up to N times. In addition, BATS_TEST_TIMEOUT can be used to limit the maximum amount of time an individual test is permitted to run. This is useful for preventing scenarios where a test case waits forever for a particular state (e.g. waiting for a Deployment to become available while the Pods are in CrashLoopBackOff).
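Both knobs are plain variables, so they can be set at the top of a .bats file or exported from the environment before invoking bats. A small sketch (the concrete values are arbitrary):

```shell
# one initial attempt plus up to 2 retries for every test in this file
BATS_TEST_RETRIES=2

# abort any single test that runs longer than 5 minutes (value in seconds)
BATS_TEST_TIMEOUT=300
```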

#  More examples

At this point we have a pretty good overview and understanding of BATS' features. Let’s write some more test cases.

@test "Run backup job and verify snapshots are present" {
    DETIK_CLIENT_NAMESPACE="$NAMESPACE"
    # start a new backup job
    kubectl -n "$NAMESPACE" create job test-backup --from=cronjob/restic-k8s-backup
    # job must be running
    try "at most 10 times every 5s to find 1 pod named 'test-backup-.+' with 'status.phase' being 'Running'"
    # wait for job to complete
    try "at most 10 times every 10s to find 1 job named 'test-backup' with 'status.succeeded' being '1'"
    # get job output
    run kubectl -n "$NAMESPACE" logs job/test-backup
    assert_success
    assert_output --partial "created restic repository"

    APP_NAME="app-1"
    assert_output --partial "Backing up PVC ${APP_NAME}/${APP_NAME} with 'hostPath' strategy"
    assert_output --regexp "Pod backup-${APP_NAME}-.+ terminated after .+: Succeeded"


    # make sure we have exactly one snapshot for the PVC
    run kubectl -n "${NAMESPACE}" exec deploy/restic-k8s-toolbox -- restic snapshots --tag namespace=${APP_NAME},persistentvolumeclaim=${APP_NAME} --json
    assert_success
    assert [ $(echo "$output" | jq ". | length") -eq "1" ]
}

 

@test "Run forget job and verify snapshots have been removed" {
    DETIK_CLIENT_NAMESPACE="$NAMESPACE"
    # start a new forget job
    kubectl -n "$NAMESPACE" create job test-forget --from=cronjob/restic-k8s-forget
    # wait for job to complete
    try "at most 10 times every 10s to find 1 job named 'test-forget' with 'status.succeeded' being '1'"
    # get job output
    run kubectl -n "$NAMESPACE" logs job/test-forget
    assert_success
    assert_output --partial "Applying Policy: keep 1 latest snapshots"
    APP_NAME="app-1"
    assert_output --regexp "Pod forget-${APP_NAME}-.+ terminated after .+: Succeeded"
}

Now we run the test suite and check the results:

$ bats 01_basic.bats
01_basic.bats
 ✓ Deploy Helm chart
 ✓ Create PVCs and data
 ✓ Run backup job and verify snapshots are present
 ✓ Run forget job and verify snapshots have been removed

4 tests, 0 failures

Excellent!

If you’d like to see a fully fleshed out test suite, you can take a look at the e2e-tests folder in the restic-k8s repository and the associated CI pipeline.

#  Expert features

In this section I’d like to share some neat tricks that I have acquired over the years while writing BATS tests.

One of them is provenance: when your test suite grows large and deploys many components, it can sometimes be hard to track down which test case created a Pod/Deployment/PersistentVolumeClaim/…

BATS has a number of special variables; in particular, the BATS_TEST_FILENAME and BATS_TEST_NAME environment variables can help us to “mark” resources so we can later figure out at which stage of the test suite they were created. Let’s say we need to create a PVC; we can add the values of these variables as annotations on the resource like this:

# 05_complex.bats

@test "Create PVC" {
    kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  namespace: my-app
  annotations:
    bats-test-file: "${BATS_TEST_FILENAME}"
    bats-test-name: "${BATS_TEST_NAME}"
spec:
  storageClassName: null # use default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

}

The created resource will then look like this (note that BATS_TEST_FILENAME actually contains the absolute path to the file):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  namespace: my-app
  annotations:
    bats-test-file: "05_complex.bats"
    bats-test-name: "Create PVC"
...

I don’t recommend using these variables for the name, namespace or labels of Kubernetes resources, since these fields have pretty strict constraints regarding which characters are allowed.
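With the annotations in place, tracing a resource back to its origin becomes a one-liner. Something along these lines (the resource and namespace names match the hypothetical example above):

```shell
# which test case created this PVC?
kubectl -n my-app get pvc my-pvc \
    -o jsonpath='{.metadata.annotations.bats-test-name}'

# overview: list all PVCs together with the test file that created them
kubectl get pvc --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,TEST-FILE:.metadata.annotations.bats-test-file'
```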


Another useful technique is skipping the teardown functions with a custom variable. As explained in the Advanced features section earlier, BATS allows us to declare teardown (runs after each test case) and teardown_file (runs after all tests in a file) functions. This is generally desired, but sometimes it is necessary to avoid the teardown, for example to troubleshoot why an assertion is failing. For this reason I like to start the teardown functions with the following statements:

teardown() {
    if [ "$CUSTOM_BATS_SKIP_TEARDOWN" = "true" ]; then
        return 0
    fi

    # your teardown code here
}

With this in place, we can skip the teardown functionality by simply running export CUSTOM_BATS_SKIP_TEARDOWN=true before executing BATS.


My last tip for you is to use the --dry-run=server option whenever possible. Consider the following scenario: you have set up an RBAC policy that allows users to create Deployment resources, but not Ingress resources. To test this policy you could run kubectl create deployment ... and kubectl create ingress ..., but this would actually spin up Pods and potentially provision other resources. Instead, use kubectl create --dry-run=server: the request is sent to the Kubernetes API just the same, the RBAC policy is checked, Validating and Mutating Admission Webhooks run, but no resources are actually created. Conveniently, this also means there is less to clean up!

#  Conclusion

In short, BATS is an excellent framework for flexible and maintainable test suites. It significantly reduces the boilerplate you need to write while at the same time retaining the powerful paradigms of shell scripting: expressiveness and text manipulation. Error handling can be a bit tricky with plain shell scripts, but the BATS assert library helps us a great deal to make sure our tests are correct.

We’ve covered quite a lot in this post: we’ve seen how to put together a simple BATS test suite and which output formatting options are available, we’ve explored advanced features of BATS that help us reduce boilerplate (such as setup and teardown functions), and looked at BATS helper libraries that make writing tests and assertions even easier.

While I have focused heavily on testing an application in a Kubernetes environment, I hope my examples made it clear that this is just one particular use case and that BATS can be used for many other scenarios as well.

Also if you are interested in restic-k8s - the backup tool I’m writing that brings restic capabilities to Kubernetes - check it out on GitHub. There you can also find a fully fleshed out BATS test suite if you are looking for more real-world examples.

I hope you enjoyed reading this post and I would love to hear what you’re using BATS for.

Happy testing!