Analyzing job durations of systemd services

Every night I get a message after the backup job on my server has finished. I was wondering whether it’s just my imagination or whether these jobs are actually taking longer and longer. Since the backup job is a systemd oneshot service (triggered by a systemd timer), I figured there must be an easy way to find out. And indeed there is! Thanks to the journal facility I have all the logs of the last two years.

You can view all the logs for a given unit like this:

journalctl -u restic-backup.service

So I wrote a small Python helper script to parse those logs and find the outliers. It leverages the fact that systemd logs a message at the beginning and at the end of each unit invocation, which shows up in the journal like this:

Jan 29 01:32:11 02.ht.cubieserver.de systemd[1]: Starting $service_description...
Jan 29 03:14:47 02.ht.cubieserver.de systemd[1]: Started $service_description.
# in case of failure:
Mar 26 12:13:28 02.ht.cubieserver.de systemd[1]: Failed to start $service_description.

 

#!/usr/bin/env python3

import math
import subprocess
from datetime import datetime, timedelta

UNIT_NAME = "restic-backup.service"

proc = subprocess.run(["journalctl", "-u", UNIT_NAME, "-o", "short-iso"],
                      capture_output=True, text=True, check=True)
journal_entries = proc.stdout.splitlines()

# 'Start' matches both the "Starting ..." and the "Started ..." messages,
# so each successful run contributes a begin entry and an end entry
start_stop_entries = [line for line in journal_entries
                      if 'systemd[1]: Start' in line
                      or 'systemd[1]: Failed' in line]

days = []
# pair up consecutive begin/end entries; a trailing unmatched entry is dropped
for i in range(0, len(start_stop_entries) - 1, 2):
    start_line = start_stop_entries[i].split()
    stop_line = start_stop_entries[i + 1].split()

    # example time format: 2019-03-03T01:39:22+0100 -- strip the UTC offset,
    # which older Pythons' fromisoformat() cannot parse
    start_time = datetime.fromisoformat(start_line[0].split('+')[0])
    stop_time = datetime.fromisoformat(stop_line[0].split('+')[0])

    print("%s: %sh" % (start_time, stop_time - start_time))
    days.append({
        "date": start_time,
        "duration": stop_time - start_time,
    })

durations = [day["duration"] for day in days]
mean_duration = sum(durations, timedelta()) / len(durations)
print("Average job duration:", mean_duration)

# population variance of the durations, in minutes
var_duration = sum((d.total_seconds() / 60 - mean_duration.total_seconds() / 60) ** 2
                   for d in durations) / len(durations)
print("Standard deviation:", math.sqrt(var_duration))  # in minutes

days_sorted = sorted(days, key=lambda day: day["duration"])
print("Shortest duration: %s (%s)" % (days_sorted[0]["duration"], days_sorted[0]["date"]))
print("Longest duration: %s (%s)" % (days_sorted[-1]["duration"], days_sorted[-1]["date"]))

Running it yields output like this:

2021-01-24 01:31:00: 0:02:41h
2021-01-25 01:30:52: 1:25:14h
2021-01-26 01:31:08: 1:23:49h
2021-01-27 01:32:14: 2:15:01h
2021-01-28 01:31:12: 3:52:09h
2021-01-29 01:32:11: 1:42:36h
2021-01-30 01:31:22: 0:03:37h
Average job duration: 0:33:54.763791
Standard deviation: 57.04496154553041
Shortest duration: 0:00:00 (2020-09-14 16:44:46)
Longest duration: 21:25:35 (2019-07-10 01:31:52)
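As a side note, the standard deviation printed above is in minutes, and the manual computation is a population variance (dividing by the number of samples). The standard library’s statistics module can do the same thing; here is a quick equivalence check on made-up durations:

```python
import math
import statistics
from datetime import timedelta

# made-up example durations, roughly in the range of the output above
durations = [timedelta(minutes=m) for m in (3, 85, 84, 135, 232, 103)]
minutes = [d.total_seconds() / 60 for d in durations]

# manual population variance, as in the script
mean = sum(minutes) / len(minutes)
var = sum((m - mean) ** 2 for m in minutes) / len(minutes)

# statistics.pstdev computes the population standard deviation,
# matching the division by len(durations) above
assert math.isclose(math.sqrt(var), statistics.pstdev(minutes))
print(statistics.pstdev(minutes))
```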

In case you have data accuracy issues (e.g. because the system rebooted while the backup was running, leaving a start entry without a matching end entry), you will need to perform some manual cleaning of the journalctl output before feeding it to the Python script, or skip those entries with a regex in the loop.
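One way to make the pairing robust is a small state machine that only keeps well-formed begin/end pairs and drops dangling start entries. A minimal sketch (the `pair_entries` helper and the sample lines are illustrative; the message patterns assume the default systemd log format):

```python
import re

# match the begin and end messages separately instead of a single 'Start' substring
START_RE = re.compile(r'systemd\[1\]: Starting ')
END_RE = re.compile(r'systemd\[1\]: (Started|Failed to start) ')

def pair_entries(lines):
    """Return (start_line, end_line) tuples, dropping starts without an end."""
    pairs = []
    pending_start = None
    for line in lines:
        if START_RE.search(line):
            # a new run begins; any previous dangling start is discarded
            pending_start = line
        elif END_RE.search(line) and pending_start is not None:
            pairs.append((pending_start, line))
            pending_start = None
    return pairs

sample = [
    "2021-01-24T01:31:00+0100 host systemd[1]: Starting Restic backup...",
    "2021-01-24T01:33:41+0100 host systemd[1]: Started Restic backup.",
    "2021-01-25T01:30:52+0100 host systemd[1]: Starting Restic backup...",
    # reboot here: no end entry was ever logged for the Jan 25 run
    "2021-01-26T01:31:08+0100 host systemd[1]: Starting Restic backup...",
    "2021-01-26T02:54:57+0100 host systemd[1]: Started Restic backup.",
]
print(len(pair_entries(sample)))  # the dangling Jan 25 start is dropped
```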

For the long term, it would be nice to feed this data into a Prometheus / Grafana setup, but that’s a job for another day.
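One low-effort way to get there would be node_exporter’s textfile collector: have the script render the last job duration in Prometheus’ text exposition format and write it to a `.prom` file. A sketch (the metric name, `render_metric` helper, and target path are assumptions, not an established convention):

```python
from datetime import timedelta

def render_metric(unit, duration):
    # Prometheus text exposition format: HELP/TYPE header plus one gauge sample
    return (
        "# HELP systemd_job_duration_seconds Duration of the last unit invocation.\n"
        "# TYPE systemd_job_duration_seconds gauge\n"
        'systemd_job_duration_seconds{unit="%s"} %f\n' % (unit, duration.total_seconds())
    )

metric = render_metric("restic-backup.service", timedelta(hours=1, minutes=25, seconds=14))
print(metric)

# To expose it, write the string to node_exporter's textfile collector
# directory (path depends on your setup), e.g.:
#   with open("/var/lib/node_exporter/textfile/backup_duration.prom", "w") as f:
#       f.write(metric)
```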