Analyzing job durations of systemd services
Every night I’m getting a message after the backup job on my server has finished. I was wondering if it’s just my imagination or if these jobs are actually starting to take longer and longer. Since the backup job is a systemd oneshot service (which is triggered through a systemd timer), I thought there must be an easy way to find this out. And indeed there is! Thanks to the journal facility I have all the logs of the last two years.
You can view all the journal entries for a given service like this:
journalctl -u restic-backup.service
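One caveat worth knowing before parsing: journalctl's default "short" output format omits the year, which is ambiguous when the logs span two years. Switching the output format avoids that (the date range below is purely illustrative):

```shell
# "short-iso" timestamps include the year, unlike the default "short" format
journalctl -u restic-backup.service -o short-iso

# optionally restrict the range to keep the output manageable
journalctl -u restic-backup.service -o short-iso --since "2021-01-24" --until "2021-01-31"
```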
So I wrote a Python helper script to parse those logs and find the outliers. It leverages the fact that systemd logs a message before and after each unit invocation, which journalctl prints like this:
Jan 29 01:32:11 02.ht.cubieserver.de systemd[1]: Starting $service_description...
Jan 29 03:14:47 02.ht.cubieserver.de systemd[1]: Started $service_description.
Jan 30 01:31:22 02.ht.cubieserver.de systemd[1]: Starting $service_description...
Jan 30 01:34:59 02.ht.cubieserver.de systemd[1]: Started $service_description.
# in case of failure:
Mar 26 12:13:28 02.ht.cubieserver.de systemd[1]: Failed to start $service_description.
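The original script isn't reproduced here, but a minimal sketch of the idea — pair each "Starting" message with the next completion message and compute statistics over the gaps — could look like this (all names and details below are assumptions, not the author's code):

```python
"""Sketch: compute systemd oneshot job durations from journalctl output.

Pipe the journal in and call report(parse(sys.stdin)), e.g.:
    journalctl -u restic-backup.service | python3 job_durations.py
"""
import re
import statistics
import sys
from datetime import datetime, timedelta

# journalctl "short" format: "Jan 29 03:14:47 host systemd[1]: Started ..."
LINE_RE = re.compile(
    r"^(\w{3} [ \d]\d \d\d:\d\d:\d\d) \S+ systemd\[1\]: "
    r"(Starting|Started|Finished|Failed to start) "
)

def parse(lines, year=2021):
    """Pair each 'Starting' message with the next completion message.

    The "short" format omits the year, so one has to be assumed here
    (or use `journalctl -o short-iso` and adjust the regex).
    """
    durations = []  # (start time, duration) per successful run
    start = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%b %d %H:%M:%S")
        stamp = stamp.replace(year=year)
        verb = match.group(2)
        if verb == "Starting":
            start = stamp
        elif verb in ("Started", "Finished") and start is not None:
            durations.append((start, stamp - start))
            start = None
        else:  # "Failed to start": discard the unfinished run
            start = None
    return durations

def report(durations):
    if not durations:
        print("No completed runs found")
        return
    for start, duration in durations:
        print(f"{start}: {duration}h")
    seconds = [d.total_seconds() for _, d in durations]
    print("Average job duration:", timedelta(seconds=statistics.mean(seconds)))
    if len(seconds) > 1:
        print("Standard deviation:", statistics.stdev(seconds) / 60, "minutes")
    shortest = min(durations, key=lambda run: run[1])
    longest = max(durations, key=lambda run: run[1])
    print(f"Shortest duration: {shortest[1]} ({shortest[0]})")
    print(f"Longest duration: {longest[1]} ({longest[0]})")
```

Newer systemd versions log "Finished" instead of "Started" for oneshot completions, which is why the regex accepts both.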
Running the script over these logs yields output like this:
2021-01-24 01:31:00: 0:02:41h
2021-01-25 01:30:52: 1:25:14h
2021-01-26 01:31:08: 1:23:49h
2021-01-27 01:32:14: 2:15:01h
2021-01-28 01:31:12: 3:52:09h
2021-01-29 01:32:11: 1:42:36h
2021-01-30 01:31:22: 0:03:37h
Average job duration: 0:33:54.763791
Standard deviation: 57.04496154553041
Shortest duration: 0:00:00 (2020-09-14 16:44:46)
Longest duration: 21:25:35 (2019-07-10 01:31:52)
In case you have data-accuracy issues (e.g. because the system rebooted while the backup was running), you will need to clean the journalctl output manually before feeding it to the Python script, or simply skip those entries with a regex in the loop.
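If you'd rather not edit the raw journal output, a simple alternative to a line-matching regex is to filter the parsed results by duration after the fact; the bounds below are illustrative assumptions, not values from the original script:

```python
from datetime import timedelta

def drop_outliers(runs, shortest=timedelta(minutes=1), longest=timedelta(hours=12)):
    """Keep only (start, duration) pairs within a plausible range.

    Runs of 0:00:00 (an instant failure) or of many hours (a run that
    straddled a reboot) are discarded before computing statistics.
    """
    return [(start, d) for start, d in runs if shortest <= d <= longest]
```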
For the long term, it would be nice to feed this data into a Prometheus / Grafana setup, but that’s a job for another day.