Nathan Anderson

March 29, 2021

Kubernetes CronJob Freshness Monitoring with Prometheus

At Basecamp, we've used cron jobs for years to handle various parts of our workloads with great success. Moving apps to the cloud doesn't change how useful cron jobs are; it just adds some challenges if we want to ensure we're monitoring them correctly. We monitor a couple of aspects of our CronJobs, primarily Errors and Freshness.

Errors is easy: if a job doesn't complete without errors, we throw an alert. These jobs don't change much, so an error usually means we need to poke at some bit.
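
We won't dig into the error side here, but for flavor, one rough way to wire that up is with the kube_job_status_failed metric from kube-state-metrics. The alert name, "for" duration, and severity below are placeholders, not our actual rule:

groups:
  - name: cronjob-errors
    rules:
      - alert: JobHasFailedPods
        expr: kube_job_status_failed > 0   # Job has at least one pod that reached phase Failed
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} in {{ $labels.namespace }} has failed pods"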

Freshness is a bit more difficult. We have an existing system that we've used with Nagios for years, and it would've been a cinch to bring that system over into Kubernetes, but when in Rome, you use Prometheus.

At a high level, how do we know when a CronJob is stale, or "not-fresh"? Well, when it hasn't run within the last interval defined for that CronJob.

Crons are usually scheduled with a format like this (snippet from Wikipedia):

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │                                   7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * <command to execute>

It works fine™, and we can define jobs for every minute (* * * * *), every 5 minutes (*/5 * * * *), once per day (0 0 * * *), or once per month (0 0 1 * *). It works, but how archaic!
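
In Kubernetes, that same schedule string goes right into the CronJob spec. Here's a minimal, made-up example (the name, image, and command are hypothetical, and you'd use batch/v1beta1 on clusters older than 1.21):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-app-cron-email-worker
spec:
  schedule: "* * * * *"               # run every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: email-worker
              image: my-app:latest              # hypothetical image
              command: ["bin/deliver-email"]    # hypothetical command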

So if we need to know when our job is scheduled to run next, we can just refer to this schedule! Oh, uh.. Hmm. Calculating the schedule for every-minute/hour/day jobs is pretty easy, but parsing this fully looks to be a right pain in the butt. Luckily, we don't have to. Kubernetes has already done that for us, and kube-state-metrics exposes the results as metrics that Prometheus scrapes:

  • kube_cronjob_next_schedule_time - When a CronJob is scheduled to run next
  • kube_cronjob_status_last_schedule_time - When a CronJob was scheduled to run last
  • kube_job_status_completion_time - When a Job finished
  • kube_job_status_start_time - When a Job started
  • kube_cronjob_status_active - Is a CronJob's Job running right now?

So, if we take "kube_cronjob_next_schedule_time" and subtract "kube_cronjob_status_last_schedule_time", we get an interval, in seconds, of the time between scheduled runs:

max by(namespace, cronjob) (
    kube_cronjob_next_schedule_time - 
    kube_cronjob_status_last_schedule_time
)

This gives us:

{cronjob="my-app-cron-email-worker",namespace="my-app-staging"}         60
{cronjob="my-app-cron-queue-deferred-jobs",namespace="my-app-staging"}  60
{cronjob="my-app-cron-solr-indexing",namespace="my-app-staging"}       300

So, now that we know their interval, let's figure out when the most recent one ran. We'll join "kube_job_status_start_time" against "kube_job_owner" on "job_name", filtering to owner_kind="CronJob" so we're only looking at Jobs that belong to CronJobs, and pull "owner_name" across with group_left to tie each Job back to its CronJob:

max by (namespace, owner_name) (
  kube_job_status_start_time * 
    on (job_name) group_left(owner_name) 
    max by (namespace, owner_name, job_name) (
      kube_job_owner{owner_kind="CronJob"})
)

This gives us:

{namespace="my-app-staging",owner_name="my-app-cron-email-worker"}         1616853876
{namespace="my-app-staging",owner_name="my-app-cron-queue-deferred-jobs"}  1616853910
{namespace="my-app-staging",owner_name="my-app-cron-search-indexing"}      1616853912

And now we'll subtract these from the current time to figure out how many seconds ago they ran, and round it for some clean numbers:

round(
  time() - max by (namespace, owner_name) (
    kube_job_status_start_time * 
      on (job_name) group_left(owner_name) 
      max by (namespace, owner_name, job_name) (
        kube_job_owner{owner_kind="CronJob"})
  )
)

This gets us:

{namespace="my-app-staging",owner_name="my-app-cron-email-worker"}         157
{namespace="my-app-staging",owner_name="my-app-cron-queue-deferred-jobs"}  117
{namespace="my-app-staging",owner_name="my-app-cron-solr-indexing"}        298

This looks pretty good, but what's going on here? We're seeing that our every-minute crons are showing more than 60s between when we ran the query and when their last Job started! Enter the Scrape Interval (60s, in our case), and a 15% fudge factor:

round( 
  (time() - max by (namespace, owner_name) (
    kube_job_status_start_time * 
      on (job_name) group_left(owner_name) 
      max by (namespace, owner_name, job_name) (
        kube_job_owner{owner_kind="CronJob"}))
  ) * 0.85 - 60
)

Since we're comparing to the current time via time() (yes, it's not precisely the current time, I know), the data we're looking at could be delayed by up to our Scrape Interval. The fudge factor is a number pulled out of a hat to reduce false positives. Reducing this uncertainty is on my todo list!

Now we're getting:

{namespace="my-app-staging",owner_name="my-app-cron-solr-indexing"}         54
{namespace="my-app-staging",owner_name="my-app-cron-email-worker"}         405
{namespace="my-app-staging",owner_name="my-app-cron-queue-deferred-jobs"}    2

And if we refresh this query in the console, we can see the numbers tick up until the cron runs again. NICE. However, the labels don't line up with our interval query: "owner_name" is not "cronjob", so let's fix that with a quick 2-step of "max without(owner_name)" and "label_replace":

max without(owner_name) (
  label_replace(
    round( 
      (time() - max by (namespace, owner_name) (
          kube_job_status_start_time * 
            on (job_name) group_left(owner_name) 
            max by (namespace, owner_name, job_name) (
              kube_job_owner{owner_kind="CronJob"}))
      ) * 0.85 - 60
    )
  , "cronjob", "$1", "owner_name", "(.*)")
)

So, this is pretty great:

{cronjob="my-app-cron-solr-indexing",namespace="my-app-staging"}         54
{cronjob="my-app-cron-email-worker",namespace="my-app-staging"}         405
{cronjob="my-app-cron-queue-deferred-jobs",namespace="my-app-staging"}    2

Now we just need to compare these values to our intervals that we calculated earlier, and we'll get a list of all CronJobs where the seconds-since-last-started is greater than the CronJob's interval:

max by(namespace, cronjob) (
  kube_cronjob_next_schedule_time - 
  kube_cronjob_status_last_schedule_time
) < max without(owner_name) (
  label_replace(
    round( (time() - max by (namespace, owner_name) (
          kube_job_status_start_time * 
            on (job_name) group_left(owner_name) 
            max by (namespace, owner_name, job_name) (
              kube_job_owner{owner_kind="CronJob"}))
      ) * 0.85 - 60
    )
  , "cronjob", "$1", "owner_name", "(.*)")
)

Um. Turns out I'm still getting a result for one of my crons!

{cronjob="my-app-cron-email-worker",namespace="my-app-staging"}   60

Well, it turns out, this CronJob happens to be scheduled every minute, but has a multi-minute run-time. We're using concurrencyPolicy to ensure we only run one(ish) of these at a time. So, this particular CronJob will be alerting most of the time! UGH. 
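
For reference, the schedule-and-concurrency part of that CronJob's spec looks something like the sketch below. Forbid is one way to get "one(ish) at a time" (Replace is another), so take the exact value as illustrative:

spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid   # don't start a new Job while the previous one is still running
  jobTemplate:
    # ...the usual Job spec...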

Hmm, but that means it's alerting while it's currently running, and that's something we can query, too, with "kube_cronjob_status_active"! So, let's add THAT to the mix, making sure we only bring over the labels we want by using "max by (namespace,cronjob)":

max by(namespace, cronjob) (
  kube_cronjob_next_schedule_time - 
  kube_cronjob_status_last_schedule_time
) < max without(owner_name) (
  label_replace(
    round( 
      (time() - max by (namespace, owner_name) (
          kube_job_status_start_time * 
            on (job_name) group_left(owner_name) 
            max by (namespace, owner_name, job_name) (
              kube_job_owner{owner_kind="CronJob"}))
      ) * 0.85 - 60
    )
  , "cronjob", "$1", "owner_name", "(.*)")
) unless max by (namespace,cronjob) (kube_cronjob_status_active)

And there you have it. A mega expression that will, I think, detect CronJobs that haven't run within their (last period + 15%) seconds.
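
If you want to hang an actual alert off of it, a rule sketch might look like this. The alert name, "for" duration, and severity are placeholders; the expression is the one above, verbatim:

groups:
  - name: cronjob-freshness
    rules:
      - alert: CronJobNotFresh
        expr: |
          max by(namespace, cronjob) (
            kube_cronjob_next_schedule_time -
            kube_cronjob_status_last_schedule_time
          ) < max without(owner_name) (
            label_replace(
              round(
                (time() - max by (namespace, owner_name) (
                    kube_job_status_start_time *
                      on (job_name) group_left(owner_name)
                      max by (namespace, owner_name, job_name) (
                        kube_job_owner{owner_kind="CronJob"}))
                ) * 0.85 - 60
              )
            , "cronjob", "$1", "owner_name", "(.*)")
          ) unless max by (namespace,cronjob) (kube_cronjob_status_active)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CronJob {{ $labels.cronjob }} in {{ $labels.namespace }} hasn't run on schedule"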

Can we make this simpler? Probably. Is it 100% correct? Probably not. And I don't know enough about Prometheus internals to say whether or not I'm going to trigger some pathological performance penalty by using this yet.

Some ideas to make it simpler and/or better that I haven't looked into yet:

  • Turn the CronJob Interval calculation into a record, cronjob_interval
  • Turn the Seconds Since Last Run calculation into a record, cronjob_last_run (a rough sketch of both records follows this list)
  • Instead of using a fudge factor, incorporate a calculation based on max job runtime, or 2x average job runtime, or something.
  • Add an alert if a CronJob takes x% longer than the max recorded runtime, which might indicate a problem
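
To make the first two bullets a bit more concrete, here's a rough sketch of what those recording rules could look like (the group name is arbitrary, and the fudge factor / scrape-interval allowance would then move into the alert expression rather than the record):

groups:
  - name: cronjob-freshness-records
    rules:
      # Seconds between scheduled runs, per CronJob
      - record: cronjob_interval
        expr: |
          max by(namespace, cronjob) (
            kube_cronjob_next_schedule_time -
            kube_cronjob_status_last_schedule_time
          )
      # Seconds since the most recent Job owned by each CronJob started
      - record: cronjob_last_run
        expr: |
          max without(owner_name) (
            label_replace(
              time() - max by (namespace, owner_name) (
                kube_job_status_start_time *
                  on (job_name) group_left(owner_name)
                  max by (namespace, owner_name, job_name) (
                    kube_job_owner{owner_kind="CronJob"})
              )
            , "cronjob", "$1", "owner_name", "(.*)")
          )

With those in place, the freshness check becomes roughly cronjob_last_run * 0.85 - 60 > cronjob_interval, unless the CronJob is currently active.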