
Kubernetes CronJob Monitoring: Catch Missed Schedules

Matt · 8 min read

The Cluster Was Fine. The CronJob Wasn't.

TL;DR: Kubernetes CronJobs can miss schedules, get stuck in a running state, or be permanently suspended after more than 100 consecutive missed schedules. External dead man's switch monitoring detects these failures because it watches for the absence of success signals rather than relying on Kubernetes internal alerts.

We had a Kubernetes CronJob that ran an ETL pipeline every hour. It worked for months. Then we did a cluster upgrade over a weekend, and the CronJob controller was unavailable for about 20 minutes during the rollout. No big deal, except that nobody had set startingDeadlineSeconds, so the missed jobs were just... gone. The pipeline didn't run for 3 hours, and we didn't find out until Monday when someone noticed the dashboards were stale.

Kubernetes CronJobs have several failure modes that aren't obvious until they hit you. Here's what I've learned from running them in production.

Why Do Kubernetes CronJobs Miss Schedules?

The startingDeadlineSeconds Trap

This field controls how long the controller will wait to start a missed job. If you don't set it at all, missed jobs are never rescheduled; they're just gone. If you set it too low, any minor API server delay causes skips.

There's also a fun edge case: if more than 100 schedules are missed within the deadline window, Kubernetes permanently suspends the CronJob. You have to manually unsuspend it.

# This CronJob silently loses any schedule missed during
# controller downtime, node pressure, or API server lag
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  # No startingDeadlineSeconds = missed jobs are gone
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: myapp/backup:latest
            command: ["/bin/sh", "-c", "pg_dump $DATABASE_URL | gzip > /backups/db.sql.gz"]
          restartPolicy: OnFailure

Node Pressure and Scheduling Failures

The controller creates the Job on time, but the Pod can't get scheduled. Nodes are at capacity, there's a resource crunch, or a node went NotReady. The Pod sits in Pending state. If you're not watching for that specifically, it looks like everything is fine from the CronJob's perspective: the Job exists, it's just not running.
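Resource requests are the usual mitigation: they give the scheduler something concrete to reserve, so the pod either gets placed or fails loudly rather than drifting in Pending. A sketch of the container fragment (the values are assumptions to tune for your workload):

```yaml
# Inside jobTemplate.spec.template.spec: requests let the scheduler place
# the pod deliberately; the memory limit keeps it from starving neighbors.
containers:
- name: backup
  image: myapp/backup:latest
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi
```

To catch the runs that still get stuck, kubectl get pods --field-selector=status.phase=Pending lists exactly the pods this section is about.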

concurrencyPolicy Gotchas

The concurrencyPolicy decides what happens when a new schedule fires while the previous Job is still running:

  • Allow (default): runs multiple Jobs at once. Fine if your job is idempotent, a disaster if it's not.
  • Forbid: skips the new Job entirely. If a previous run is stuck, every subsequent schedule gets silently dropped. This is the one that burned us.
  • Replace: kills the running Job and starts fresh. If the old job was 90% done, tough luck.

With Forbid, a single stuck job can block the schedule for hours or days. Kubernetes doesn't consider this an error. It's working as designed.
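You can bound the damage yourself: activeDeadlineSeconds on the Job spec fails any run that exceeds the deadline, which frees the next schedule under Forbid. A sketch of the relevant fragment (the one-hour deadline is an assumption, not a recommendation):

```yaml
spec:
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      # Fail any run that exceeds an hour so the next schedule can fire
      activeDeadlineSeconds: 3600
```

Pick a deadline comfortably above your worst normal runtime, or you'll trade stuck jobs for killed healthy ones.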

How Do You Add a Dead Man's Switch to a CronJob?

The simplest approach: add a curl at the end of your job command, chained so it only runs on success.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  startingDeadlineSeconds: 3600
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          containers:
          - name: backup
            image: myapp/backup:latest
            command:
            - /bin/sh
            - -c
            - |
              set -euo pipefail
              pg_dump "$DATABASE_URL" | gzip > /backups/db.sql.gz &&
              curl -fsS --retry 3 https://deadping.io/api/ping/your-monitor-id
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          restartPolicy: OnFailure

If pg_dump fails, pipefail marks the pipeline as failed and the && chain skips the curl. If the pod never gets scheduled, the curl never runs. If a previous run is blocking the schedule under Forbid, no new pod runs and no ping arrives. Every failure mode produces one signal: a missing ping.
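You can watch that chain fail locally; in this sketch, false stands in for a failing pg_dump:

```shell
# Mimics the job command: when the first stage of the pipeline fails,
# pipefail fails the whole pipeline and the && chain never sends the ping.
bash -c '
  set -euo pipefail
  false | gzip > /dev/null &&
  echo "ping sent"
' || echo "backup failed, no ping sent"
# prints: backup failed, no ping sent
```

Swap false for true and "ping sent" comes through, which is the entire contract the monitor relies on.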

Storing the Token as a Secret

kubectl create secret generic deadping \
  --from-literal=ping-token=your-monitor-id

# Then reference it in the CronJob spec:
env:
- name: DEADPING_TOKEN
  valueFrom:
    secretKeyRef:
      name: deadping
      key: ping-token
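Putting the two together, the job command can read the token from the environment instead of hardcoding the monitor id in the URL. A sketch of the container fragment (image, paths, and the Secret name are the examples from above; the DATABASE_URL env entry from the earlier spec still applies and is omitted here for brevity):

```yaml
containers:
- name: backup
  image: myapp/backup:latest
  command:
  - /bin/sh
  - -c
  - |
    set -euo pipefail
    pg_dump "$DATABASE_URL" | gzip > /backups/db.sql.gz &&
    curl -fsS --retry 3 "https://deadping.io/api/ping/${DEADPING_TOKEN}"
  env:
  - name: DEADPING_TOKEN
    valueFrom:
      secretKeyRef:
        name: deadping
        key: ping-token
```

Now rotating the monitor id is a Secret update, not an image rebuild.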

What Should Every Production CronJob Have?

After running CronJobs in prod for a few years, here's what I set on every one:

  1. startingDeadlineSeconds: set it. 3600 for hourly jobs, 7200 for daily. Gives the controller room to recover from brief outages.
  2. concurrencyPolicy: Forbid, for most jobs. Then monitor so you know when a stuck run is blocking the schedule.
  3. backoffLimit: 2-3, which retries transient failures without looping forever on persistent ones.
  4. Resource requests and limits, so your pods actually get scheduled under pressure.
  5. External dead man's switch: the curl at the end of the job. This is the one that catches everything else on this list when it fails.
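Taken together, the checklist maps onto a handful of spec fields. A sketch with example values to tune per job (activeDeadlineSeconds is an extra guard that fails a hung run so Forbid can't wedge the schedule; item 5 is the curl from the dead man's switch example above):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-etl
spec:
  schedule: "0 * * * *"
  startingDeadlineSeconds: 3600    # 1: room to recover from brief outages
  concurrencyPolicy: Forbid        # 2: no overlapping runs
  jobTemplate:
    spec:
      backoffLimit: 3              # 3: bounded retries on transient failures
      activeDeadlineSeconds: 1800  # fail a hung run instead of blocking Forbid
      template:
        spec:
          containers:
          - name: etl
            image: myapp/etl:latest
            resources:             # 4: schedulable under node pressure
              requests:
                cpu: 100m
                memory: 256Mi
          restartPolicy: OnFailure
```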

Why Use External Monitoring for Kubernetes CronJobs?

If your monitoring runs inside the same cluster as your CronJobs, and the cluster has problems, your monitoring has the same problems. An external service like DeadPing watches from outside. If your whole cluster goes sideways, you still get the alert. Set a monitor per critical CronJob, set a grace period that accounts for normal runtime variance, and stop worrying about the 15 different ways Kubernetes can silently skip your jobs. See the Kubernetes integration guide for full CronJob spec examples and the API reference for programmatic monitor creation.
