Kubernetes CronJob Monitoring: Catch Missed Schedules
The Cluster Was Fine. The CronJob Wasn't.
TL;DR: Kubernetes CronJobs can miss schedules, get stuck behind a hung run, or stop being scheduled entirely after more than 100 missed start times. An external dead man's switch detects all of these failures because it watches for the absence of a success signal rather than relying on Kubernetes internal alerts.
We had a Kubernetes CronJob that ran an ETL pipeline every hour. It worked for months. Then we did a cluster upgrade over a weekend, and the CronJob controller was unavailable for about 20 minutes during the rollout. No big deal, except nobody had set startingDeadlineSeconds, so the missed jobs were just... gone. The pipeline didn't run for 3 hours, and we didn't find out until Monday when someone noticed the dashboards were stale.
Kubernetes CronJobs have several failure modes that aren't obvious until they hit you. Here's what I've learned from running them in production.
Why Do Kubernetes CronJobs Miss Schedules?
The startingDeadlineSeconds Trap
This field controls how long the controller will wait to start a missed job. If you don't set it at all, missed jobs are never rescheduled; they're just gone. If you set it too low, any minor API server delay causes skips.
There's also a fun edge case: if more than 100 schedules are missed, the controller stops scheduling the CronJob entirely and emits a "too many missed start times" warning event. You have to intervene manually before it will run again.
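You can spot this state from the CronJob's events. A sketch of the check and one way to recover, assuming the nightly-backup CronJob used in the examples below:

```shell
# The controller records a warning event when it gives up;
# look for "too many missed start times" in the Events section
kubectl describe cronjob nightly-backup

# One recovery path: give the controller a bounded window to
# reason about, so the missed-run count stays small
kubectl patch cronjob nightly-backup --type=merge \
  -p '{"spec":{"startingDeadlineSeconds":3600}}'
```

These commands run against a live cluster, so adjust the name and namespace to your setup.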
# This CronJob silently loses any schedule missed during
# controller downtime, node pressure, or API server lag
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  # No startingDeadlineSeconds = missed jobs are gone
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: myapp/backup:latest
              command: ["/bin/sh", "-c", "pg_dump $DATABASE_URL | gzip > /backups/db.sql.gz"]
          restartPolicy: OnFailure
Node Pressure and Scheduling Failures
The controller creates the Job on time, but the Pod can't get scheduled. Nodes are at capacity, there's a resource crunch, or a node went NotReady. The Pod sits in Pending state. If you're not watching for that specifically, everything looks fine from the CronJob's perspective: the Job exists, it's just not running.
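One way to catch this is to look for Pending pods that belong to Jobs; the pod name in the second command is illustrative:

```shell
# Pods created by Jobs that the scheduler couldn't place
kubectl get pods --all-namespaces -l job-name \
  --field-selector=status.phase=Pending

# The Events section of a stuck pod says why (e.g. Insufficient cpu)
kubectl describe pod nightly-backup-29012345-abcde
```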
concurrencyPolicy Gotchas
The concurrencyPolicy decides what happens when a new schedule fires while the previous Job is still running:
- Allow (default): runs multiple Jobs at once. Fine if your job is idempotent, a disaster if it's not.
- Forbid: skips the new Job entirely. If a previous run is stuck, every subsequent schedule gets silently dropped. This is the one that burned us.
- Replace: kills the running Job and starts fresh. If the old job was 90% done, tough luck.
With Forbid, a single stuck job can block the schedule for hours or days. Kubernetes doesn't consider this an error. It's working as designed.
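Checking for this is cheap: .status.active lists the Jobs the controller still considers running. The Job name below is illustrative:

```shell
# A Job sitting in .status.active far longer than the normal
# runtime is what's blocking the schedule under Forbid
kubectl get cronjob nightly-backup -o jsonpath='{.status.active}'

# Deleting the stuck Job unblocks the next scheduled run
kubectl delete job nightly-backup-29012345
```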
How Do You Add a Dead Man's Switch to a CronJob?
The simplest approach: add a curl at the end of your job command, chained so it only runs on success.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  startingDeadlineSeconds: 3600
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 3
      template:
        spec:
          containers:
            - name: backup
              image: myapp/backup:latest
              command:
                - /bin/bash  # pipefail below requires bash (or a compatible shell) in the image
                - -c
                - |
                  set -euo pipefail
                  pg_dump "$DATABASE_URL" | gzip > /backups/db.sql.gz &&
                  curl -fsS --retry 3 https://deadping.io/api/ping/your-monitor-id
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
          restartPolicy: OnFailure
If pg_dump fails, set -o pipefail fails the pipeline, the && short-circuits, and the script exits before the curl. If the pod never gets scheduled, the curl never runs. If a previous run is blocking the schedule under Forbid, no new pod runs and no ping arrives. Every failure mode produces one signal: a missing ping.
Storing the Token as a Secret
kubectl create secret generic deadping \
  --from-literal=ping-token=your-monitor-id

# Then reference it in the CronJob spec:
env:
  - name: DEADPING_TOKEN
    valueFrom:
      secretKeyRef:
        name: deadping
        key: ping-token
# ...and ping with it instead of hardcoding the id:
# curl -fsS --retry 3 "https://deadping.io/api/ping/$DEADPING_TOKEN"
What Should Every Production CronJob Have?
After running CronJobs in prod for a few years, here's what I set on every one:
- startingDeadlineSeconds: set it. 3600 for hourly jobs, 7200 for daily. Gives the controller room to recover from brief outages.
- concurrencyPolicy: Forbid for most jobs. Then monitor so you know when a stuck run is blocking the schedule.
- backoffLimit: 2-3: retries transient failures without looping forever on persistent ones.
- Resource requests and limits, so your pods actually get scheduled under pressure.
- External dead man's switch: the curl at the end of the job. This is the one that catches everything else on this list when it fails.
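Put together, a baseline spec with those defaults might look like this; the name, image, and resource numbers are illustrative, not a recommendation:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-etl                  # illustrative name
spec:
  schedule: "0 * * * *"
  startingDeadlineSeconds: 3600     # room to recover from brief outages
  concurrencyPolicy: Forbid         # pair with external monitoring
  jobTemplate:
    spec:
      backoffLimit: 2               # retry transient failures, don't loop forever
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etl
              image: myapp/etl:latest
              resources:            # so the pod gets scheduled under pressure
                requests:
                  cpu: 250m
                  memory: 256Mi
                limits:
                  cpu: "1"
                  memory: 512Mi
```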
Why Use External Monitoring for Kubernetes CronJobs?
If your monitoring runs inside the same cluster as your CronJobs, and the cluster has problems, your monitoring has the same problems. An external service like DeadPing watches from outside. If your whole cluster goes sideways, you still get the alert. Set a monitor per critical CronJob, set a grace period that accounts for normal runtime variance, and stop worrying about the 15 different ways Kubernetes can silently skip your jobs. See the Kubernetes integration guide for full CronJob spec examples and API reference for programmatic monitor creation.