Celery Beat Monitoring: The Dead Man's Switch Pattern
We Didn't Know Beat Was Dead for 3 Days
TL;DR: Celery Beat tasks fail silently when the beat scheduler crashes, Redis memory fills up, task routing breaks after a code deploy, or timezone mismatches cause skipped executions. Adding a dead man's switch ping at the end of each periodic task detects these failures by treating the absence of success as failure.
At my last job we ran about 40 periodic tasks through Celery Beat. Billing syncs, report generation, cache warming, the usual. One Monday morning the finance team asked why the weekend revenue numbers were missing. Turned out Beat had been OOM-killed on Friday evening. The workers were fine, the broker was fine, the entire app was serving traffic normally. Nobody noticed because nothing errored: tasks just quietly stopped being scheduled.
That's the thing about Beat. When it fails, it fails by doing nothing. There's no health check for "are periodic tasks actually being dispatched?"
Why Does Celery Beat Fail Silently?
I've seen four distinct failure modes, and they all share one property: nothing in your monitoring stack catches them unless you explicitly set it up.
- Beat process dies. Gets OOM-killed, crashes, or the container restarts and nobody configured the supervisor properly. Workers sit idle waiting for work that never comes.
- Broker connection drops. Beat keeps ticking internally but tasks never make it to Redis or RabbitMQ. It doesn't retry failed dispatches; it just moves on to the next schedule tick.
- Tasks raise exceptions. Unless you've configured a result backend and something to check it, the error goes nowhere. The task ran, failed, and life goes on.
- Worker pool is exhausted. All prefork workers are stuck on long tasks. New periodic tasks pile up in the queue and execute hours late, or get dropped entirely if you've set a queue TTL.
What Is the Dead Man's Switch Pattern?
The idea is borrowed from industrial equipment: a switch that triggers when the operator stops holding it. Applied to monitoring: your task pings an external endpoint after every successful run. If the ping stops arriving on schedule, you get alerted.
This is fundamentally different from error alerting. You're not trying to detect failures; you're detecting the absence of success. It catches everything: Beat dying, broker disconnects, unhandled exceptions, worker exhaustion. If the task didn't complete successfully, the ping doesn't fire.
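The receiving side can be sketched in a few lines. This is a hypothetical in-memory version of what a dead man's switch service does internally; the names (`record_ping`, `overdue_monitors`, `last_pings`) are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory store: monitor id -> timestamp of the last ping.
last_pings: dict[str, datetime] = {}

def record_ping(monitor_id: str) -> None:
    """Called whenever a task's success ping arrives."""
    last_pings[monitor_id] = datetime.now(timezone.utc)

def overdue_monitors(expected: dict[str, timedelta]) -> list[str]:
    """Return monitors whose last ping is older than their expected interval.
    A monitor that has never pinged is overdue too: absence of success is failure."""
    now = datetime.now(timezone.utc)
    never = datetime.min.replace(tzinfo=timezone.utc)
    return [
        monitor_id
        for monitor_id, interval in expected.items()
        if now - last_pings.get(monitor_id, never) > interval
    ]
```

Note that a monitor nobody has ever pinged shows up as overdue immediately; that's the whole point of the pattern.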
How Do You Add a Dead Man's Switch to Celery Tasks?
The simplest approach: add a requests.get() call after your task logic:
```python
import requests
from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')

@app.task(bind=True, max_retries=3)
def nightly_billing_sync(self):
    try:
        sync_stripe_invoices()
        process_pending_charges()
        # Everything worked: tell DeadPing
        requests.get(
            "https://deadping.io/api/ping/your-billing-monitor-id",
            timeout=10
        )
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)
```

The ping only happens after both sync_stripe_invoices() and process_pending_charges() succeed. If either throws, Celery retries, and the ping never fires until there's a clean run.
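One caveat with the inline version: if the ping itself times out, requests raises, the bare `except Exception` catches it, and the whole task retries, re-running the billing logic because monitoring hiccupped. A small best-effort helper avoids that. This is a sketch; `ping()` is a name introduced here, not part of any library:

```python
import requests

def ping(monitor_id: str) -> None:
    """Best-effort ping: a monitoring outage should never fail the task."""
    try:
        requests.get(
            f"https://deadping.io/api/ping/{monitor_id}",
            timeout=10,
        )
    except requests.RequestException:
        pass  # Swallow network errors so the task doesn't retry over monitoring
```

Call ping("your-billing-monitor-id") as the last line inside the try block; a dead monitoring endpoint then costs you one missed ping, not a duplicate billing run.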
The Signal-Based Approach
If you'd rather not touch every task function, Celery signals work well. Hook into task_success and map task names to monitor IDs:
```python
from celery.signals import task_success
import requests

MONITOR_MAP = {
    "myapp.tasks.nightly_billing_sync": "billing-monitor-id",
    "myapp.tasks.daily_report": "report-monitor-id",
    "myapp.tasks.cleanup_sessions": "cleanup-monitor-id",
}

@task_success.connect
def ping_on_success(sender=None, **kwargs):
    monitor_id = MONITOR_MAP.get(sender.name)
    if monitor_id:
        try:
            requests.get(
                f"https://deadping.io/api/ping/{monitor_id}",
                timeout=10
            )
        except requests.RequestException:
            pass  # Don't fail the task because monitoring is unreachable
```

New periodic task? Add one line to the map. That's it.
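For completeness, here's the schedule side. The keys in MONITOR_MAP are the same dotted task names Beat dispatches, so keeping the schedule and the map side by side in config makes drift easy to spot. A sketch with made-up schedules, assuming app is the Celery instance from earlier:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    "nightly-billing-sync": {
        "task": "myapp.tasks.nightly_billing_sync",
        "schedule": crontab(hour=2, minute=0),  # 02:00 every night
    },
    "daily-report": {
        "task": "myapp.tasks.daily_report",
        "schedule": crontab(hour=6, minute=30),
    },
    "cleanup-sessions": {
        "task": "myapp.tasks.cleanup_sessions",
        "schedule": crontab(minute=0),  # top of every hour
    },
}
```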
Which Celery Tasks Should You Monitor?
You don't need a dead man's switch on every task. Focus on the ones where silent failure costs you something real:
- Billing and payments: missed syncs mean revenue leaks, and back-billing customers a week later is a terrible experience
- Data pipelines: stale dashboards lead to bad decisions that nobody traces back to a failed ETL
- Backups: the classic "we find out it's broken when we need it" scenario
- Cleanup jobs: expired sessions, orphaned files, temp data. These pile up quietly and eventually cause real problems
Why Do Grace Periods Matter for Task Monitoring?
A billing sync that takes 5 minutes on a normal day might take 20 minutes on the first of the month. If your monitoring alerts at the 6-minute mark, your team starts ignoring alerts, which defeats the entire point. Set grace periods per monitor; 10 minutes of slack for a job that runs hourly is usually enough to filter out the noise while still catching real failures fast.
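Concretely, the rule is alert_at = last_run + interval + grace: an hourly job with a 10-minute grace period pages at the 70-minute mark, not the 60-minute mark. A minimal sketch (is_late is a hypothetical name, not a library function):

```python
from datetime import datetime, timedelta, timezone

def is_late(last_ping: datetime, interval: timedelta, grace: timedelta) -> bool:
    """Alert only once a run is later than its interval plus the grace period."""
    return datetime.now(timezone.utc) > last_ping + interval + grace
```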
I built this into DeadPing specifically because I'd been burned by noisy alerting before. You set the schedule, set the grace period, and it only pages you when something is genuinely late. Five minutes to set up per task, and the next time Beat dies at 2am, you'll know within the hour instead of discovering it the following Monday. See the API reference to create monitors programmatically or the CI/CD guide to automate it in your pipeline.
Start monitoring in 60 seconds
Free forever for up to 20 monitors. No credit card required.
Get Started Free