
Airflow DAG Monitoring with Dead Man's Switches

Matt · 8 min read

The DAG That Ran on Time but Didn't Actually Run

TL;DR: Airflow DAGs fail silently due to scheduler lag, pool exhaustion, zombie tasks, SLA misses, and upstream dependency failures. Because Airflow only monitors what happens inside its own scheduler, adding an external dead man's switch ping at the end of critical DAGs catches failures that Airflow itself cannot detect.

I spent a painful week last year debugging a data pipeline that looked perfectly healthy in the Airflow UI. Green checkmarks everywhere. The DAG ran on schedule, and every task showed success. Except the final load task was silently writing zero rows because an upstream API had changed its response format. The task didn't error: it parsed the empty result set, loaded nothing, and marked itself complete. We served stale analytics dashboards for six days before a product manager noticed the numbers looked "off."

Airflow is powerful, but its built-in monitoring is focused on task execution, not task outcomes. Here's how to add external dead man's switch monitoring that catches the failures Airflow won't.

Why Does Airflow Miss Silent Failures?

Airflow's scheduler and executor are designed to run tasks and track their exit codes. A task that exits 0 is a success, full stop. But there are several categories of failure that exit 0:

  • Scheduler lag. The scheduler scans DAGs on an interval. Under heavy load or with hundreds of DAGs, the scan loop falls behind. Your hourly DAG might actually run every 90 minutes. Airflow logs the delay but doesn't alert on it by default.
  • Pool exhaustion. If all slots in a pool are occupied, tasks queue up. A task scheduled for 2am might not actually execute until 6am. The Airflow UI shows it as "queued" but you have to be looking at the right moment to notice.
  • Zombie tasks. The worker process dies mid-task (OOM, spot instance termination), but the metadata database still shows the task as "running." Airflow has zombie detection, but it's slow: it relies on heartbeat timeouts that default to 5 minutes, and in practice I've seen zombies linger for hours.
  • SLA misses. Airflow has SLA miss detection built in, but almost nobody configures it. You have to set sla on each task, configure the sla_miss_callback, and make sure the callback actually reaches someone. In practice, teams set it up during initial development and never update the thresholds as data volumes grow.
  • Upstream dependency failures. A sensor waits for a file that never arrives. It times out after the sensor's default timeout (7 days!) and marks itself as failed. By then, the downstream dashboard has been stale for a week.
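That last failure mode is tunable: sensors accept an explicit timeout, and reschedule mode stops a stuck sensor from occupying a worker slot while it waits. A DAG-file sketch using FileSensor; the task_id and filepath are illustrative placeholders, not from the pipeline above:

```python
from airflow.sensors.filesystem import FileSensor

# Hypothetical sensor; task_id and filepath are placeholders.
wait_for_export = FileSensor(
    task_id='wait_for_export',
    filepath='/data/incoming/export.csv',
    poke_interval=300,       # check every 5 minutes
    timeout=60 * 60 * 4,     # fail after 4 hours instead of the 7-day default
    mode='reschedule',       # release the worker slot between pokes
)
```

A 4-hour timeout turns "the dashboard has been stale for a week" into a failed task the same morning, which then blocks the downstream ping.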

How Do You Add a Dead Man's Switch to an Airflow DAG?

The most straightforward approach is to add a final task to your DAG that pings an external monitoring endpoint. This task only runs if all upstream tasks succeed, which is the default trigger rule for Airflow task dependencies. Here's how to do it with SimpleHttpOperator (renamed HttpOperator in recent versions of the HTTP provider):

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_etl_pipeline',
    default_args=default_args,
    schedule='0 2 * * *',  # Daily at 2am
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    # extract_data, transform_data, and load_data are your pipeline
    # callables, defined elsewhere in this file.
    extract = PythonOperator(
        task_id='extract',
        python_callable=extract_data,
    )

    transform = PythonOperator(
        task_id='transform',
        python_callable=transform_data,
    )

    load = PythonOperator(
        task_id='load',
        python_callable=load_data,
    )

    ping_monitor = SimpleHttpOperator(
        task_id='ping_deadping',
        http_conn_id='deadping',  # Configure in Airflow connections
        endpoint='/api/ping/YOUR_TOKEN',
        method='GET',
        retries=3,
        retry_delay=timedelta(seconds=30),
    )

    extract >> transform >> load >> ping_monitor

The ping_monitor task sits at the end of the dependency chain. It only executes if extract, transform, and load all succeed. If any upstream task fails (even after retries), the ping never fires, and your dead man's switch alerts you.

You'll need to create an Airflow connection called deadping with the host set to https://deadping.io. Do this in the Airflow UI under Admin > Connections, or via the CLI:

airflow connections add 'deadping' \
    --conn-type 'http' \
    --conn-host 'https://deadping.io'

What About Using on_success_callback Instead?

If you'd rather not add a task to your DAG, you can use the on_success_callback at the DAG level. This fires after the entire DAG run completes successfully:

import requests

def ping_deadping_on_success(context):
    """Callback that fires when the entire DAG run succeeds."""
    try:
        response = requests.get(
            'https://deadping.io/api/ping/YOUR_TOKEN',
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException as e:
        # Log but don't fail the DAG run over a monitoring ping
        print(f"DeadPing ping failed: {e}")

with DAG(
    'daily_etl_pipeline',
    schedule='0 2 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
    on_success_callback=ping_deadping_on_success,
) as dag:

    # extract, transform, and load defined as before
    extract >> transform >> load

I generally prefer the explicit task approach over the callback approach. A task shows up in the Airflow UI graph view, making it visible that this DAG is externally monitored, and it gets its own retries and logs. DAG-level callbacks, by contrast, run inside the scheduler process, so a failing ping only surfaces in the scheduler logs. Still, the callback approach is cleaner if you're adding monitoring to many DAGs and don't want to modify their dependency chains.
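If you do go the callback route across many DAGs, a small factory keeps each DAG file down to one argument. This is a standard-library sketch; the token handling is my assumption, not part of any DeadPing SDK:

```python
import urllib.request

def make_success_callback(token, base_url='https://deadping.io'):
    """Build a DAG-level on_success_callback bound to one monitor token."""
    def _callback(context):
        try:
            urllib.request.urlopen(f'{base_url}/api/ping/{token}', timeout=10)
        except OSError as e:
            # Log but never raise: a monitoring hiccup shouldn't fail the run
            print(f'DeadPing ping failed: {e}')
    return _callback
```

Each DAG then passes `on_success_callback=make_success_callback('ITS_TOKEN')` instead of defining its own callback function.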

How Do You Monitor for Scheduler Lag and Pool Exhaustion?

The dead man's switch approach handles these naturally. You don't need to directly monitor Airflow's scheduler health or pool utilization. If the scheduler is lagging and your 2am DAG doesn't run until 6am, the ping arrives 4 hours late. If your grace period is set to 2 hours, you'll get alerted at 4am, well before the team starts relying on stale data.

For pool exhaustion specifically, set your grace period based on expected execution time, not just schedule frequency. If your daily ETL normally takes 30 minutes, a 2-hour grace period gives plenty of buffer for slow days while still catching genuine stalls the same morning.
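On the monitor side, "overdue" comes down to timestamp arithmetic: alert once last ping + expected interval + grace period has passed. A minimal sketch of that decision, not DeadPing's actual implementation:

```python
from datetime import datetime, timedelta

def is_overdue(last_ping, expected_interval, grace_period, now):
    """Alert once interval + grace have elapsed since the last ping."""
    return now > last_ping + expected_interval + grace_period

# A daily 2am DAG that last pinged at 2:30am becomes overdue
# the next day at 4:30am with a 2-hour grace period.
last = datetime(2026, 1, 10, 2, 30)
interval, grace = timedelta(days=1), timedelta(hours=2)
```

This is also why grace period sizing matters: the grace period absorbs normal runtime variance, and anything beyond it is treated as a stall.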

What About Data Quality Checks?

The dead man's switch catches execution failures, but what about my opening scenario: the task that ran successfully but produced bad data? Combine the ping with a validation step:

import requests

def validate_and_ping(**context):
    """Validate output data before pinging the monitor."""
    row_count = get_loaded_row_count()  # Your validation logic

    if row_count < 100:  # Minimum expected rows
        raise ValueError(
            f"Data quality check failed: only {row_count} rows loaded, "
            f"expected at least 100"
        )

    # Data looks good - ping the monitor
    requests.get(
        'https://deadping.io/api/ping/YOUR_TOKEN',
        timeout=10,
    )

validate = PythonOperator(
    task_id='validate_and_ping',
    python_callable=validate_and_ping,
)

extract >> transform >> load >> validate

Now the ping only fires if the data meets your quality threshold. Zero rows loaded? No ping. Fewer rows than expected? No ping. The dead man's switch fires, and you investigate before anyone sees bad numbers.
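A fixed floor like 100 rows has the same weakness as the never-updated SLA thresholds above: it goes stale as volumes grow. Comparing against a trailing average of recent loads adapts automatically. A sketch, with the 50% ratio as an assumed threshold:

```python
def passes_volume_check(rows_today, recent_counts, min_ratio=0.5):
    """Pass if today's row count is at least min_ratio of the recent average."""
    if not recent_counts:
        return rows_today > 0  # no history yet: just require non-empty
    average = sum(recent_counts) / len(recent_counts)
    return rows_today >= min_ratio * average
```

You'd feed `recent_counts` from wherever you already track load metadata, and raise inside the validation task when the check fails, exactly as the row-count check above does.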

Making It Work Across Dozens of DAGs

If you're running a lot of DAGs, create a factory function that adds monitoring to any DAG:

def add_deadping_monitor(dag, final_task, token):
    """Add a DeadPing monitoring task to the end of a DAG."""
    ping = SimpleHttpOperator(
        task_id='ping_deadping',
        http_conn_id='deadping',
        endpoint=f'/api/ping/{token}',
        method='GET',
        retries=3,
        dag=dag,
    )
    final_task >> ping
    return ping
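Calling it from an existing DAG file is then a one-liner. The Variable name here is illustrative; store the token wherever you already keep secrets:

```python
from airflow.models import Variable

# Inside the DAG file, after extract >> transform >> load is wired up:
add_deadping_monitor(dag, load, Variable.get('deadping_token_daily_etl'))
```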

One line per DAG adds full dead man's switch monitoring. I use this pattern across about 15 DAGs in production right now. DeadPing has caught scheduler stalls twice, a zombie task once, and my embarrassing zero-rows-loaded scenario once more (different API, same mistake). Setup takes about five minutes per DAG: create the monitor, set the schedule and grace period, add the task. Check the getting started guide to create your first monitor, or use the API reference to automate monitor provisioning as part of your DAG deployment pipeline.
