
How to Monitor pg_dump Backups with a Dead Man's Switch

Matt · 7 min read

The Backup That Wasn't There

TL;DR: pg_dump backups fail silently when passwords rotate, disk runs out, connections drop, or the dump file is created but empty. A bulletproof monitoring script verifies the file exists, exceeds a minimum size threshold, and only then pings a dead man's switch to confirm the backup actually succeeded.

Last year I got the call every engineer dreads. Production database corrupted, we need to restore from backup. I SSH'd into the backup server, navigated to the dump directory, and found the most recent file was 11 days old. The cron job that ran pg_dump every night had been failing silently since a password rotation. The script exited with a non-zero code, cron dutifully logged it to a file nobody read, and life went on. Eleven days of data, gone.

That experience taught me that running pg_dump on a schedule is only half the job. The other half is proving it actually worked. Here's how I set up PostgreSQL backup scripts now, with proper error handling and a dead man's switch that alerts me the moment a backup fails to complete.

Why Does pg_dump Fail Silently?

The root cause is almost always the same: the backup script doesn't use strict error handling. By default, bash scripts keep running after a command fails. pg_dump returns a non-zero exit code, the next line runs anyway, and your cron job reports success to nobody.

Common failure modes I've seen in production:

  • Authentication expired. Someone rotated the database password or the .pgpass file got overwritten during a deploy. pg_dump fails immediately, zero-byte file left behind.
  • Disk full. The dump starts writing, fills the partition, and pg_dump exits with an error. You're left with a truncated file that looks like a backup but won't restore.
  • Connection timeout. Network blip between the backup server and the database. The connection drops mid-stream, partial file, no alert.
  • Lock contention. A long-running transaction holds a lock that pg_dump needs. It waits, then eventually times out or gets killed.
  • Out of memory. Large databases with pg_dump --format=custom can consume significant RAM during compression. The kernel's OOM killer terminates the process silently.
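To see why these failures go unnoticed, here's a minimal illustration of bash's default behavior, with `false` standing in for a failing pg_dump:

```shell
#!/bin/bash
# No strict mode: execution continues after a failed command.
false                    # stands in for a failing pg_dump
echo "backup complete"   # runs anyway, and the script exits 0
```

Cron sees exit code 0, so even a correctly configured mail alert would report success.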

How Do You Write a Bulletproof pg_dump Script?

The foundation is set -euo pipefail at the top of every backup script. This makes bash exit on any error, treat unset variables as errors, and catch failures in piped commands. Without this, you're flying blind.

#!/bin/bash
set -euo pipefail

# Configuration
DB_HOST="your-db-host.example.com"
DB_NAME="production"
DB_USER="backup_user"
BACKUP_DIR="/var/backups/postgresql"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"
MIN_SIZE_BYTES=1048576  # 1MB minimum - adjust for your database

# Ensure backup directory exists
mkdir -p "${BACKUP_DIR}"

echo "[$(date)] Starting backup of ${DB_NAME}..."

# Run pg_dump with compression (verbose output stays on stderr,
# so it lands in the cron log instead of inside the dump file)
pg_dump \
  -h "${DB_HOST}" \
  -U "${DB_USER}" \
  -d "${DB_NAME}" \
  --no-owner \
  --no-privileges \
  --verbose | gzip > "${BACKUP_FILE}"

# Validate file exists and meets minimum size
ACTUAL_SIZE=$(stat -f%z "${BACKUP_FILE}" 2>/dev/null || stat -c%s "${BACKUP_FILE}")
if [ "${ACTUAL_SIZE}" -lt "${MIN_SIZE_BYTES}" ]; then
  echo "ERROR: Backup file is ${ACTUAL_SIZE} bytes, expected at least ${MIN_SIZE_BYTES}"
  rm -f "${BACKUP_FILE}"
  exit 1
fi

echo "[$(date)] Backup complete: ${BACKUP_FILE} (${ACTUAL_SIZE} bytes)"

# Clean up old backups
find "${BACKUP_DIR}" -name "*.sql.gz" -mtime +${RETENTION_DAYS} -delete

# Signal success to DeadPing
curl -fsS --retry 3 --max-time 10 https://deadping.io/api/ping/YOUR_TOKEN

echo "[$(date)] Backup verified and monitoring pinged successfully"

There are a few things to notice here. The set -euo pipefail at the top means if pg_dump fails, the whole script stops immediately. The gzip pipe is also covered because of pipefail. The file size check catches the sneaky case where pg_dump "succeeds" but writes an empty or nearly-empty file because the database was unreachable and it dumped nothing useful.
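A quick way to convince yourself of the pipefail behavior (a minimal sketch, again with `false` standing in for a failing pg_dump):

```shell
#!/bin/bash
# Without pipefail, a pipeline's exit status is the LAST command's status,
# so a failure early in the pipe is invisible.
bash -c 'false | cat; echo "default: $?"'                    # prints: default: 0
# With pipefail, any failing stage fails the whole pipeline.
bash -c 'set -o pipefail; false | cat; echo "pipefail: $?"'  # prints: pipefail: 1
```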

What Size Threshold Should You Set?

This depends entirely on your database. A good starting point: look at your last 10 successful backups, find the smallest one, and set your threshold to 50% of that. For a database that normally dumps to 2GB compressed, a 1GB minimum is reasonable. For a small app database that dumps to 50MB, set it to 20MB. The point isn't precision; it's catching catastrophic failures like zero-byte files or truncated dumps.
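That heuristic is easy to script. A sketch, assuming GNU coreutils (`stat -c%s`) and the file naming scheme from the backup script above:

```shell
#!/bin/bash
BACKUP_DIR="/var/backups/postgresql"

# Size of the smallest among the 10 most recent backups
SMALLEST=$(ls -t "${BACKUP_DIR}"/*.sql.gz | head -10 | xargs stat -c%s | sort -n | head -1)

# Suggested threshold: 50% of that
echo "MIN_SIZE_BYTES=$((SMALLEST / 2))"
```

Re-run it occasionally, since a growing database makes an old threshold less and less meaningful.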

How Do You Handle pg_dump Authentication Securely?

Don't put passwords in the script. Use a .pgpass file or environment variables:

# Option 1: .pgpass file (chmod 600)
# ~/.pgpass format: hostname:port:database:username:password
echo "your-db-host.example.com:5432:production:backup_user:secretpass" > ~/.pgpass
chmod 600 ~/.pgpass

# Option 2: Environment variable in crontab
# In /etc/cron.d/backup:
PGPASSWORD=secretpass
0 2 * * * backupuser /opt/scripts/backup-pg.sh >> /var/log/pg-backup.log 2>&1

If you're on AWS, use IAM database authentication or pull credentials from Secrets Manager at runtime. This way password rotations don't break your backups, but you should still monitor for it.
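For the Secrets Manager route, the fetch might look like this. This is a sketch, not a drop-in: the secret id and the `password` JSON key are hypothetical placeholders, and it assumes the AWS CLI and jq are installed with appropriate IAM permissions:

```shell
# Fetch the current password at runtime so rotations are picked up
# automatically. "prod/postgres/backup_user" is a hypothetical secret id.
PGPASSWORD=$(aws secretsmanager get-secret-value \
  --secret-id "prod/postgres/backup_user" \
  --query SecretString --output text | jq -r '.password')
export PGPASSWORD
```

pg_dump reads PGPASSWORD from the environment, so no .pgpass file is needed on the backup host.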

Why Isn't Cron's Exit Code Enough?

Cron doesn't alert you when a job fails. It writes to the system mail spool, which nobody has configured since 2008. Even if you redirect output to a log file, someone needs to actually read that log. I've worked at companies where backup logs had months of FATAL: password authentication failed lines and nobody noticed.

The dead man's switch approach flips this completely. Instead of trying to detect failures, you detect the absence of success. The curl at the end of the script only runs if everything above it succeeded: the dump, the size validation, all of it. If that ping doesn't arrive on schedule, something went wrong, and you need to investigate.

How Do You Set Up the Monitoring?

The curl call at the end of the script is the entire integration. Let's break down the flags:

curl -fsS --retry 3 --max-time 10 https://deadping.io/api/ping/YOUR_TOKEN
  • -f: fail on HTTP errors instead of saving the error page as output
  • -sS: silent mode, but still show errors
  • --retry 3: retry up to 3 times if the request fails (handles transient network issues)
  • --max-time 10: don't let a slow response hang your backup script

In DeadPing, you create a monitor with a 24-hour expected interval and a grace period that accounts for how long your backup takes. If your dump normally finishes in 20 minutes, a 1-hour grace period is reasonable. If the ping doesn't arrive within 25 hours of the last one, you get an alert via email, Slack, or PagerDuty.

I set this up for every database I manage now. It took 10 minutes per server, and it's caught two authentication failures and one disk space issue in the past six months -all before anyone noticed data was missing. Check out the getting started guide to create your first monitor, or see the API reference if you want to automate monitor creation across multiple servers.
