DevOps

Why Your Nightly Backup Might Not Be Running

Matt · 6 min read

That Backup You Trust? Check It.

TL;DR: Nightly database backups fail silently when credentials expire, disk space runs out, network timeouts interrupt uploads, or pg_dump exits with a non-zero code that cron ignores. A dead man's switch that only fires after verifying backup file size and integrity catches every failure mode.

I've worked at three companies where the nightly backup was a shell script somebody wrote years ago, added to a crontab, and never looked at again. "The backups run every night" was the standard answer when anyone asked. Nobody verified it. Nobody checked the output. The backup was Schrödinger's cat: simultaneously working and not working until you actually needed to restore.

Here's how I've seen backups fail, and what I do now to make sure I know about it the same night.

Why Does pg_dump Fail Silently?

This is the most common and most dangerous pattern I see:

pg_dump mydb | gzip > /backups/nightly.sql.gz

If pg_dump fails (wrong password, can't connect, permission denied), gzip still succeeds; it happily compresses an empty stream. By default, a pipeline's exit code is the exit code of its last command, so the overall exit is 0. Your script thinks it worked. You've got a compressed, empty file that looks like a backup.

#!/bin/bash
set -euo pipefail  # This is the fix

pg_dump mydb | gzip > /backups/nightly.sql.gz

With pipefail, the pipeline returns the exit code of the rightmost command that failed. Now pg_dump errors actually propagate. I put set -euo pipefail at the top of every script I write now.
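You can see the difference with a two-command pipeline whose first command fails:

```shell
#!/bin/bash
# Without pipefail, a pipeline's status is the last command's,
# so the failure of `false` is invisible:
false | true
echo "without pipefail: $?"   # prints 0

# With pipefail, the failing command's status wins:
set -o pipefail
false | true
echo "with pipefail: $?"      # prints 1
```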

What Happens When Backup Credentials Expire?

I once spent 2 hours debugging a backup failure that turned out to be an expired AWS IAM key. The pg_dump worked fine locally, the file was there, but the aws s3 cp step had been failing for 10 days. The local backup existed, so a naive "does the file exist?" check passed. The off-site backup was missing.

Database passwords change, IAM keys rotate, service account tokens expire. If your backup pipeline has any step that touches external credentials, that step will eventually break, and the cron entry will keep running the broken script every night like nothing is wrong.
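If the AWS CLI is part of your pipeline, one cheap mitigation is to fail fast on dead credentials before the dump even starts. A sketch (the `verify_aws_credentials` helper is a name I'm introducing here; `aws sts get-caller-identity` fails when the key is expired or revoked):

```shell
#!/bin/bash
set -euo pipefail

# Returns non-zero if the current AWS credentials are missing, expired,
# or revoked. Cheap to run before a multi-gigabyte dump.
verify_aws_credentials() {
  aws sts get-caller-identity > /dev/null 2>&1
}

# In the backup script, before pg_dump:
# verify_aws_credentials || { echo "AWS credentials invalid" >&2; exit 1; }
```

This turns a silently missing off-site copy into an immediate, loud failure on the night the key dies.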

What Happens When the Backup Disk Fills Up?

Databases grow. A backup that was 500MB a year ago might be 4GB now. If nobody monitors disk usage on the backup volume, you run out of space and the next backup is truncated or zero bytes. I've added a simple size check to every backup script since:

#!/bin/bash
set -euo pipefail

pg_dump "$DATABASE_URL" | gzip > /tmp/nightly.sql.gz

# Sanity check: is this file at least 10KB?
# (GNU stat; on macOS/BSD the equivalent is: stat -f%z)
ACTUAL_SIZE=$(stat --format=%s /tmp/nightly.sql.gz 2>/dev/null || echo 0)
if [ "$ACTUAL_SIZE" -lt 10000 ]; then
  echo "Backup is only $ACTUAL_SIZE bytes; something is wrong" >&2
  exit 1
fi

mv /tmp/nightly.sql.gz /backups/nightly.sql.gz

How Do Network Timeouts Break Backup Uploads?

Uploading a multi-gigabyte backup to S3 can take minutes. Network blips, throttling, DNS issues: any of these can kill the upload silently. Most cloud CLIs have retry flags, but they're not always the default behavior.

aws s3 cp /backups/nightly.sql.gz s3://my-backups/ \
  --cli-read-timeout 300 \
  --cli-connect-timeout 30
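If the tool you upload with has no built-in retry, a small wrapper with exponential backoff covers the same failure modes. A sketch (`retry` is a helper name I'm introducing, not an AWS CLI feature):

```shell
#!/bin/bash
set -euo pipefail

# Run a command, retrying up to 3 times and doubling the delay
# between attempts (1s, then 2s).
retry() {
  local attempt=1 max=3 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Command failed after $max attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example:
# retry aws s3 cp /backups/nightly.sql.gz s3://my-backups/
```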

Why Isn't Checking the Crontab Enough?

"The backup runs every night" means "there's a crontab entry." That tells you nothing about whether the job actually ran last night, whether it succeeded, or whether the output is usable. Real verification needs three levels:

  1. Did it run? Not "is it scheduled" but "did it actually execute in the last 24 hours?"
  2. Did it succeed? Exit code 0 and a reasonably sized output file.
  3. Can you restore from it? Periodic restore tests (that's a whole separate practice).
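Level three is worth partially automating too. A minimal check that at least proves the archive decompresses and contains a dump, short of a full restore into a scratch database (a sketch; `verify_backup` is a helper name I'm introducing, and the header match assumes pg_dump's plain-text format):

```shell
#!/bin/bash
set -euo pipefail

# Cheap integrity check for a compressed pg_dump archive.
verify_backup() {
  local file="$1"

  # The gzip stream must decompress cleanly (catches truncated files)
  gzip -t "$file" || return 1

  # The first lines must look like pg_dump output, not an empty stream
  gzip -dc "$file" | head -n 5 | grep -q "PostgreSQL database dump"
}

# Example:
# verify_backup /backups/nightly.sql.gz && echo "backup looks sane"
```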

A dead man's switch covers the first two. If the job didn't run or didn't succeed, the ping never fires, and you know about it within the hour.

How Should a Production Backup Script Look?

#!/bin/bash
set -euo pipefail

BACKUP_FILE="/tmp/nightly-$(date +%Y%m%d).sql.gz"

# Dump and compress
pg_dump "$DATABASE_URL" | gzip > "$BACKUP_FILE"

# Verify it's not tiny (GNU stat; on macOS/BSD use: stat -f%z)
ACTUAL_SIZE=$(stat --format=%s "$BACKUP_FILE")
if [ "$ACTUAL_SIZE" -lt 10000 ]; then
  echo "Backup too small: $ACTUAL_SIZE bytes" >&2
  exit 1
fi

# Upload to S3
aws s3 cp "$BACKUP_FILE" "s3://my-backups/nightly/" \
  --cli-read-timeout 300

# Clean up local file
rm "$BACKUP_FILE"

# Everything worked; ping DeadPing
curl -fsS --retry 3 https://deadping.io/api/ping/your-monitor-id

If any step fails (the dump, the size check, the upload), the script exits before the curl. DeadPing notices the missing ping and sends me an alert. No log parsing, no hoping someone reads the output. I know the same night. See the Docker guide for containerized backup examples or the Terraform guide to manage monitors as infrastructure.

Start monitoring in 60 seconds

Free forever for up to 20 monitors. No credit card required.

Get Started Free