Monitor Database Backup Cron Jobs (pg_dump / mysqldump)
Nightly database backups are the last line of defense against data loss, yet the cron jobs that run pg_dump and mysqldump are among the most likely to fail without anyone noticing. A full disk, a rotated password, or an unchecked non-zero exit code can leave every backup as an empty or zero-byte file for weeks, and you only discover it when you urgently need to restore. The only reliable way to know a backup actually succeeded is to have the backup script itself report success to an external monitor, and to raise an alert when that report stops arriving.
Why Database Backup Cron Jobs Fail Without Telling You
pg_dump and mysqldump print errors to stderr and exit with a non-zero code on failure, but cron ignores exit codes entirely. At most it mails whatever a job writes to stdout or stderr, and only if the host has a working mail transfer agent and someone reads that mailbox; otherwise the errors are discarded. The result is a job that runs on schedule, produces nothing useful, logs nothing visible, and looks healthy on every dashboard you have. This class of failure is especially dangerous for backups because a backup is rarely tested until the moment it matters most: a production incident.
- Disk full on the backup partition: pg_dump and mysqldump stream output to a file. When the target filesystem reaches 100% capacity, the write fails partway through, the tool exits non-zero, and unless cron mail is set up and read, the error message is discarded. You are left with a partial or zero-byte dump that appears to exist at the expected path.
- Authentication credentials rotated: database passwords, SSL certificates, and IAM credentials change. libpq reads .pgpass only from the home directory of the user the job runs as (or from PGPASSFILE), and ignores the file if its permissions allow any group or world access, printing only a warning to stderr. PGPASSWORD exported in an interactive shell session is invisible to the minimal cron environment. The backup starts, cannot authenticate, and exits immediately.
- Cron environment mismatch: cron runs with a minimal PATH (often just /usr/bin:/bin) and no user profile variables. pg_dump or mysqldump binaries installed in /usr/local/bin, or under versioned paths by a package manager, are not found. The command fails with 'command not found' (exit status 127), a message that goes nowhere if stderr is redirected to /dev/null.
- The cron entry was silently removed: a system update that replaces cron files, a crontab edit by a second operator, a user-level crontab wiped during account provisioning, or a server rebuild or migration that never carried the crontab over. Nothing breaks visibly; the job simply never runs again.
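- Backup completes but produces a corrupt or incomplete file: --single-transaction does not protect mysqldump against concurrent DDL, so if ALTER TABLE, DROP TABLE, RENAME TABLE, or TRUNCATE TABLE runs during the dump, the result can be a syntactically valid SQL file with missing or inconsistent table contents. pg_dump reading a table whose underlying data files are missing or damaged fails partway through, leaving a partially written archive. Either way the file size is non-zero, but the dump cannot be restored as expected.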
- Remote target unreachable: backup scripts that push to S3, SFTP, or a remote NFS mount can succeed at the dump step but fail at the transfer step. The local file exists, the exit code from the dump tool is 0, but no offsite copy was written. A script that does not explicitly check the transfer exit code will then report false success.
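Several of the failure modes above (a binary missing from cron's PATH, a .pgpass with loose permissions, a nearly full disk) can be turned into explicit, reportable errors before the dump even starts. The functions below are a minimal sketch of such preflight checks, not part of any CronJobPro tooling; the function names, the example directory, and the 1 GiB floor are illustrative assumptions, and `stat -c` assumes GNU coreutils (Linux).

```bash
#!/usr/bin/env bash
# Hypothetical preflight checks to run before pg_dump.
# Each function prints a reason to stderr and returns non-zero on failure.

# The binary must resolve on cron's PATH, not just your login shell's.
# Setting PATH explicitly at the top of the crontab helps, e.g.
#   PATH=/usr/local/bin:/usr/bin:/bin
check_command() {
  if ! command -v "$1" > /dev/null 2>&1; then
    echo "preflight: '$1' not found on PATH=$PATH" >&2
    return 1
  fi
}

# libpq ignores a password file with any group or world permission bits,
# so its octal mode must end in 00 (for example 600 or 400).
check_pgpass_perms() {
  local file="${1:-$HOME/.pgpass}" mode
  if [ ! -f "$file" ]; then
    echo "preflight: $file does not exist" >&2
    return 1
  fi
  mode=$(stat -c %a "$file")
  case "$mode" in
    *00) return 0 ;;
    *)
      echo "preflight: $file has mode $mode; run chmod 0600 $file" >&2
      return 1
      ;;
  esac
}

# Require at least MIN_KB kilobytes free on the filesystem holding DIR.
check_free_space() {
  local dir="$1" min_kb="$2" avail_kb
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  if [ "$avail_kb" -lt "$min_kb" ]; then
    echo "preflight: only ${avail_kb} KB free in $dir (need ${min_kb} KB)" >&2
    return 1
  fi
}

# Example use in a backup script (1048576 KB = 1 GiB, an arbitrary floor):
#   check_command pg_dump &&
#     check_pgpass_perms &&
#     check_free_space /var/backups/postgres 1048576 ||
#     { curl -fsS -m 10 "https://cronjobpro.com/ping/YOUR_TOKEN_HERE/fail" || true; exit 1; }
```

Because each check reports its reason on stderr and then triggers the /fail ping, a misconfigured host produces an immediate alert instead of an empty dump.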
The External Heartbeat Approach: Catch What Logs and Dashboards Miss
An external heartbeat monitor, also called a dead man's switch, inverts the usual monitoring model. Instead of watching for an error signal, it watches for the absence of a success signal. You configure a monitor with an expected period (for example, 24 hours) and a grace window (for example, 15 minutes). Your backup script calls a unique ping URL only after it has verified that the dump completed successfully. If the ping does not arrive within the period plus the grace window, the monitor fires an alert.
This approach catches the silent failures that internal logging misses. If cron never runs the job, there is no ping. If the script runs, pg_dump exits non-zero, and the curl call is gated on a zero exit code, there is no ping. If the dump writes a zero-byte file and the script validates the file size before pinging, there is no ping. If the cron entry was deleted last week, there is no ping. None of these scenarios produces a log entry you would think to check; all of them produce a missing heartbeat that the external monitor detects automatically.
CronJobPro provides this as a heartbeat monitor type. Each monitor gets a unique ping URL at https://cronjobpro.com/ping/<token>. You can also call https://cronjobpro.com/ping/<token>/fail to explicitly report a known failure, and https://cronjobpro.com/ping/<token>/exitcode/<n> to report the raw exit code for automated parsing. Alerts route to email, Slack, Discord, Microsoft Teams, PagerDuty, Opsgenie, or any webhook endpoint you configure. Because the monitor runs entirely outside your infrastructure, it keeps alerting even if the server running the cron job is down, rebooted, or has a disk so full that local logs cannot be written.
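The three endpoints can be wrapped in one small helper so every branch of a backup script reports the same way. This is a sketch that relies only on the URL patterns described above; the `heartbeat` function name and the 10-second timeout are illustrative choices, and `YOUR_TOKEN_HERE` stands for your monitor's token.

```bash
#!/usr/bin/env bash
# Hypothetical helper around the three CronJobPro ping endpoints.
PING_BASE="https://cronjobpro.com/ping/YOUR_TOKEN_HERE"

# Usage: heartbeat success | heartbeat fail | heartbeat <exit code>
heartbeat() {
  local url
  case "${1:-}" in
    success) url="$PING_BASE" ;;
    fail) url="$PING_BASE/fail" ;;
    '' | *[!0-9]*)
      echo "heartbeat: expected success, fail, or a numeric exit code, got '${1:-}'" >&2
      return 2
      ;;
    *) url="$PING_BASE/exitcode/$1" ;;
  esac
  # -f: treat HTTP errors as failures; -sS: silent except for errors;
  # -m 10: cap each request at 10 seconds so a network problem cannot stall the job.
  curl -fsS -m 10 --retry 3 "$url" > /dev/null ||
    echo "heartbeat: ping to $url failed" >&2
}
```

A script would call `heartbeat success` after verification, `heartbeat fail` in a validation branch, and `heartbeat "$rc"` to forward a tool's exit code. A failed ping is logged to stderr but the function still returns 0, so a network blip cannot turn a good backup into a reported failure.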
Add a heartbeat to database backups
1. Create a heartbeat monitor in CronJobPro
Log in to CronJobPro and create a new heartbeat monitor. Set the expected period to match your cron schedule: 24 hours for a nightly job, 1 hour for an hourly job. Set a grace window of 10 to 15 minutes to absorb normal runtime variance. Copy the unique ping URL assigned to this monitor; it looks like https://cronjobpro.com/ping/<token>. Configure at least one alert channel (email at minimum; add Slack or PagerDuty if you have on-call rotations).
2. Wrap your backup command so the ping is gated on verified success
Do not add a bare curl ping at the end of your script. A bare ping fires even if earlier commands failed, because bash by default keeps executing after a command exits non-zero. Instead, capture the exit code of pg_dump or mysqldump explicitly, check that the output file is not empty or implausibly small, and only then call the ping URL. That way an empty or truncated dump is treated as a failure even when the exit code alone did not reveal the problem.
3. Add explicit failure reporting for known error conditions
Use the /fail endpoint to report known failures rather than simply not pinging. Call https://cronjobpro.com/ping/<token>/fail in your error-handling path. This gives CronJobPro an immediate signal instead of waiting for the period to expire, cutting time-to-alert from potentially more than 24 hours to seconds. Call it from a trap handler as well, so unexpected script exits are also reported.
4. Test the failure path, not just the success path
After wiring the heartbeat, verify that a failure actually triggers an alert. The easiest method is to temporarily point your backup script at a non-existent database or a full test filesystem, run it manually, and confirm that the script reports the failure (via /fail or /exitcode/<n>) and that an alert arrives on your configured channel. Many teams only test that the success ping works and discover years later that their alert routing was broken.
5. Verify the cron schedule matches the monitor period
A common misconfiguration is setting the monitor period to 24 hours while the cron schedule is 0 3 * * 1-5 (weekdays only). The gap from Friday's 03:00 run to Monday's is 72 hours, so the monitor fires spuriously every weekend. Either set the monitor period to cover the longest gap in your schedule or widen the grace window by the difference; for weekly backups, use a 7-day period with a matching grace window. Document the intended schedule in a comment above the cron entry so future operators do not change the cron schedule without also updating the monitor.
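The longest-gap arithmetic from step 5 is easy to get wrong for irregular schedules. The hypothetical `longest_gap_hours` function below is a sketch that assumes every run starts at the same time of day and takes cron day-of-week numbers (0 = Sunday) in ascending order; for `0 3 * * 1-5` it reports the 72-hour Friday-to-Monday gap that the monitor has to tolerate.

```bash
#!/usr/bin/env bash
# longest_gap_hours DAY...
# Prints the longest interval, in hours, between consecutive runs of a job
# that runs at the same time of day on the given cron weekdays (0 = Sunday).
# Days must be listed in ascending order without duplicates.
longest_gap_hours() {
  local days=("$@") n=$# max=0 i gap
  for ((i = 0; i < n; i++)); do
    # Days until the next run, wrapping from the last listed day to the first.
    gap=$(( (days[(i + 1) % n] - days[i] + 7) % 7 ))
    if [ "$gap" -eq 0 ]; then
      gap=7 # a single weekday: the next run is one week later
    fi
    if [ "$gap" -gt "$max" ]; then
      max=$gap
    fi
  done
  echo $(( max * 24 ))
}

# Example: the weekday schedule 0 3 * * 1-5
longest_gap_hours 1 2 3 4 5 # prints 72
```

Whichever option you pick, any setting wide enough to span the weekend gap also delays the alert for a missed weekday run by the same amount.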
```bash
#!/usr/bin/env bash
# backup-postgres.sh: production pg_dump with CronJobPro heartbeat
# Crontab entry: 0 2 * * * /opt/scripts/backup-postgres.sh >> /var/log/pg-backup.log 2>&1
set -euo pipefail

HEARTBEAT_URL="https://cronjobpro.com/ping/YOUR_TOKEN_HERE"
HEARTBEAT_FAIL_URL="${HEARTBEAT_URL}/fail"
DB_NAME="myapp_production"
DB_USER="backup_user"
DB_HOST="127.0.0.1"
DB_PORT="5432"
BACKUP_DIR="/var/backups/postgres"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTPUT_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"
MIN_BYTES=1024 # treat anything smaller as a failed dump

# Report failure and exit on any unhandled error (set -e + ERR trap).
# -m 10 caps each request at 10 seconds so a network problem cannot hang the job.
on_error() {
  echo "[ERROR] Backup failed at line $1"
  curl -fsS -m 10 --retry 3 "${HEARTBEAT_FAIL_URL}" || true
  exit 1
}
trap 'on_error "${LINENO}"' ERR

mkdir -p "${BACKUP_DIR}"
echo "[$(date -Iseconds)] Starting pg_dump for ${DB_NAME}"

# pg_dump with custom format (-Fc) for pg_restore compatibility; use -Fp for plain SQL.
# Credentials are read from ~/.pgpass (chmod 0600) of the user cron runs as,
# or from PGPASSWORD; --no-password makes pg_dump fail instead of prompting.
# "|| DUMP_EXIT=$?" keeps set -e and the ERR trap from firing here,
# so the real exit code can be reported below.
DUMP_EXIT=0
pg_dump \
  -h "${DB_HOST}" \
  -p "${DB_PORT}" \
  -U "${DB_USER}" \
  -Fc \
  --no-password \
  "${DB_NAME}" \
  > "${OUTPUT_FILE}" || DUMP_EXIT=$?

if [ "${DUMP_EXIT}" -ne 0 ]; then
  echo "[ERROR] pg_dump exited with code ${DUMP_EXIT}"
  curl -fsS -m 10 --retry 3 "${HEARTBEAT_URL}/exitcode/${DUMP_EXIT}" || true
  exit "${DUMP_EXIT}"
fi

# Validate the file size: catches empty or truncated dumps.
# stat -c%s is GNU coreutils; on BSD/macOS use stat -f%z.
ACTUAL_BYTES=$(stat -c%s "${OUTPUT_FILE}" 2>/dev/null || echo 0)
if [ "${ACTUAL_BYTES}" -lt "${MIN_BYTES}" ]; then
  echo "[ERROR] Dump file is ${ACTUAL_BYTES} bytes, below the minimum of ${MIN_BYTES}"
  curl -fsS -m 10 --retry 3 "${HEARTBEAT_FAIL_URL}" || true
  exit 1
fi
echo "[$(date -Iseconds)] Dump succeeded: ${OUTPUT_FILE} (${ACTUAL_BYTES} bytes)"

# Optional: remove backups older than 7 days
find "${BACKUP_DIR}" -name "${DB_NAME}_*.dump" -mtime +7 -delete

# Ping the heartbeat ONLY after verified success.
# --retry 3 handles transient network issues; "|| true" keeps a failed ping
# from tripping set -e and the ERR trap, which would report a good backup as failed.
curl -fsS -m 10 --retry 3 "${HEARTBEAT_URL}" || true
echo "[$(date -Iseconds)] Heartbeat sent. Backup complete."

# mysqldump equivalent: replace the pg_dump block above with this, and change
# OUTPUT_FILE and the cleanup pattern to use a .sql.gz extension.
# The option file (example path; must be the first option) holds the password,
# because --password=... on the command line is visible in the process list.
# "|| DUMP_STATUS=(...)" reads PIPESTATUS before any other command overwrites it
# and without tripping set -e or the ERR trap.
# DUMP_STATUS=(0 0)
# mysqldump \
#   --defaults-extra-file=/etc/mysql/backup.cnf \
#   --host="${DB_HOST}" \
#   --user="${DB_USER}" \
#   --single-transaction \
#   --routines \
#   --triggers \
#   "${DB_NAME}" \
#   | gzip > "${OUTPUT_FILE}" || DUMP_STATUS=("${PIPESTATUS[@]}")
# DUMP_EXIT=${DUMP_STATUS[0]} # mysqldump's exit code; DUMP_STATUS[1] is gzip's
```
Frequently asked questions
My backup script exits with code 0 but the dump file is empty or corrupt. Will the heartbeat approach catch this?
Only if you explicitly validate the file before pinging. An exit code of 0 does not guarantee a usable dump. For example, when mysqldump writes through gzip without set -o pipefail, $? reports gzip's status rather than mysqldump's, so a failed dump can look successful; reading PIPESTATUS incorrectly has the same effect. The code example above shows how to stat the output file and gate the ping on a minimum byte threshold. For stronger validation, pg_restore --list reads a custom-format archive's table of contents without performing a restore, which catches files that are not valid archives at all, though it does not read every data block.
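The stronger checks can be sketched as two small functions. `pg_restore --list` and `gzip -t` are standard tools; the trailer check assumes mysqldump's default behavior of ending its output with a `-- Dump completed on ...` comment line, which is absent if you run it with `--skip-comments`. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical post-dump validation helpers; call them before the success ping.

# Custom-format pg_dump archive (-Fc): pg_restore must be able to read its
# table of contents. This does not read every data block.
validate_pg_archive() {
  pg_restore --list "$1" > /dev/null
}

# Gzipped mysqldump output: the gzip stream must be intact, and the last
# line must be mysqldump's completion comment (absent with --skip-comments).
validate_mysql_gz() {
  local file="$1" last_line
  gzip -t "$file" 2> /dev/null || return 1
  last_line=$(gzip -dc "$file" | tail -n 1)
  case "$last_line" in
    "-- Dump completed"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Either function can sit directly after the byte-size check, for example `validate_pg_archive "${OUTPUT_FILE}" || { curl -fsS -m 10 "${HEARTBEAT_FAIL_URL}" || true; exit 1; }`.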
How is a heartbeat monitor different from just checking cron logs or setting up email on cron failure?
Cron has no notion of success or failure. It mails whatever a job writes to stdout or stderr, and only if the host has a working mail transfer agent; most production systems either redirect that output to a file or /dev/null, or never configured a local MTA. Even when mail works, it only tells you that the job printed something: it does not fire when the job runs quietly but produces a bad result, when the cron entry is deleted, or when the server itself is unreachable. A heartbeat monitor requires no configuration on the server side beyond a single curl call and fires automatically when the expected ping does not arrive, regardless of the reason.
What period and grace window should I set for a nightly backup that typically takes 8 minutes?
Set the period to 24 hours and the grace window to 20 to 30 minutes. The grace window should comfortably exceed your worst-case runtime to avoid false alerts when the database is temporarily under load and the backup takes longer than usual. If your backup time is growing consistently — which itself signals a problem worth investigating — widen the grace window incrementally rather than silencing alerts. Avoid setting a very large grace window (several hours) because it delays the alert when a real failure occurs.
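To see when an alert would actually fire, add the period and grace window to the time of the last successful ping. The sketch below assumes the monitor measures the period from the last ping it received, as the heartbeat model described earlier implies, and uses GNU `date`; the `alert_deadline` name and the 02:08 ping time (a 02:00 job that took 8 minutes) are invented for illustration.

```bash
#!/usr/bin/env bash
# alert_deadline LAST_PING PERIOD_HOURS GRACE_MINUTES
# Prints the latest time the next ping can arrive before an alert fires
# (all times in UTC; requires GNU date for -d).
alert_deadline() {
  local last_epoch
  last_epoch=$(TZ=UTC date -d "$1" +%s)
  TZ=UTC date -d "@$(( last_epoch + $2 * 3600 + $3 * 60 ))" '+%Y-%m-%d %H:%M'
}

alert_deadline "2024-05-01 02:08" 24 20 # prints 2024-05-02 02:28
```

With a 20-minute grace window, a run that has not pinged by 02:28 the next day triggers the alert, which leaves 20 minutes of slack for a slow night.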
Should I use /ping/<token>/fail or just not ping at all when the backup fails?
Use /fail when you know a failure occurred. Calling the /fail endpoint sends an immediate alert rather than waiting for the period to expire. For a 24-hour monitor, not pinging means you will not receive an alert until 24 hours plus the grace window have elapsed — potentially a full day after the failure. The /fail endpoint fires the alert within seconds. Combine both approaches: use a bash trap ERR handler to call /fail on unexpected exits, and call /fail explicitly in any error-handling branch you write, so the monitor is informed as soon as possible regardless of how the failure happened.
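One caveat with a trap ERR handler alone: bash does not run the ERR trap when the script is stopped by a signal such as SIGTERM, or when the script itself calls exit with a non-zero status. A trap on EXIT sees every exit path. The sketch below is an alternative built on that idea, not a CronJobPro requirement; it reports the script's final status through the /exitcode/<n> endpoint, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical EXIT-trap reporter: report the script's final status, whatever caused it.
PING_URL="https://cronjobpro.com/ping/YOUR_TOKEN_HERE"

report_exit() {
  local status=$? # exit status of the script at the moment the EXIT trap runs
  if [ "$status" -eq 0 ]; then
    curl -fsS -m 10 --retry 3 "$PING_URL" > /dev/null || true
  else
    curl -fsS -m 10 --retry 3 "$PING_URL/exitcode/$status" > /dev/null || true
  fi
}

install_heartbeat_traps() {
  trap report_exit EXIT
  # Turn SIGTERM and SIGINT into ordinary exits (128 + signal number)
  # so the EXIT trap runs and reports them.
  trap 'exit 143' TERM
  trap 'exit 130' INT
}

# Usage at the top of a backup script:
#   set -euo pipefail
#   install_heartbeat_traps
#   ...dump, validate, upload...
```

This only works if the script exits 0 solely after the dump has been validated, which set -e plus an explicit non-zero exit in every validation branch provides. The EXIT trap should replace the explicit success ping rather than supplement it, or the monitor receives two pings per run.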
Can I use this same pattern for mysqldump, pg_basebackup, or cloud-native backup tools like pg_dump to S3 via aws cli?
Yes. The heartbeat pattern is tool-agnostic — it depends only on your ability to run curl after confirming success. For mysqldump, replace the pg_dump call and use PIPESTATUS[0] to capture the exit code when piping through gzip or lz4. For pg_basebackup, check both the exit code and that the output directory is non-empty. For multi-step scripts that dump locally then upload to S3 with the AWS CLI, capture the exit code of both the dump and the aws s3 cp commands separately, call /fail if either fails, and only ping the success URL when both have completed with code 0 and the file size check passes.
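The dump-then-upload case can be sketched as a single function. It assumes a pg_dump-to-S3 flow with the AWS CLI already configured for the cron user and connection settings supplied through libpq environment variables (PGHOST, PGUSER) or ~/.pgpass; the bucket URI, the `backup_and_upload` and `ping_url` names, and the 1024-byte floor are placeholders. A real script would also keep the trap handling from the script above.

```bash
#!/usr/bin/env bash
# Hypothetical dump-then-upload step; every name below is a placeholder.
PING_URL="https://cronjobpro.com/ping/YOUR_TOKEN_HERE"
S3_URI="s3://example-backup-bucket/postgres/" # trailing slash keeps the local file name
MIN_BYTES=1024

# Ping without ever failing the caller; -m 10 bounds each request.
ping_url() { curl -fsS -m 10 --retry 3 "$1" > /dev/null || true; }

backup_and_upload() {
  local db="$1" out="$2" dump_rc=0 upload_rc=0 bytes

  # Step 1: local dump. "|| dump_rc=$?" captures the code without tripping set -e.
  pg_dump -Fc --no-password "$db" > "$out" || dump_rc=$?
  if [ "$dump_rc" -ne 0 ]; then
    ping_url "$PING_URL/exitcode/$dump_rc"
    return "$dump_rc"
  fi

  # Step 2: size check (GNU stat).
  bytes=$(stat -c%s "$out" 2> /dev/null || echo 0)
  if [ "$bytes" -lt "$MIN_BYTES" ]; then
    ping_url "$PING_URL/fail"
    return 1
  fi

  # Step 3: offsite copy, checked separately from the dump.
  aws s3 cp "$out" "$S3_URI" || upload_rc=$?
  if [ "$upload_rc" -ne 0 ]; then
    ping_url "$PING_URL/exitcode/$upload_rc"
    return "$upload_rc"
  fi

  # Success ping only after all three steps pass.
  ping_url "$PING_URL"
}

# Example: backup_and_upload myapp_production "/var/backups/postgres/myapp_$(date +%Y%m%d).dump"
```

Reporting the upload's exit code separately makes a failed offsite copy visible as its own alert instead of hiding behind a successful local dump.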