Database backup

Full, incremental, and log chains together determine the RPO you can claim; PITR depends on baseline backups plus continuous log retention aligned end-to-end. This page covers policy essentials, a backup-to-verification skill-flow, and a desktop RPO rough estimator (not an SLA proof).

A SKILL must state backup media (object storage, secondary site), encryption and key custody, access auditing, and who may initiate restore and run out-of-band restore tests.

For relational vs NoSQL, spell out consistency semantics: logical vs physical backup windows, replication lag impact on RPO, and caveats for large sharded tables.

Stress that “backup ≠recoverable— periodically sample-restore to an isolated environment, verify checksums, row counts, and key business queries, and document alert and ticket paths on failure.

Failed backup jobs must be handled within SLA—avoid silent backlog.
Regulated data must meet residency and redacted export rules.
Connect to disaster-recovery drill skills: same contacts, runbooks, and decision trees.

RPO and backup granularity

RPO (Recovery Point Objective)

Maximum acceptable data loss window after disaster or mistake. With only periodic full snapshots, worst case often approaches “between two fulls— with continuous WAL/log archive and a validated restore path, you can tighten claimed RPO toward “log-apply lag + batch delay—scale.

RTO vs backup shape

RTO is how fast you recover; backup type shapes the path (full restore, incremental chain, PITR replay). Cold large DB + long replay inflates RTO—document RPO/RTO assumptions and how you verify them in the SKILL.

Do not conflate async replicas, cross-region replication, and backup windows—they protect different surfaces; RPO follows the weakest link.
Logical export vs physical snapshot differ in consistency semantics; when agents write policy, state crash-consistent vs application-consistent assumptions.

# PostgreSQL backup commands

# 1. pg_dump — logical backup (application-consistent, suitable for single-DB export and cross-version migration)
pg_dump \
  --host=db.acme.internal \
  --port=5432 \
  --username=backup_user \
  --dbname=payments \
  --format=custom \               # custom format supports parallel restore
  --compress=6 \                  # compression level 0-9
  --jobs=4 \                      # parallel dump tables (with format=directory)
  --file=/backup/payments-$(date +%Y%m%d-%H%M%S).dump \
  --verbose 2>>/var/log/pg_dump.log

# Restore:
# pg_restore --dbname=payments_restore --jobs=4 /backup/payments-20240315.dump

# 2. pg_basebackup — physical backup (crash-consistent, used as PITR baseline)
pg_basebackup \
  --host=db.acme.internal \
  --port=5432 \
  --username=replication_user \
  --pgdata=/backup/base/$(date +%Y%m%d) \
  --format=tar \                  # tar format for compressed transfer
  --gzip \
  --compress=6 \
  --wal-method=stream \           # stream WAL simultaneously to ensure backup completeness
  --checkpoint=fast \             # create checkpoint quickly
  --progress \
  --verbose 2>>/var/log/pg_basebackup.log

echo "Backup size: $(du -sh /backup/base/$(date +%Y%m%d))"

PITR (point-in-time recovery)

PITR needs: a baseline (full or equivalent consistent point) plus a continuous log chain (WAL, binlog, oplog, etc.) available within retention, with restore tooling and permissions proven in a drill environment.

Log retention must cover the farthest rollback legal/ops may ask for, and be costed with full retention and compliance residency together.
Before “any timestamp—restore: confirm timezone, clock jumps, DDL, and logical replication conflicts affect replay.
Managed cloud: know the vendor PITR window, whether you also export to your own bucket, and whether cross-region replicas participate in PITR.

# PostgreSQL WAL archive configuration (postgresql.conf)
# RPO=1h means: WAL archive lag < 1h + full backup interval once per day

# Enable WAL archiving
wal_level = replica          # or logical (if logical replication needed)
archive_mode = on
archive_command = 's3cmd put %p s3://acme-backups/wal/%f'
# %p = WAL file path, %f = filename
# archive_command must return 0 on success; PG retries on failure

# WAL archive to S3 (pgBackRest is more complete for production)
# archive_command = 'pgbackrest --stanza=payments archive-push %p'

# Continuous archive verification (ensure no gaps in WAL archive)
# SELECT pg_walfile_name(pg_current_wal_lsn());  -- current WAL position
# Check if .ready files are accumulating in archive_status dir (sign of archive lag)

# PITR recovery configuration (recovery.conf / postgresql.conf PG12+)
# Recover to a specific point in time (UTC):
# restore_command = 's3cmd get s3://acme-backups/wal/%f %p'
# recovery_target_time = '2024-03-15 10:30:00 UTC'
# recovery_target_action = 'promote'   # auto-promote to primary after recovery completes

Media, encryption, and access

Object storage and regions: versioning, lifecycle, tamper protection (MFA delete, WORM, or org policy).
Keys: CMK/KMS, rotation, decrypt path on restore and break-glass; encrypt backup files and metadata alike.
Who may run restore, whether drills use an isolated VPC/account, and audit log retention.

#!/usr/bin/env bash
# backup-and-upload.sh — automated backup + S3 upload + failure alerting
set -euo pipefail

DB_HOST="${DB_HOST:-db.acme.internal}"
DB_NAME="${DB_NAME:-payments}"
S3_BUCKET="${S3_BUCKET:-s3://acme-backups}"
BACKUP_DIR="/tmp/backups"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}-${TIMESTAMP}.dump"
ALERT_WEBHOOK="${SLACK_WEBHOOK_URL:-}"

mkdir -p "$BACKUP_DIR"

cleanup() {
  rm -f "$BACKUP_FILE" "$BACKUP_FILE.gpg"
}
trap cleanup EXIT

# 1. Backup
echo "[$(date)] Starting backup: $DB_NAME"
pg_dump \
  --host="$DB_HOST" --port=5432 \
  --username=backup_user \
  --dbname="$DB_NAME" \
  --format=custom --compress=6 \
  --file="$BACKUP_FILE"

# 2. Encrypt (GPG symmetric encryption; key managed by KMS)
gpg --batch --yes --symmetric \
    --passphrase-file /run/secrets/backup_passphrase \
    --output "${BACKUP_FILE}.gpg" \
    "$BACKUP_FILE"
rm -f "$BACKUP_FILE"

# 3. Upload to S3 (server-side encryption + versioning)
aws s3 cp "${BACKUP_FILE}.gpg" \
    "${S3_BUCKET}/daily/${DB_NAME}/${TIMESTAMP}.dump.gpg" \
    --sse aws:kms \
    --storage-class STANDARD_IA \
    --metadata "db=${DB_NAME},timestamp=${TIMESTAMP}"

# 4. Record backup metadata (for monitoring alerts)
echo "${TIMESTAMP} SUCCESS ${DB_NAME}" >> /var/log/backup_history.log
echo "[$(date)] Backup completed: ${TIMESTAMP}.dump.gpg"

# crontab -e configuration (run daily at 2 AM):
# 0 2 * * * /opt/scripts/backup-and-upload.sh >> /var/log/backup-cron.log 2>&1

#!/usr/bin/env bash
# backup-verify.sh — backup restore drill (isolated environment)
# Recommended: run monthly (or on schedule in CI)
set -euo pipefail

S3_BUCKET="s3://acme-backups"
RESTORE_HOST="${RESTORE_HOST:-restore-test.internal}"
RESTORE_DB="payments_restore_$(date +%Y%m%d)"
BACKUP_FILE="/tmp/verify-restore.dump.gpg"

echo "=== Step 1: Download latest backup ==="
LATEST=$(aws s3 ls "${S3_BUCKET}/daily/payments/" \
         | sort | tail -1 | awk '{print $4}')
aws s3 cp "${S3_BUCKET}/daily/payments/${LATEST}" "$BACKUP_FILE"

echo "=== Step 2: Decrypt ==="
gpg --batch --yes --decrypt \
    --passphrase-file /run/secrets/backup_passphrase \
    --output "${BACKUP_FILE%.gpg}" \
    "$BACKUP_FILE"

echo "=== Step 3: Restore to isolated DB ==="
createdb --host="$RESTORE_HOST" "$RESTORE_DB"
pg_restore --host="$RESTORE_HOST" \
           --dbname="$RESTORE_DB" \
           --jobs=4 \
           "${BACKUP_FILE%.gpg}"

echo "=== Step 4: Verify row counts ==="
ROW_COUNT=$(psql --host="$RESTORE_HOST" --dbname="$RESTORE_DB" \
            -tAc "SELECT COUNT(*) FROM orders")
EXPECTED=100  # minimum expected row count (from monitoring baseline)
[ "$ROW_COUNT" -ge "$EXPECTED" ] \
  && echo "✅ Verify passed: orders=$ROW_COUNT" \
  || { echo "❌ Verify FAILED: orders=$ROW_COUNT < expected=$EXPECTED"; exit 1; }

echo "=== Step 5: Smoke query ==="
psql --host="$RESTORE_HOST" --dbname="$RESTORE_DB" \
  -c "SELECT status, COUNT(*) FROM orders GROUP BY status LIMIT 5;"

# Cleanup
dropdb --host="$RESTORE_HOST" "$RESTORE_DB" || true
echo "=== Restore drill completed at $(date) ==="

Backup-to-verification main flow (skill-flow)

  [ Business RPO/RTO and compliance constraints ]
        │
        ▼
  ┌─────────────│    Mix: full / incremental / log chain / replicas
  │Choose &     │──── Retention: fulls, incrementals, logs, cross-region
  │ retain      │    aligned
  └─────────────│
        │
        ▼
  ┌─────────────│    Schedule, concurrency, prod impact, failure alerts
  │Run backup   │──── Checksums, manifest, metadata (DB, LSN, time)
  └─────────────│
        │
        ▼
  ┌─────────────│    Encryption, least privilege, audit; sample restores
  │Store &      │──── Align with key rotation, cost, lifecycle policy
  │ govern      │
  └─────────────│
        │
        ▼
  ┌─────────────│    Isolated restore + PITR dry run (if promised)
  │Verify       │──── Row counts/sample queries/app smoke; gaps →tickets
  │ recovery    │
  └─────────────│
        │
        ▼
  ┌─────────────│    Match DR runbook, RTO timebox, contacts
  │Link drills  │
  └─────────────┘

The rough calculator only aligns verbal RPO with backup design at an order of magnitude; formal commitments need monitoring, drills, and contracts.

# Backup failure alerting (Prometheus AlertManager rules)
# prometheus/alerts/backup.yml
groups:
  - name: database_backup
    rules:
      # No successful backup in 26 hours (full backup every 24h, 2h tolerance)
      - alert: BackupMissing
        expr: |
          time() - max(backup_last_success_timestamp_seconds{job="pg-backup"}) > 93600
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Database backup timed out"
          description: "payments DB has had no successful backup in 26h, current RPO at risk"

      # Backup file size anomaly (sudden decrease may indicate empty backup)
      - alert: BackupSizeAnomaly
        expr: |
          backup_size_bytes{job="pg-backup"} < 1e8   # alert if less than 100MB
        for: 1m
        labels: { severity: warning }

# RPO/RTO calculation example (what RPO=1h means for configuration)
# ──────────────────────────────────────────────────
# Target: RPO = 1h (max 1 hour data loss)
# Approach: daily full backup + continuous WAL archive
#
# Configuration requirements:
#   archive_command triggers WAL switch within 60s (or WAL fills 16MB auto-switches)
#   archive_status: no .ready files accumulating (archive lag = actual RPO)
#   Monitoring: max(WAL archive timestamp) - now() < 60min
#
# RTO calculation example (100GB database):
#   pg_basebackup restore: ~20min (100GB / 100MB/s network)
#   WAL replay (up to 1h of WAL): ~5min
#   Total RTO ≈ 25min
#
# Improve RTO: use streaming replica + fast failover (<30s)
#   → RTO = failover switchover time, no need to restore full backup
# ──────────────────────────────────────────────────
# Push backup metric (bash + curl)
push_backup_metric() {
  local status=$1  # 0=success, 1=fail
  curl -s --data-binary @- \
    "http://pushgateway:9091/metrics/job/pg-backup/instance/payments" <<EOF
backup_last_success_timestamp_seconds $([ "$status" -eq 0 ] && date +%s || echo 0)
backup_size_bytes $(stat -f%z "$BACKUP_FILE" 2>/dev/null || echo 0)
EOF
}

RPO rough calculator

A simplified model for a minute-scale lower bound on RPO under design intent: without continuous logs, worst-case loss follows full-backup spacing (whole interval); with continuous logs and working PITR, baseline is max log-apply lag; then take the larger of that and async replica lag. For review drafts only.

Full backup interval (hours) Max log chain lag (minutes) Max async replica lag (minutes) Continuous log archive enabled and PITR validated

Model: without logs, RPO_baseline = full interval (h) × 60; with logs, RPO_baseline = max log lag; then max with replica lag. Does not include accidental backup deletion, unusable keys, or failed replay—list those in runbooks and drills separately.

---
name: database-backup
description: Backup commands, automation scripts, restore drills, and monitoring alerts
tags: [database, backup, postgres, disaster-recovery]
---
# Backup commands
pg_dump --format=custom --compress=6 --jobs=4 (logical backup)
pg_basebackup --wal-method=stream --gzip (physical backup, PITR baseline)
# Automated backup script
cron: 0 2 * * * (run daily at 2 AM)
steps: pg_dump → gpg encrypt → aws s3 cp --sse aws:kms
metadata: backup timestamp + size written to monitoring → Prometheus pushgateway
# WAL archiving (configuration for RPO=1h)
archive_mode=on; archive_command uploads each WAL segment to S3
monitoring: max(archive timestamp) - now() < 60min (else trigger BackupMissing alert)
RTO estimate: restore 100GB ≈ 20min + WAL replay 1h ≈ 5min → RTO≈25min
# Restore drill script
monthly: download latest backup → decrypt → pg_restore → row count verification → smoke query
isolated environment: independent VPC/account, no impact on production
drill record: actual RTO + gaps + ticket number
# Alert rules
BackupMissing: time()-last_success > 93600s (26h) → critical
BackupSizeAnomaly: size < 100MB → warning
Replication lag: pg_last_xact_replay_timestamp() lag > RPO/2 → warning

Back to skills More skills