erp-core/docs/07-devops/BACKUP-RECOVERY.md

BACKUP & RECOVERY - ERP Generic

Last updated: 2025-11-24 | Owner: DevOps Team / DBA Team | Status: Production-Ready


TABLE OF CONTENTS

  1. Overview
  2. Backup Strategy
  3. Backup Scripts
  4. Multi-Tenant Backup Isolation
  5. Retention Policy
  6. Recovery Procedures
  7. Point-in-Time Recovery (PITR)
  8. Disaster Recovery Playbook
  9. Backup Testing
  10. References

1. OVERVIEW

1.1 Backup Architecture

┌─────────────────────────────────────────────────────────────┐
│                   PostgreSQL 16 Database                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  auth    │  │  core    │  │ financial│  │ inventory│   │
│  │  schema  │  │  schema  │  │  schema  │  │  schema  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ purchase │  │  sales   │  │ analytics│  │ projects │   │
│  │  schema  │  │  schema  │  │  schema  │  │  schema  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │           │
│  ┌──────────┐                                               │
│  │  system  │                                               │
│  │  schema  │                                               │
│  └────┬─────┘                                               │
└───────┼───────────────────────────────────────────────────┘
        │
        ↓ (Automated backup every 4 hours)
┌─────────────────────────────────────────────────────────────┐
│              Local Backup Storage (/backups)                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Full Backup  │  │  Incremental │  │   Per-Schema │     │
│  │   (Daily)    │  │  (4 hours)   │  │    Backups   │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼──────────────────┼──────────────────┼─────────────┘
          │                  │                  │
          │ (Sync every hour)│                  │
          ↓                  ↓                  ↓
┌─────────────────────────────────────────────────────────────┐
│           Cloud Storage (S3 / Azure Blob / GCS)             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Versioning Enabled | Lifecycle Rules | Encryption   │  │
│  │  Retention: 7d + 4w + 12m                            │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              WAL (Write-Ahead Logs) Archive                 │
│  Continuous archiving for Point-in-Time Recovery (PITR)    │
│  Retention: 7 days                                          │
└─────────────────────────────────────────────────────────────┘

1.2 Backup Objectives

RTO (Recovery Time Objective): 4 hours

  • Maximum time to restore service after failure

RPO (Recovery Point Objective): 15 minutes

  • Maximum data loss acceptable (via WAL archiving)

Data Durability: 99.999999999% (11 nines)

  • Multi-region cloud storage replication

Backup Types:

  1. Full Backup: Complete database snapshot (daily at 2:00 AM)
  2. Incremental Backup: Changes since last full backup (every 4 hours)
  3. WAL Archive: Continuous archiving for PITR (every 60 seconds)
  4. Per-Schema Backup: Individual schema backups for multi-tenant isolation (daily)

2. BACKUP STRATEGY

2.1 Backup Schedule

| Backup Type | Frequency | Retention | Size (Est.) | Duration |
|---|---|---|---|---|
| Full Backup | Daily (2:00 AM) | 7 days local + 12 months cloud | 50-100 GB | 30-45 min |
| Incremental Backup | Every 4 hours | 7 days local | 5-10 GB | 5-10 min |
| WAL Archive | Continuous (every 16 MB) | 7 days | 100-200 GB/week | Real-time |
| Per-Schema Backup | Daily (3:00 AM) | 7 days local + 4 weeks cloud | 5-15 GB each | 5-10 min each |
| Config Backup | On change + daily | 30 days | <100 MB | <1 min |

2.2 Backup Storage Locations

Primary Storage (Local):

  • Path: /backups/postgres/
  • Filesystem: XFS (optimized for large files)
  • Capacity: 500 GB minimum
  • RAID: RAID 10 (performance + redundancy)

Secondary Storage (Cloud):

  • Provider: AWS S3 / Azure Blob Storage / Google Cloud Storage
  • Bucket: erp-generic-backups-prod
  • Region: Multi-region replication (e.g., us-east-1 + us-west-2)
  • Encryption: AES-256 at rest
  • Versioning: Enabled (30 versions max)
  • Lifecycle Rules:
    • Move to Glacier after 30 days
    • Delete after 1 year (except annual backups)

Tertiary Storage (Offsite):

  • Type: Tape backup / Cold storage
  • Frequency: Monthly
  • Retention: 7 years (compliance requirement)

2.3 Backup Validation

Automated Validation (Daily):

# 1. Record a checksum at backup time (verified later with `md5sum -c`)
md5sum /backups/postgres/full_20251124_020000.dump > /backups/postgres/full_20251124_020000.dump.md5

# 2. Test restore to staging environment (weekly)
pg_restore --dbname=erp_generic_staging --clean --if-exists /backups/postgres/full_20251124_020000.dump

# 3. Run smoke tests on restored database
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM auth.users;"
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM core.partners;"

# 4. Compare row counts with production (n_live_tup is an estimate from the
#    statistics collector, so expect minor differences; qualify table names
#    with the schema to avoid collisions)
diff <(psql -d erp_generic -tAc "SELECT schemaname||'.'||tablename, n_live_tup FROM pg_stat_user_tables ORDER BY 1") \
     <(psql -d erp_generic_staging -tAc "SELECT schemaname||'.'||tablename, n_live_tup FROM pg_stat_user_tables ORDER BY 1")

Manual Validation (Monthly):

  • Full disaster recovery drill
  • Restore to isolated environment
  • Verify business-critical data
  • Test application functionality
  • Document findings in post-mortem

3. BACKUP SCRIPTS

3.1 Full Backup Script

File: scripts/backup-postgres.sh

#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Full Backup Script
# Performs full database backup with multi-schema support
# =====================================================

set -euo pipefail

# Configuration
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
RETENTION_DAYS=7
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

# Logging
LOG_FILE="/var/log/erp-generic/backup.log"
exec > >(tee -a "$LOG_FILE")
exec 2>&1

echo "===== PostgreSQL Backup Started at $(date) ====="

# Create backup directory if not exists
mkdir -p "$BACKUP_DIR"

# 1. Full Database Backup
echo "1. Creating full database backup..."
FULL_BACKUP_FILE="${BACKUP_DIR}/full_${TIMESTAMP}.dump"
# Under `set -e` a bare `$?` check is unreachable on failure, so run the
# dump as the `if` condition itself
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -v -d "$DB_NAME" -f "$FULL_BACKUP_FILE"; then
    echo "✓ Full backup created: $FULL_BACKUP_FILE"
    FILE_SIZE=$(du -h "$FULL_BACKUP_FILE" | cut -f1)
    echo "  Size: $FILE_SIZE"
else
    echo "✗ Full backup failed!"
    exit 1
fi

# 2. Generate MD5 checksum
echo "2. Generating checksum..."
md5sum "$FULL_BACKUP_FILE" > "${FULL_BACKUP_FILE}.md5"
echo "✓ Checksum saved: ${FULL_BACKUP_FILE}.md5"

# 3. Per-Schema Backups (Multi-Tenant Isolation)
echo "3. Creating per-schema backups..."
SCHEMAS=("auth" "core" "financial" "inventory" "purchase" "sales" "analytics" "projects" "system")

for schema in "${SCHEMAS[@]}"; do
    SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
    echo "  Backing up schema: $schema"

    # Run as the `if` condition so one failed schema doesn't abort the loop
    # under `set -e`
    if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -n "$schema" -d "$DB_NAME" -f "$SCHEMA_BACKUP_FILE"; then
        SCHEMA_SIZE=$(du -h "$SCHEMA_BACKUP_FILE" | cut -f1)
        echo "  ✓ $schema backup created ($SCHEMA_SIZE)"
    else
        echo "  ✗ $schema backup failed!"
    fi
done

# 4. Backup Database Roles and Permissions
echo "4. Backing up database roles..."
ROLES_BACKUP_FILE="${BACKUP_DIR}/roles_${TIMESTAMP}.sql"
pg_dumpall -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" --roles-only -f "$ROLES_BACKUP_FILE"
echo "✓ Roles backup created: $ROLES_BACKUP_FILE"

# 5. Backup PostgreSQL Configuration
echo "5. Backing up PostgreSQL configuration..."
CONFIG_BACKUP_DIR="${BACKUP_DIR}/config_${TIMESTAMP}"
mkdir -p "$CONFIG_BACKUP_DIR"

# Copy config files (if accessible)
if [ -f /etc/postgresql/16/main/postgresql.conf ]; then
    cp /etc/postgresql/16/main/postgresql.conf "$CONFIG_BACKUP_DIR/"
    cp /etc/postgresql/16/main/pg_hba.conf "$CONFIG_BACKUP_DIR/"
    echo "✓ Config files backed up"
fi

# 6. Backup WAL Archive Status
echo "6. Recording WAL archive status..."
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -tAc "SELECT * FROM pg_stat_archiver;" > "${BACKUP_DIR}/wal_status_${TIMESTAMP}.txt"

# 7. Upload to Cloud Storage (S3)
if [ -n "${AWS_S3_BUCKET:-}" ]; then
    echo "7. Uploading to S3..."
    aws s3 cp "$FULL_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/" --storage-class STANDARD_IA
    aws s3 cp "${FULL_BACKUP_FILE}.md5" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/"

    # Upload per-schema backups
    for schema in "${SCHEMAS[@]}"; do
        SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
        if [ -f "$SCHEMA_BACKUP_FILE" ]; then
            aws s3 cp "$SCHEMA_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/schemas/" --storage-class STANDARD_IA
        fi
    done

    echo "✓ Backup uploaded to S3"
fi

# 8. Cleanup Old Backups (Local)
echo "8. Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.sql" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.md5" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type d -name "config_*" -mtime +$RETENTION_DAYS -exec rm -rf {} + 2>/dev/null || true
echo "✓ Old backups cleaned up"

# 9. Verify Backup Integrity
echo "9. Verifying backup integrity..."
if md5sum -c "${FULL_BACKUP_FILE}.md5"; then
    echo "✓ Backup integrity verified"
else
    echo "✗ Backup integrity check failed!"
    exit 1
fi

# 10. Send Notification
echo "10. Sending backup notification..."
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)

# Slack notification (optional)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    curl -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"✅ PostgreSQL backup completed successfully\n• Database: $DB_NAME\n• Size: $FILE_SIZE\n• Timestamp: $TIMESTAMP\n• Total backup dir size: $BACKUP_SIZE\"}"
fi

echo "===== PostgreSQL Backup Completed at $(date) ====="
echo "Total backup size: $BACKUP_SIZE"
echo "Backup location: $BACKUP_DIR"

# Exit successfully
exit 0

3.2 Incremental Backup Script

File: scripts/backup-postgres-incremental.sh

#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Incremental Backup Script
# NOTE: pg_dump cannot produce true incrementals; this takes a full logical
# dump on the incremental schedule. For real incrementals, rely on WAL
# archiving (section 3.3) or pg_basebackup.
# =====================================================

set -euo pipefail

BACKUP_DIR="/backups/postgres/incremental"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

mkdir -p "$BACKUP_DIR"

echo "===== Incremental Backup Started at $(date) ====="

# Get last full backup timestamp
LAST_FULL_BACKUP=$(ls -t /backups/postgres/full_*.dump | head -1 | grep -oP '\d{8}_\d{6}')
echo "Last full backup: $LAST_FULL_BACKUP"

# Incremental backup: Export only changed data (simplified approach)
# In production, consider using WAL-based incremental backups or pg_basebackup

INCREMENTAL_FILE="${BACKUP_DIR}/incremental_${TIMESTAMP}.dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$DB_NAME" -f "$INCREMENTAL_FILE"

echo "✓ Incremental backup created: $INCREMENTAL_FILE"
echo "===== Incremental Backup Completed at $(date) ====="

3.3 WAL Archiving Configuration

File: postgresql.conf (WAL archiving section)

# WAL Settings for Point-in-Time Recovery
wal_level = replica
archive_mode = on
# NOTE: postgresql.conf does not expand environment variables such as
# ${AWS_S3_BUCKET}; delegate the cloud copy to a wrapper script instead
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
archive_timeout = 60s
max_wal_senders = 3
wal_keep_size = 1GB
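PostgreSQL runs archive_command via the shell, but postgresql.conf itself does not expand environment variables, so cloud-copy logic is commonly delegated to a wrapper script. A minimal sketch, assuming a hypothetical path like /opt/erp-generic/scripts/archive-wal.sh invoked as `archive_command = '/opt/erp-generic/scripts/archive-wal.sh %p %f'`:

```shell
# Hypothetical wrapper for archive_command; PostgreSQL passes %p (path to the
# WAL segment) and %f (segment file name) as the two arguments.
archive_wal() {
    local wal_path="$1"    # %p: server-relative path to the WAL segment
    local wal_file="$2"    # %f: segment file name
    local archive="${WAL_ARCHIVE_DIR:-/backups/wal}"

    # archive_command must refuse to overwrite an already-archived segment
    [ -f "${archive}/${wal_file}" ] && return 0

    cp "$wal_path" "${archive}/${wal_file}" || return 1

    # Returning non-zero here would make PostgreSQL retry this segment later
    if [ -n "${AWS_S3_BUCKET:-}" ]; then
        aws s3 cp "${archive}/${wal_file}" "s3://${AWS_S3_BUCKET}/wal/"
    fi
}
```

A stricter setup would fail the whole command when the S3 copy fails, so PostgreSQL keeps retrying until the segment is safely off-host.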

3.4 Cron Schedule

File: /etc/cron.d/erp-backup

# ERP Generic Backup Schedule

# Full backup daily at 2:00 AM
0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh >> /var/log/erp-generic/backup.log 2>&1

# Incremental backup every 4 hours
0 */4 * * * root /opt/erp-generic/scripts/backup-postgres-incremental.sh >> /var/log/erp-generic/backup.log 2>&1

# Verify backups daily at 4:00 AM
0 4 * * * root /opt/erp-generic/scripts/verify-backup.sh >> /var/log/erp-generic/backup.log 2>&1

# Cleanup old WAL files daily at 5:00 AM
0 5 * * * root find /backups/wal -type f -mtime +7 -delete

# Weekly full disaster recovery test (Sundays at 3:00 AM)
0 3 * * 0 root /opt/erp-generic/scripts/test-restore.sh >> /var/log/erp-generic/backup-test.log 2>&1
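The verify-backup.sh referenced in the schedule above is not shown in this document; a minimal sketch of what it might check, with paths and thresholds as assumptions:

```shell
#!/bin/bash
# Hypothetical sketch of scripts/verify-backup.sh: checks that the newest
# full backup exists, is fresh, and matches its recorded checksum.

BACKUP_DIR="${BACKUP_DIR:-/backups/postgres}"
MAX_AGE_HOURS="${MAX_AGE_HOURS:-26}"   # daily schedule plus some slack

# Newest full backup by modification time
latest_backup() {
    ls -t "$BACKUP_DIR"/full_*.dump 2>/dev/null | head -1
}

verify_latest() {
    local file age_sec
    file=$(latest_backup)
    if [ -z "$file" ]; then
        echo "✗ No full backup found in $BACKUP_DIR"
        return 1
    fi

    # Freshness: the newest backup must be younger than MAX_AGE_HOURS
    age_sec=$(( $(date +%s) - $(stat -c %Y "$file") ))
    if [ "$age_sec" -gt $(( MAX_AGE_HOURS * 3600 )) ]; then
        echo "✗ Latest backup is stale: $file"
        return 1
    fi

    # Integrity: checksum must match the .md5 written at backup time
    if ! md5sum -c "${file}.md5" >/dev/null 2>&1; then
        echo "✗ Checksum mismatch: $file"
        return 1
    fi

    echo "✓ Backup verified: $file"
}
```

A fuller version could also run `pg_restore --list` on the archive to confirm it is readable without touching a database.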

4. MULTI-TENANT BACKUP ISOLATION

4.1 Per-Tenant Backup Strategy

Why Per-Schema Backups?

  • Restore individual tenant without affecting others
  • Compliance: GDPR right to erasure (delete tenant data)
  • Tenant migration to dedicated instance
  • Faster restore times for single tenant issues
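To size these per-schema backups (and sanity-check the estimates in section 2.1), the current on-disk footprint per schema can be queried; a sketch:

```sql
-- Approximate on-disk size per schema (indexes and TOAST included); the
-- logical dump will be smaller, but this gives a useful upper bound for
-- backup size and duration.
SELECT schemaname,
       pg_size_pretty(SUM(pg_total_relation_size(
           quote_ident(schemaname) || '.' || quote_ident(tablename)
       ))) AS total_size
FROM pg_tables
WHERE schemaname IN ('auth', 'core', 'financial', 'inventory',
                     'purchase', 'sales', 'analytics', 'projects', 'system')
GROUP BY schemaname
ORDER BY schemaname;
```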

Backup Structure:

/backups/postgres/
├── full_20251124_020000.dump              # All schemas
├── auth_20251124_020000.dump              # Auth schema only
├── core_20251124_020000.dump              # Core schema only
├── financial_20251124_020000.dump         # Financial schema only
├── inventory_20251124_020000.dump         # Inventory schema only
├── purchase_20251124_020000.dump          # Purchase schema only
├── sales_20251124_020000.dump             # Sales schema only
├── analytics_20251124_020000.dump         # Analytics schema only
├── projects_20251124_020000.dump          # Projects schema only
└── system_20251124_020000.dump            # System schema only

4.2 Restore Single Tenant

#!/bin/bash
# Restore single tenant (schema isolation)

TENANT_ID="tenant-abc"
SCHEMA_NAME="financial"  # Example: restore financial schema only
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"

echo "Restoring schema $SCHEMA_NAME for tenant $TENANT_ID..."

# Option 1: Drop and restore entire schema
# WARNING: schemas here are shared modules, so this replaces the data of ALL
# tenants in the schema, not just $TENANT_ID's
psql -h postgres -U erp_user -d erp_generic -c "DROP SCHEMA IF EXISTS $SCHEMA_NAME CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA_NAME "$BACKUP_FILE"

# Option 2: Restore into a scratch database, then copy only this tenant's rows
# (pg_restore cannot rename a schema on restore, hence the scratch database)
createdb -h postgres -U erp_user erp_restore_scratch
pg_restore -h postgres -U erp_user -d erp_restore_scratch -n $SCHEMA_NAME "$BACKUP_FILE"

# Copy tenant-specific data back into production (client-side \copy)
psql -h postgres -U erp_user -d erp_restore_scratch \
    -c "\copy (SELECT * FROM $SCHEMA_NAME.accounts WHERE tenant_id = '$TENANT_ID') TO '/tmp/tenant_accounts.csv' CSV"
psql -h postgres -U erp_user -d erp_generic \
    -c "\copy $SCHEMA_NAME.accounts FROM '/tmp/tenant_accounts.csv' CSV"

# Drop the scratch database
dropdb -h postgres -U erp_user erp_restore_scratch

echo "✓ Tenant $TENANT_ID restored from $SCHEMA_NAME schema"

5. RETENTION POLICY

5.1 Retention Rules

Local Storage (Fast Recovery):

  • Full Backups: 7 days
  • Incremental Backups: 7 days
  • WAL Archives: 7 days
  • Config Backups: 30 days

Cloud Storage (Long-term Retention):

  • Daily Backups: 30 days (Standard storage)
  • Weekly Backups: 12 weeks (Standard-IA storage)
  • Monthly Backups: 12 months (Glacier storage)
  • Annual Backups: 7 years (Glacier Deep Archive) - Compliance

Lifecycle Policy (AWS S3 Example):

{
  "Rules": [
    {
      "Id": "TransitionToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        }
      ]
    },
    {
      "Id": "TransitionToGlacier",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "Id": "DeleteOldBackups",
      "Status": "Enabled",
      "Expiration": {
        "Days": 365
      },
      "Filter": {
        "Prefix": "postgres/daily/"
      }
    }
  ]
}
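Lifecycle rules only transition or expire objects within a prefix; populating the weekly and monthly tiers described in 5.1 from the daily uploads needs a small promotion job. A sketch — the prefix layout and tier names are assumptions, not part of this repo:

```shell
# Decide which retention tiers a backup taken on a given date belongs to,
# matching the 5.1 policy: Sundays also feed the weekly tier, the 1st of
# the month also feeds the monthly tier.
tiers_for_date() {
    local d="$1"
    local tiers="daily"
    [ "$(date -d "$d" +%u)" -eq 7 ] && tiers="$tiers weekly"
    [ "$(date -d "$d" +%d)" = "01" ] && tiers="$tiers monthly"
    echo "$tiers"
}

# A nightly job could then copy the newest dump into each extra tier, e.g.:
#   for tier in $(tiers_for_date "$(date +%F)"); do
#       aws s3 cp "s3://${AWS_S3_BUCKET}/postgres/daily/${FILE}" \
#                 "s3://${AWS_S3_BUCKET}/postgres/${tier}/${FILE}"
#   done
```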

5.2 Compliance Requirements

GDPR (EU):

  • Right to erasure: Ability to delete tenant data
  • Backup encryption: AES-256
  • Access logging: Who accessed what backup

SOC 2:

  • Backup testing: Monthly restore drills
  • Access controls: Role-based access to backups
  • Audit trail: Log all backup/restore operations

Industry-Specific:

  • Healthcare (HIPAA): 6 years retention
  • Financial (SOX): 7 years retention
  • Government: Variable (check local regulations)

6. RECOVERY PROCEDURES

6.1 Full Database Restore

File: scripts/restore-postgres.sh

#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Restore Script
# Restores database from backup file
# =====================================================

set -euo pipefail

# Usage
if [ $# -lt 1 ]; then
    echo "Usage: $0 <backup_file> [--target=<database>] [--no-prompt]"
    echo "Example: $0 /backups/postgres/full_20251124_020000.dump --target=erp_generic_staging"
    exit 1
fi

BACKUP_FILE="$1"
TARGET_DB="erp_generic"
NO_PROMPT=false

# Parse optional flags (everything after the backup file)
for arg in "${@:2}"; do
    case $arg in
        --target=*)
            TARGET_DB="${arg#*=}"
            ;;
        --no-prompt)
            NO_PROMPT=true
            ;;
    esac
done

DB_HOST="${POSTGRES_HOST:-postgres}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

echo "===== PostgreSQL Restore Started at $(date) ====="
echo "Backup file: $BACKUP_FILE"
echo "Target database: $TARGET_DB"

# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
    echo "✗ Backup file not found: $BACKUP_FILE"
    exit 1
fi

# Verify checksum if exists
if [ -f "${BACKUP_FILE}.md5" ]; then
    echo "Verifying backup integrity..."
    if ! md5sum -c "${BACKUP_FILE}.md5"; then
        echo "✗ Backup integrity check failed!"
        exit 1
    fi
    echo "✓ Backup integrity verified"
fi

# Safety prompt (unless --no-prompt)
if [ "$NO_PROMPT" = false ]; then
    echo ""
    echo "⚠️  WARNING: This will OVERWRITE all data in database '$TARGET_DB'"
    echo "⚠️  Make sure you have a recent backup before proceeding!"
    echo ""
    read -p "Are you sure you want to continue? (yes/no): " -r
    if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
        echo "Restore cancelled by user."
        exit 0
    fi
fi

# Create safety backup of target database (if not staging/dev)
if [[ ! "$TARGET_DB" =~ (staging|dev|test) ]]; then
    echo "Creating safety backup of $TARGET_DB before restore..."
    SAFETY_BACKUP="/backups/postgres/safety_${TARGET_DB}_$(date +%Y%m%d_%H%M%S).dump"
    pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$TARGET_DB" -f "$SAFETY_BACKUP"
    echo "✓ Safety backup created: $SAFETY_BACKUP"
fi

# Terminate active connections to target database
echo "Terminating active connections to $TARGET_DB..."
psql -h "$DB_HOST" -U "$DB_USER" -d postgres <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$TARGET_DB' AND pid <> pg_backend_pid();
SQL

# Restore database
echo "Restoring database from $BACKUP_FILE..."
if pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" --clean --if-exists --verbose "$BACKUP_FILE"; then
    echo "✓ Database restored successfully"
else
    echo "✗ Restore failed!"
    exit 1
fi

# Verify restore
echo "Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
USER_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM auth.users;" 2>/dev/null || echo "N/A")

echo "Tables restored: $TABLE_COUNT"
echo "Users in auth.users: $USER_COUNT"

# Rebuild statistics
echo "Rebuilding database statistics..."
psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -c "ANALYZE;"
echo "✓ Statistics rebuilt"

echo "===== PostgreSQL Restore Completed at $(date) ====="
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run smoke tests: npm run test:smoke"
echo "3. Check logs for errors"
echo "4. Notify team that restore is complete"

exit 0

6.2 Schema-Level Restore

# Restore single schema
SCHEMA="financial"
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"

pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA --clean --if-exists "$BACKUP_FILE"

6.3 Table-Level Restore

# Restore single table (pg_restore's -t takes the bare table name; pass the
# schema with -n). Note: --data-only APPENDS rows to the existing table, so
# consider restoring into a scratch database first if duplicates are possible.
SCHEMA="auth"
TABLE="users"
BACKUP_FILE="/backups/postgres/full_20251124_020000.dump"

pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA -t $TABLE --data-only "$BACKUP_FILE"

7. POINT-IN-TIME RECOVERY (PITR)

7.1 PITR Process

Use Case: Restore database to specific point in time (e.g., before accidental DELETE query)

Requirements:

  • Physical base backup taken with pg_basebackup (a pg_dump archive cannot be used for PITR)
  • WAL archives covering the span from the base backup to the target time

Steps:

#!/bin/bash
# Point-in-Time Recovery (PITR) - PostgreSQL 16
# Requires a PHYSICAL base backup (pg_basebackup) plus the WAL archive
# covering the span from that backup to the target time.

TARGET_TIME="2025-11-24 14:30:00"
BASE_BACKUP_DIR="/backups/basebackup/latest"   # taken earlier with pg_basebackup
WAL_ARCHIVE_DIR="/backups/wal"

echo "===== Point-in-Time Recovery to $TARGET_TIME ====="

# 1. Stop PostgreSQL
docker-compose stop postgres

# 2. Move current data aside (safety)
mv /data/postgres /data/postgres_backup_$(date +%Y%m%d_%H%M%S)

# 3. Restore the base backup into the data directory
mkdir -p /data/postgres
cp -a "$BASE_BACKUP_DIR"/. /data/postgres/

# 4. Configure recovery (recovery.conf was removed in PostgreSQL 12;
#    use postgresql.conf settings plus a recovery.signal file)
cat >> /data/postgres/postgresql.conf <<EOF
restore_command = 'cp ${WAL_ARCHIVE_DIR}/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch /data/postgres/recovery.signal

# 5. Start PostgreSQL (enters targeted recovery, then promotes)
docker-compose start postgres

# 6. Monitor recovery progress (Ctrl+C once recovery completes)
tail -f /var/log/postgresql/postgresql-*.log | grep -i recovery

echo "Recovery complete! Database restored to $TARGET_TIME"

7.2 PITR Best Practices

  • Test PITR quarterly in non-production environment
  • Document recovery times (usually 30 min - 2 hours depending on WAL size)
  • Automate WAL archiving to S3/cloud storage
  • Monitor WAL archive lag (should be <5 minutes)
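The archive-lag guideline above can be checked from pg_stat_archiver; a monitoring sketch, with hosts and the 300-second threshold as illustrative defaults:

```shell
# Seconds since the last WAL segment was archived; a growing failed_count in
# pg_stat_archiver is another sign of archive trouble.
wal_archive_lag_seconds() {
    psql -h "${POSTGRES_HOST:-postgres}" -U "${POSTGRES_USER:-erp_user}" \
         -d "${POSTGRES_DB:-erp_generic}" -tAc \
         "SELECT COALESCE(EXTRACT(EPOCH FROM now() - last_archived_time), 1e9)::int
          FROM pg_stat_archiver;"
}

# Compare a lag value (seconds) against the <5 minute guideline
check_wal_lag() {
    local lag="$1"
    local threshold="${2:-300}"
    if [ "$lag" -gt "$threshold" ]; then
        echo "ALERT: WAL archive lag ${lag}s exceeds ${threshold}s"
        return 1
    fi
    echo "OK: WAL archive lag ${lag}s"
}

# Typical use: check_wal_lag "$(wal_archive_lag_seconds)"
```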

8. DISASTER RECOVERY PLAYBOOK

8.1 Disaster Scenarios

Scenario 1: Complete Data Center Failure

Symptoms:

  • All servers unreachable
  • Network connectivity lost
  • Physical infrastructure damaged

Recovery Steps:

# 1. Provision new infrastructure (AWS, Azure, GCP)
terraform apply -var-file=disaster-recovery.tfvars

# 2. Download latest backup from cloud storage
aws s3 sync s3://erp-generic-backups-prod/postgres/ /backups/postgres/ --exclude "*" --include "*full_*"

# 3. Deploy Docker containers
docker-compose -f docker-compose.prod.yml up -d postgres redis

# 4. Restore database (dumps are stored under per-timestamp prefixes, so
#    search recursively; the timestamp in the filename sorts chronologically)
LATEST_BACKUP=$(find /backups/postgres -type f -name "full_*.dump" | sort | tail -1)
./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt

# 5. Restore WAL archives for PITR
aws s3 sync s3://erp-generic-backups-prod/wal/ /backups/wal/

# 6. Deploy application
docker-compose -f docker-compose.prod.yml up -d backend frontend nginx

# 7. Update DNS (point to new infrastructure)
# Manual step or use Route53/CloudFlare API

# 8. Verify functionality
./scripts/health-check.sh
npm run test:smoke

# 9. Notify stakeholders
# Send email/Slack notification

Estimated RTO: 4 hours | Estimated RPO: 15 minutes (last WAL archive)


Scenario 2: Accidental Data Deletion

Symptoms:

  • Critical data deleted by user error
  • "DELETE FROM sales.orders WHERE..." without WHERE clause

Recovery Steps:

# 1. Identify deletion time from the audit log
psql -h postgres -U erp_user -d erp_generic -c "SELECT * FROM system.audit_logs WHERE event = 'DELETE' AND table_name = 'sales.orders' ORDER BY created_at DESC LIMIT 10;"

# 2. Use PITR to restore to before deletion
./scripts/pitr-restore.sh --target-time="2025-11-24 14:25:00"

# 3. Export deleted records from the restored copy (client-side \copy, so the
#    CSV lands on the machine running psql, not on the database server)
psql -h restored-db -U erp_user -d erp_generic -c "\copy (SELECT * FROM sales.orders WHERE created_at >= '2025-11-24 14:00:00') TO '/tmp/recovered_orders.csv' CSV HEADER"

# 4. Import recovered records into production
psql -h postgres -U erp_user -d erp_generic -c "\copy sales.orders FROM '/tmp/recovered_orders.csv' CSV HEADER"

echo "✓ Deleted records recovered"

Estimated RTO: 1 hour | Estimated RPO: 0 (no data loss if caught quickly)


Scenario 3: Database Corruption

Symptoms:

  • PostgreSQL fails to start
  • "corrupt page" errors in logs
  • Data inconsistencies

Recovery Steps:

# 1. Assess the extent of the corruption first (read-only check; pg_amcheck
#    ships with PostgreSQL 14+)
docker-compose exec postgres pg_amcheck -U erp_user -d erp_generic --install-missing

# 2. LAST RESORT only: pg_resetwal discards WAL and can itself cause data
#    loss; run it only with the server stopped and when no usable backup exists
docker-compose exec postgres pg_resetwal /var/lib/postgresql/data

# 3. Preferred path: restore from backup
./scripts/restore-postgres.sh /backups/postgres/full_20251124_020000.dump

# 4. Run VACUUM and ANALYZE
psql -h postgres -U erp_user -d erp_generic -c "VACUUM FULL; ANALYZE;"

# 5. Rebuild indexes
psql -h postgres -U erp_user -d erp_generic -c "REINDEX DATABASE erp_generic;"

9. BACKUP TESTING

9.1 Monthly Restore Test

File: scripts/test-restore.sh

#!/bin/bash
# Monthly backup restore test

BACKUP_FILE=$(ls -t /backups/postgres/full_*.dump | head -1)
TEST_DB="erp_generic_restore_test"

echo "===== Backup Restore Test ====="
echo "Backup: $BACKUP_FILE"

# 1. Drop test database if exists
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"

# 2. Create test database
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE $TEST_DB;"

# 3. Restore backup
pg_restore -h postgres -U erp_user -d $TEST_DB --clean --if-exists "$BACKUP_FILE"

# 4. Run smoke tests
psql -h postgres -U erp_user -d $TEST_DB <<SQL
-- Verify table counts
SELECT COUNT(*) AS user_count FROM auth.users;
SELECT COUNT(*) AS partner_count FROM core.partners;
SELECT COUNT(*) AS order_count FROM sales.orders;

-- Verify data integrity
SELECT 'OK' AS status WHERE (
    SELECT COUNT(*) FROM auth.users
) > 0;
SQL

# 5. Cleanup
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE $TEST_DB;"

echo "✓ Backup restore test passed"

9.2 Quarterly DR Drill

Checklist:

  • Provision new infrastructure (staging environment)
  • Restore from cloud backup (S3)
  • Verify all 9 schemas restored
  • Run full test suite (unit + integration + E2E)
  • Measure RTO (actual time to restore)
  • Measure RPO (data loss amount)
  • Document findings and improvements
  • Update DR playbook

Success Criteria:

  • RTO < 4 hours
  • RPO < 15 minutes
  • All tests passing
  • Zero critical data loss
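Measuring the actual RTO during a drill is easiest with a small timing wrapper around the restore step; a sketch:

```shell
# Time any command and report the elapsed wall-clock time as the drill's RTO.
measure_rto() {
    local start end elapsed
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "RTO: $(( elapsed / 60 )) min $(( elapsed % 60 )) s"
}

# During a drill:
#   measure_rto ./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
```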

10. REFERENCES

Internal Documentation:

External Resources:


Document: BACKUP-RECOVERY.md | Version: 1.0 | Last updated: 2025-11-24