# BACKUP & RECOVERY - ERP Generic
**Last updated:** 2025-11-24
**Owner:** DevOps Team / DBA Team
**Status:** ✅ Production-Ready
---
## TABLE OF CONTENTS
1. [Overview](#1-overview)
2. [Backup Strategy](#2-backup-strategy)
3. [Backup Scripts](#3-backup-scripts)
4. [Multi-Tenant Backup Isolation](#4-multi-tenant-backup-isolation)
5. [Retention Policy](#5-retention-policy)
6. [Recovery Procedures](#6-recovery-procedures)
7. [Point-in-Time Recovery (PITR)](#7-point-in-time-recovery-pitr)
8. [Disaster Recovery Playbook](#8-disaster-recovery-playbook)
9. [Backup Testing](#9-backup-testing)
10. [References](#10-references)
---
## 1. OVERVIEW
### 1.1 Backup Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ PostgreSQL 16 Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ auth │ │ core │ │ financial│ │ inventory│ │
│ │ schema │ │ schema │ │ schema │ │ schema │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ purchase │ │ sales │ │ analytics│ │ projects │ │
│ │ schema │ │ schema │ │ schema │ │ schema │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │
│ │ system │ │
│ │ schema │ │
│ └────┬─────┘ │
└───────┼───────────────────────────────────────────────────┘
↓ (Automated backup every 4 hours)
┌─────────────────────────────────────────────────────────────┐
│ Local Backup Storage (/backups) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Full Backup │ │ Incremental │ │ Per-Schema │ │
│ │ (Daily) │ │ (4 hours) │ │ Backups │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼──────────────────┼──────────────────┼─────────────┘
│ │ │
│ (Sync every hour)│ │
↓ ↓ ↓
┌─────────────────────────────────────────────────────────────┐
│ Cloud Storage (S3 / Azure Blob / GCS) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Versioning Enabled | Lifecycle Rules | Encryption │ │
│ │ Retention: 7d + 4w + 12m │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ WAL (Write-Ahead Logs) Archive │
│ Continuous archiving for Point-in-Time Recovery (PITR) │
│ Retention: 7 days │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Backup Objectives
**RTO (Recovery Time Objective):** 4 hours
- Maximum time to restore service after failure
**RPO (Recovery Point Objective):** 15 minutes
- Maximum data loss acceptable (via WAL archiving)
**Data Durability:** 99.999999999% (11 nines)
- Multi-region cloud storage replication
**Backup Types:**
1. **Full Backup:** Complete database snapshot (daily at 2:00 AM)
2. **Incremental Backup:** Changes since last full backup (every 4 hours)
3. **WAL Archive:** Continuous archiving for PITR (every 60 seconds)
4. **Per-Schema Backup:** Individual schema backups for multi-tenant isolation (daily)
---
## 2. BACKUP STRATEGY
### 2.1 Backup Schedule
| Backup Type | Frequency | Retention | Size (Est.) | Duration |
|-------------|-----------|-----------|-------------|----------|
| **Full Backup** | Daily (2:00 AM) | 7 days local + 12 months cloud | 50-100 GB | 30-45 min |
| **Incremental Backup** | Every 4 hours | 7 days local | 5-10 GB | 5-10 min |
| **WAL Archive** | Continuous (every 16 MB) | 7 days | 100-200 GB/week | Real-time |
| **Per-Schema Backup** | Daily (3:00 AM) | 7 days local + 4 weeks cloud | 5-15 GB each | 5-10 min each |
| **Config Backup** | On change + daily | 30 days | <100 MB | <1 min |
### 2.2 Backup Storage Locations
**Primary Storage (Local):**
- **Path:** `/backups/postgres/`
- **Filesystem:** XFS (optimized for large files)
- **Capacity:** 500 GB minimum
- **RAID:** RAID 10 (performance + redundancy)
**Secondary Storage (Cloud):**
- **Provider:** AWS S3 / Azure Blob Storage / Google Cloud Storage
- **Bucket:** `erp-generic-backups-prod`
- **Region:** Multi-region replication (e.g., us-east-1 + us-west-2)
- **Encryption:** AES-256 at rest
- **Versioning:** Enabled (noncurrent versions pruned via lifecycle rules)
- **Lifecycle Rules:**
- Move to Glacier after 30 days
- Delete after 1 year (except annual backups)
**Tertiary Storage (Offsite):**
- **Type:** Tape backup / Cold storage
- **Frequency:** Monthly
- **Retention:** 7 years (compliance requirement)
### 2.3 Backup Validation
**Automated Validation (Daily):**
```bash
# 1. Verify backup file integrity
md5sum /backups/postgres/full_20251124_020000.dump > /backups/postgres/full_20251124_020000.dump.md5
# 2. Test restore to staging environment (weekly)
pg_restore --dbname=erp_generic_staging --clean --if-exists /backups/postgres/full_20251124_020000.dump
# 3. Run smoke tests on restored database
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM auth.users;"
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM core.partners;"
# 4. Compare approximate row counts with production (n_live_tup is an estimate — use as a sanity check only)
diff <(psql -d erp_generic -tAc "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1, 2") \
     <(psql -d erp_generic_staging -tAc "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1, 2")
```
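The checksum round-trip in steps 1 and 9 can be exercised locally before wiring it into cron. A minimal sketch, using a throwaway file in place of a real dump:

```shell
#!/bin/bash
# Demonstrates the md5 write-then-verify cycle used for backup validation.
set -euo pipefail

WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT

# Stand-in for full_YYYYMMDD_HHMMSS.dump
DUMP="$WORKDIR/full_20251124_020000.dump"
printf 'fake dump payload' > "$DUMP"

# Write the checksum next to the dump, exactly as the backup script does
md5sum "$DUMP" > "${DUMP}.md5"

# Later (or after downloading from S3): verify before restoring
if md5sum -c "${DUMP}.md5" >/dev/null 2>&1; then
  echo "integrity OK"
else
  echo "integrity FAILED" >&2
  exit 1
fi
```

Because `md5sum` records the absolute path, `md5sum -c` works from any working directory as long as the dump has not moved.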
**Manual Validation (Monthly):**
- Full disaster recovery drill
- Restore to isolated environment
- Verify business-critical data
- Test application functionality
- Document findings in post-mortem
---
## 3. BACKUP SCRIPTS
### 3.1 Full Backup Script
**File:** `scripts/backup-postgres.sh`
```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Full Backup Script
# Performs full database backup with multi-schema support
# =====================================================
set -euo pipefail
# Configuration
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
RETENTION_DAYS=7
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
# Logging
LOG_FILE="/var/log/erp-generic/backup.log"
exec > >(tee -a "$LOG_FILE")
exec 2>&1
echo "===== PostgreSQL Backup Started at $(date) ====="
# Create backup directory if not exists
mkdir -p "$BACKUP_DIR"
# 1. Full Database Backup
echo "1. Creating full database backup..."
FULL_BACKUP_FILE="${BACKUP_DIR}/full_${TIMESTAMP}.dump"
# Use the command itself as the condition: under `set -e`, a separate `$?`
# check would never run after a failure
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -v -d "$DB_NAME" -f "$FULL_BACKUP_FILE"; then
  echo "✓ Full backup created: $FULL_BACKUP_FILE"
  FILE_SIZE=$(du -h "$FULL_BACKUP_FILE" | cut -f1)
  echo "  Size: $FILE_SIZE"
else
  echo "✗ Full backup failed!"
  exit 1
fi
# 2. Generate MD5 checksum
echo "2. Generating checksum..."
md5sum "$FULL_BACKUP_FILE" > "${FULL_BACKUP_FILE}.md5"
echo "✓ Checksum saved: ${FULL_BACKUP_FILE}.md5"
# 3. Per-Schema Backups (Multi-Tenant Isolation)
echo "3. Creating per-schema backups..."
SCHEMAS=("auth" "core" "financial" "inventory" "purchase" "sales" "analytics" "projects" "system")
for schema in "${SCHEMAS[@]}"; do
SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
echo " Backing up schema: $schema"
# if-condition form keeps the loop going on a single-schema failure even with `set -e`
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -n "$schema" -d "$DB_NAME" -f "$SCHEMA_BACKUP_FILE"; then
  SCHEMA_SIZE=$(du -h "$SCHEMA_BACKUP_FILE" | cut -f1)
  echo "  ✓ $schema backup created ($SCHEMA_SIZE)"
else
  echo "  ✗ $schema backup failed!"
fi
done
# 4. Backup Database Roles and Permissions
echo "4. Backing up database roles..."
ROLES_BACKUP_FILE="${BACKUP_DIR}/roles_${TIMESTAMP}.sql"
pg_dumpall -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" --roles-only -f "$ROLES_BACKUP_FILE"
echo "✓ Roles backup created: $ROLES_BACKUP_FILE"
# 5. Backup PostgreSQL Configuration
echo "5. Backing up PostgreSQL configuration..."
CONFIG_BACKUP_DIR="${BACKUP_DIR}/config_${TIMESTAMP}"
mkdir -p "$CONFIG_BACKUP_DIR"
# Copy config files (if accessible)
if [ -f /etc/postgresql/16/main/postgresql.conf ]; then
cp /etc/postgresql/16/main/postgresql.conf "$CONFIG_BACKUP_DIR/"
cp /etc/postgresql/16/main/pg_hba.conf "$CONFIG_BACKUP_DIR/"
echo "✓ Config files backed up"
fi
# 6. Backup WAL Archive Status
echo "6. Recording WAL archive status..."
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -tAc "SELECT * FROM pg_stat_archiver;" > "${BACKUP_DIR}/wal_status_${TIMESTAMP}.txt"
# 7. Upload to Cloud Storage (S3)
if [ -n "${AWS_S3_BUCKET:-}" ]; then
echo "7. Uploading to S3..."
aws s3 cp "$FULL_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/" --storage-class STANDARD_IA
aws s3 cp "${FULL_BACKUP_FILE}.md5" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/"
# Upload per-schema backups
for schema in "${SCHEMAS[@]}"; do
SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
if [ -f "$SCHEMA_BACKUP_FILE" ]; then
aws s3 cp "$SCHEMA_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/schemas/" --storage-class STANDARD_IA
fi
done
echo "✓ Backup uploaded to S3"
fi
# 8. Cleanup Old Backups (Local)
echo "8. Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.sql" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.md5" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type d -name "config_*" -mtime +$RETENTION_DAYS -exec rm -rf {} + 2>/dev/null || true
echo "✓ Old backups cleaned up"
# 9. Verify Backup Integrity
echo "9. Verifying backup integrity..."
if md5sum -c "${FULL_BACKUP_FILE}.md5"; then
  echo "✓ Backup integrity verified"
else
  echo "✗ Backup integrity check failed!"
  exit 1
fi
# 10. Send Notification
echo "10. Sending backup notification..."
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)
# Slack notification (optional)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"✅ PostgreSQL backup completed successfully\n• Database: $DB_NAME\n• Size: $FILE_SIZE\n• Timestamp: $TIMESTAMP\n• Total backup dir size: $BACKUP_SIZE\"}"
fi
echo "===== PostgreSQL Backup Completed at $(date) ====="
echo "Total backup size: $BACKUP_SIZE"
echo "Backup location: $BACKUP_DIR"
# Exit successfully
exit 0
```
### 3.2 Incremental Backup Script
**File:** `scripts/backup-postgres-incremental.sh`
```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL "Incremental" Backup Script
# NOTE: plain pg_dump cannot produce true incrementals; this takes a
# lightweight full logical dump between daily fulls (WAL archiving
# provides the real incremental/PITR capability)
# =====================================================
set -euo pipefail
BACKUP_DIR="/backups/postgres/incremental"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
mkdir -p "$BACKUP_DIR"
echo "===== Incremental Backup Started at $(date) ====="
# Get last full backup timestamp (informational; tolerate a missing backup under pipefail)
LAST_FULL_BACKUP=$(ls -t /backups/postgres/full_*.dump 2>/dev/null | head -1 | grep -oP '\d{8}_\d{6}' || echo "none")
echo "Last full backup: $LAST_FULL_BACKUP"
# Incremental backup: Export only changed data (simplified approach)
# In production, consider using WAL-based incremental backups or pg_basebackup
INCREMENTAL_FILE="${BACKUP_DIR}/incremental_${TIMESTAMP}.dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$DB_NAME" -f "$INCREMENTAL_FILE"
echo "✓ Incremental backup created: $INCREMENTAL_FILE"
echo "===== Incremental Backup Completed at $(date) ====="
```
### 3.3 WAL Archiving Configuration
**File:** `postgresql.conf` (WAL archiving section)
```ini
# WAL Settings for Point-in-Time Recovery
wal_level = replica
archive_mode = on
# archive_command runs in a shell spawned by the server: AWS_S3_BUCKET must be set
# in the server's environment, and a failed S3 upload fails the whole command
# (the WAL segment is then retried, so archiving stalls while S3 is unreachable)
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f && aws s3 cp /backups/wal/%f s3://${AWS_S3_BUCKET}/wal/'
archive_timeout = 60s
max_wal_senders = 3
wal_keep_size = 1GB
```
### 3.4 Cron Schedule
**File:** `/etc/cron.d/erp-backup`
```cron
# ERP Generic Backup Schedule
# Full backup daily at 2:00 AM
0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh >> /var/log/erp-generic/backup.log 2>&1
# Incremental backup every 4 hours
0 */4 * * * root /opt/erp-generic/scripts/backup-postgres-incremental.sh >> /var/log/erp-generic/backup.log 2>&1
# Verify backups daily at 4:00 AM
0 4 * * * root /opt/erp-generic/scripts/verify-backup.sh >> /var/log/erp-generic/backup.log 2>&1
# Cleanup old WAL files daily at 5:00 AM
0 5 * * * root find /backups/wal -type f -mtime +7 -delete
# Weekly full disaster recovery test (Sundays at 3:00 AM)
0 3 * * 0 root /opt/erp-generic/scripts/test-restore.sh >> /var/log/erp-generic/backup-test.log 2>&1
```
---
## 4. MULTI-TENANT BACKUP ISOLATION
### 4.1 Per-Tenant Backup Strategy
**Why Per-Schema Backups?**
- Restore individual tenant without affecting others
- Compliance: GDPR right to erasure (delete tenant data)
- Tenant migration to dedicated instance
- Faster restore times for single tenant issues
**Backup Structure:**
```
/backups/postgres/
├── full_20251124_020000.dump # All schemas
├── auth_20251124_020000.dump # Auth schema only
├── core_20251124_020000.dump # Core schema only
├── financial_20251124_020000.dump # Financial schema only
├── inventory_20251124_020000.dump # Inventory schema only
├── purchase_20251124_020000.dump # Purchase schema only
├── sales_20251124_020000.dump # Sales schema only
├── analytics_20251124_020000.dump # Analytics schema only
├── projects_20251124_020000.dump # Projects schema only
└── system_20251124_020000.dump # System schema only
```
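Given this layout, restore tooling usually needs the newest dump for one schema. A small sketch of that selection, demonstrated on temporary files so it runs anywhere:

```shell
#!/bin/bash
# Pick the newest dump for one schema from a directory laid out as above.
set -euo pipefail

BACKUP_DIR=$(mktemp -d)          # stand-in for /backups/postgres
trap 'rm -rf "$BACKUP_DIR"' EXIT

# Simulate two backup generations for the financial schema
touch -d '2 days ago' "$BACKUP_DIR/financial_20251122_020000.dump"
touch                 "$BACKUP_DIR/financial_20251124_020000.dump"

SCHEMA="financial"
# ls -t sorts by modification time, newest first
LATEST=$(ls -t "$BACKUP_DIR/${SCHEMA}_"*.dump | head -1)
echo "Newest $SCHEMA backup: $(basename "$LATEST")"
```

The same one-liner, pointed at `/backups/postgres`, feeds the schema-level restore commands in section 6.2.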
### 4.2 Restore Single Tenant
```bash
#!/bin/bash
# Restore single tenant (schema isolation)
TENANT_ID="tenant-abc"
SCHEMA_NAME="financial" # Example: restore financial schema only
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"
echo "Restoring schema $SCHEMA_NAME for tenant $TENANT_ID..."
# Option 1: Drop and restore the entire schema
# (WARNING: in a shared-schema setup this replaces data for ALL tenants in the schema)
psql -h postgres -U erp_user -d erp_generic -c "DROP SCHEMA IF EXISTS $SCHEMA_NAME CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA_NAME --clean --if-exists "$BACKUP_FILE"
# Option 2: Restore into a scratch database, then copy only this tenant's rows
# (pg_restore cannot remap schema names, so restore the schema as-is elsewhere)
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE erp_restore_scratch;"
pg_restore -h postgres -U erp_user -d erp_restore_scratch -n $SCHEMA_NAME "$BACKUP_FILE"
# Copy this tenant's rows across databases with COPY over a pipe
psql -h postgres -U erp_user -d erp_restore_scratch \
  -c "COPY (SELECT * FROM $SCHEMA_NAME.accounts WHERE tenant_id = '$TENANT_ID') TO STDOUT" \
  | psql -h postgres -U erp_user -d erp_generic -c "COPY $SCHEMA_NAME.accounts FROM STDIN"
# Drop the scratch database
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE erp_restore_scratch;"
echo "✓ Tenant $TENANT_ID restored from $SCHEMA_NAME schema"
```
---
## 5. RETENTION POLICY
### 5.1 Retention Rules
**Local Storage (Fast Recovery):**
- **Full Backups:** 7 days
- **Incremental Backups:** 7 days
- **WAL Archives:** 7 days
- **Config Backups:** 30 days
**Cloud Storage (Long-term Retention):**
- **Daily Backups:** 30 days (Standard storage)
- **Weekly Backups:** 12 weeks (Standard-IA storage)
- **Monthly Backups:** 12 months (Glacier storage)
- **Annual Backups:** 7 years (Glacier Deep Archive) - Compliance
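The local rules above are enforced with the same `find -mtime` expiry the backup script uses in its cleanup step. A sketch of how that expiry behaves, on synthetic files with back-dated timestamps:

```shell
#!/bin/bash
# Demonstrates the 7-day local expiry on synthetic files.
set -euo pipefail

DIR=$(mktemp -d)
trap 'rm -rf "$DIR"' EXIT

touch -d '10 days ago' "$DIR/full_old.dump"    # past retention -> deleted
touch -d '2 days ago'  "$DIR/full_recent.dump" # inside retention -> kept

RETENTION_DAYS=7
# -mtime +7 matches files strictly older than 7 whole days
find "$DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete

ls "$DIR"
```

Note that `-mtime +N` rounds down to whole 24-hour periods, so a file exactly 7 days old survives until the next day's run.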
**Lifecycle Policy (AWS S3 Example):**
```json
{
"Rules": [
{
"Id": "TransitionToIA",
"Status": "Enabled",
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
}
]
},
{
"Id": "TransitionToGlacier",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "GLACIER"
}
]
},
{
"Id": "DeleteOldBackups",
"Status": "Enabled",
"Expiration": {
"Days": 365
},
"Filter": {
"Prefix": "postgres/daily/"
}
}
]
}
```
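Applying the policy is a single `aws s3api put-bucket-lifecycle-configuration` call. A sketch that syntax-checks the JSON locally first and prints the apply command (the apply step is only echoed here, since it needs real AWS credentials; the bucket name is the one from section 2.2):

```shell
#!/bin/bash
# Validate a lifecycle policy locally, then show the command that applies it.
set -euo pipefail

POLICY_FILE=$(mktemp --suffix=.json)
trap 'rm -f "$POLICY_FILE"' EXIT

cat > "$POLICY_FILE" <<'EOF'
{"Rules":[{"Id":"TransitionToIA","Status":"Enabled",
"Filter":{"Prefix":"postgres/"},
"Transitions":[{"Days":30,"StorageClass":"STANDARD_IA"}]}]}
EOF

# Cheap syntax check before touching the bucket
python3 -m json.tool "$POLICY_FILE" >/dev/null && echo "policy JSON OK"

# Run this for real once credentials are configured:
echo aws s3api put-bucket-lifecycle-configuration \
  --bucket erp-generic-backups-prod \
  --lifecycle-configuration "file://$POLICY_FILE"
```

Validating first matters because S3 rejects the whole configuration (including rules that were previously in effect) when any single rule is malformed.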
### 5.2 Compliance Requirements
**GDPR (EU):**
- Right to erasure: Ability to delete tenant data
- Backup encryption: AES-256
- Access logging: Who accessed what backup
**SOC 2:**
- Backup testing: Monthly restore drills
- Access controls: Role-based access to backups
- Audit trail: Log all backup/restore operations
**Industry-Specific:**
- Healthcare (HIPAA): 6 years retention
- Financial (SOX): 7 years retention
- Government: Variable (check local regulations)
---
## 6. RECOVERY PROCEDURES
### 6.1 Full Database Restore
**File:** `scripts/restore-postgres.sh`
```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Restore Script
# Restores database from backup file
# =====================================================
set -euo pipefail
# Usage
if [ $# -lt 1 ]; then
echo "Usage: $0 <backup_file> [--target=<database>] [--no-prompt]"
echo "Example: $0 /backups/postgres/full_20251124_020000.dump --target=erp_generic_staging"
exit 1
fi
BACKUP_FILE="$1"
TARGET_DB="erp_generic"
NO_PROMPT=false
# Parse optional flags (start at $2 so a flag is never mistaken for the target DB)
for arg in "${@:2}"; do
  case $arg in
    --target=*)
      TARGET_DB="${arg#*=}"
      ;;
    --no-prompt)
      NO_PROMPT=true
      ;;
    *)
      echo "Unknown option: $arg"
      exit 1
      ;;
  esac
done
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
echo "===== PostgreSQL Restore Started at $(date) ====="
echo "Backup file: $BACKUP_FILE"
echo "Target database: $TARGET_DB"
# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
echo "✗ Backup file not found: $BACKUP_FILE"
exit 1
fi
# Verify checksum if exists
if [ -f "${BACKUP_FILE}.md5" ]; then
echo "Verifying backup integrity..."
if ! md5sum -c "${BACKUP_FILE}.md5"; then
  echo "✗ Backup integrity check failed!"
  exit 1
fi
echo "✓ Backup integrity verified"
fi
# Safety prompt (unless --no-prompt)
if [ "$NO_PROMPT" = false ]; then
echo ""
echo "⚠️ WARNING: This will OVERWRITE all data in database '$TARGET_DB'"
echo "⚠️ Make sure you have a recent backup before proceeding!"
echo ""
read -p "Are you sure you want to continue? (yes/no): " -r
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
echo "Restore cancelled by user."
exit 0
fi
fi
# Create safety backup of target database (if not staging/dev)
if [[ ! "$TARGET_DB" =~ (staging|dev|test) ]]; then
echo "Creating safety backup of $TARGET_DB before restore..."
SAFETY_BACKUP="/backups/postgres/safety_${TARGET_DB}_$(date +%Y%m%d_%H%M%S).dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$TARGET_DB" -f "$SAFETY_BACKUP"
echo "✓ Safety backup created: $SAFETY_BACKUP"
fi
# Terminate active connections to target database
echo "Terminating active connections to $TARGET_DB..."
psql -h "$DB_HOST" -U "$DB_USER" -d postgres <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$TARGET_DB' AND pid <> pg_backend_pid();
SQL
# Restore database
echo "Restoring database from $BACKUP_FILE..."
if pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" --clean --if-exists --verbose "$BACKUP_FILE"; then
  echo "✓ Database restored successfully"
else
  echo "✗ Restore failed!"
  exit 1
fi
# Verify restore
echo "Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
USER_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM auth.users;" 2>/dev/null || echo "N/A")
echo "Tables restored: $TABLE_COUNT"
echo "Users in auth.users: $USER_COUNT"
# Rebuild statistics
echo "Rebuilding database statistics..."
psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -c "ANALYZE;"
echo "✓ Statistics rebuilt"
echo "===== PostgreSQL Restore Completed at $(date) ====="
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run smoke tests: npm run test:smoke"
echo "3. Check logs for errors"
echo "4. Notify team that restore is complete"
exit 0
```
### 6.2 Schema-Level Restore
```bash
# Restore single schema
SCHEMA="financial"
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA --clean --if-exists "$BACKUP_FILE"
```
### 6.3 Table-Level Restore
```bash
# Restore single table (pg_restore's -t takes an unqualified table name,
# so pass the schema separately with -n)
SCHEMA="auth"
TABLE="users"
BACKUP_FILE="/backups/postgres/full_20251124_020000.dump"
# --data-only appends rows; TRUNCATE the table first if you need an exact replacement
pg_restore -h postgres -U erp_user -d erp_generic -n "$SCHEMA" -t "$TABLE" --data-only "$BACKUP_FILE"
```
---
## 7. POINT-IN-TIME RECOVERY (PITR)
### 7.1 PITR Process
**Use Case:** Restore database to specific point in time (e.g., before accidental DELETE query)
**Requirements:**
- A physical base backup taken with `pg_basebackup` (a logical `pg_dump` file cannot be replayed for PITR)
- WAL archives covering the span from that base backup to the target time
**Steps:**
```bash
#!/bin/bash
# Point-in-Time Recovery (PITR)
# Prerequisite: a PHYSICAL base backup taken earlier with pg_basebackup —
# a logical pg_dump file cannot be replayed against WAL archives.
set -euo pipefail
TARGET_TIME="2025-11-24 14:30:00"
BASE_BACKUP_DIR="/backups/basebackup/latest"   # plain-format pg_basebackup output (adjust to your path)
WAL_ARCHIVE_DIR="/backups/wal"
DATA_DIR="/data/postgres"
echo "===== Point-in-Time Recovery to $TARGET_TIME ====="
# 1. Stop PostgreSQL
docker-compose stop postgres
# 2. Set aside the current data directory (safety)
mv "$DATA_DIR" "${DATA_DIR}_backup_$(date +%Y%m%d_%H%M%S)"
# 3. Restore the base backup into a fresh data directory
mkdir -p "$DATA_DIR"
cp -a "$BASE_BACKUP_DIR"/. "$DATA_DIR"/
# 4. Configure recovery — PostgreSQL 12+ replaced recovery.conf with
#    recovery.signal plus ordinary configuration parameters
cat >> "$DATA_DIR/postgresql.auto.conf" <<EOF
restore_command = 'cp ${WAL_ARCHIVE_DIR}/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch "$DATA_DIR/recovery.signal"
# 5. Start PostgreSQL (enters recovery mode and replays WAL to the target time)
docker-compose start postgres
# 6. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep -i recovery
echo "Recovery complete! Database restored to $TARGET_TIME"
```
### 7.2 PITR Best Practices
- **Test PITR quarterly** in non-production environment
- **Document recovery times** (usually 30 min - 2 hours depending on WAL size)
- **Automate WAL archiving** to S3/cloud storage
- **Monitor WAL archive lag** (should be <5 minutes)
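The lag check in the last bullet is easy to script. A sketch that computes lag from a `last_archived_time` value — shown here with a fixed sample timestamp so it runs without a live server; in production, pull the value from `pg_stat_archiver` as in the commented query:

```shell
#!/bin/bash
# Compute WAL archive lag and compare against the 5-minute budget.
set -euo pipefail

# In production, fetch the timestamp from the server:
#   LAST_ARCHIVED=$(psql -h postgres -U erp_user -d erp_generic -tAc \
#     "SELECT last_archived_time FROM pg_stat_archiver;")
# Fixed sample here so the sketch runs anywhere:
LAST_ARCHIVED=$(date -d '-2 minutes' '+%Y-%m-%d %H:%M:%S')

NOW_EPOCH=$(date +%s)
ARCHIVED_EPOCH=$(date -d "$LAST_ARCHIVED" +%s)
LAG_SECONDS=$(( NOW_EPOCH - ARCHIVED_EPOCH ))

if [ "$LAG_SECONDS" -gt 300 ]; then
  echo "ALERT: WAL archive lag is ${LAG_SECONDS}s (budget: 300s)"
else
  echo "OK: WAL archive lag is ${LAG_SECONDS}s"
fi
```

Wired into cron or a Prometheus textfile exporter, this turns the "<5 minutes" bullet into an actual alert instead of a manual check.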
---
## 8. DISASTER RECOVERY PLAYBOOK
### 8.1 Disaster Scenarios
**Scenario 1: Complete Data Center Failure**
**Symptoms:**
- All servers unreachable
- Network connectivity lost
- Physical infrastructure damaged
**Recovery Steps:**
```bash
# 1. Provision new infrastructure (AWS, Azure, GCP)
terraform apply -var-file=disaster-recovery.tfvars
# 2. Download the latest full backups from cloud storage
# (backups live under postgres/<TIMESTAMP>/, so the include pattern needs a leading wildcard)
aws s3 sync s3://erp-generic-backups-prod/postgres/ /backups/postgres/ --exclude "*" --include "*full_*"
# 3. Deploy Docker containers
docker-compose -f docker-compose.prod.yml up -d postgres redis
# 4. Restore database
LATEST_BACKUP=$(ls -t /backups/postgres/full_*.dump | head -1)
./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
# 5. Restore WAL archives for PITR
aws s3 sync s3://erp-generic-backups-prod/wal/ /backups/wal/
# 6. Deploy application
docker-compose -f docker-compose.prod.yml up -d backend frontend nginx
# 7. Update DNS (point to new infrastructure)
# Manual step or use Route53/CloudFlare API
# 8. Verify functionality
./scripts/health-check.sh
npm run test:smoke
# 9. Notify stakeholders
# Send email/Slack notification
```
**Estimated RTO:** 4 hours
**Estimated RPO:** 15 minutes (last WAL archive)
---
**Scenario 2: Accidental Data Deletion**
**Symptoms:**
- Critical data deleted by user error
- Example: a `DELETE FROM sales.orders` run without a `WHERE` clause
**Recovery Steps:**
```bash
# 1. Identify deletion time from the audit trail
psql -h postgres -U erp_user -d erp_generic -c "SELECT * FROM system.audit_logs WHERE event = 'DELETE' AND table_name = 'sales.orders' ORDER BY created_at DESC LIMIT 10;"
# 2. Use PITR to restore to before deletion
./scripts/pitr-restore.sh --target-time="2025-11-24 14:25:00"
# 3. Export deleted records from restored database (client-side \COPY writes the file locally)
psql -h restored-db -U erp_user -d erp_generic -c "\COPY (SELECT * FROM sales.orders WHERE created_at >= '2025-11-24 14:00:00') TO '/tmp/recovered_orders.csv' CSV HEADER;"
# 4. Import recovered records to production
psql -h postgres -U erp_user -d erp_generic -c "\COPY sales.orders FROM '/tmp/recovered_orders.csv' CSV HEADER;"
echo "✓ Deleted records recovered"
```
**Estimated RTO:** 1 hour
**Estimated RPO:** 0 (no data loss if caught quickly)
---
**Scenario 3: Database Corruption**
**Symptoms:**
- PostgreSQL fails to start
- "corrupt page" errors in logs
- Data inconsistencies
**Recovery Steps:**
```bash
# 1. Assess the damage first (pg_amcheck ships with PostgreSQL 14+)
docker-compose exec postgres pg_amcheck -d erp_generic || true
# 2. Restore from the most recent known-good backup — this is the safe path.
#    pg_resetwal is NOT a repair tool: it discards WAL and can itself lose
#    committed data, so treat it strictly as a last resort when no backup is usable.
./scripts/restore-postgres.sh /backups/postgres/full_20251124_020000.dump
# 3. Run VACUUM and ANALYZE
psql -h postgres -U erp_user -d erp_generic -c "VACUUM FULL; ANALYZE;"
# 4. Rebuild indexes
psql -h postgres -U erp_user -d erp_generic -c "REINDEX DATABASE erp_generic;"
```
---
## 9. BACKUP TESTING
### 9.1 Monthly Restore Test
**File:** `scripts/test-restore.sh`
```bash
#!/bin/bash
# Monthly backup restore test
BACKUP_FILE=$(ls -t /backups/postgres/full_*.dump | head -1)
TEST_DB="erp_generic_restore_test"
echo "===== Backup Restore Test ====="
echo "Backup: $BACKUP_FILE"
# 1. Drop test database if exists
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"
# 2. Create test database
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE $TEST_DB;"
# 3. Restore backup
pg_restore -h postgres -U erp_user -d $TEST_DB --clean --if-exists "$BACKUP_FILE"
# 4. Run smoke tests
psql -h postgres -U erp_user -d $TEST_DB <<SQL
-- Verify table counts
SELECT COUNT(*) AS user_count FROM auth.users;
SELECT COUNT(*) AS partner_count FROM core.partners;
SELECT COUNT(*) AS order_count FROM sales.orders;
-- Verify data integrity
SELECT 'OK' AS status WHERE (
SELECT COUNT(*) FROM auth.users
) > 0;
SQL
# 5. Cleanup
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE $TEST_DB;"
echo "✓ Backup restore test passed"
```
### 9.2 Quarterly DR Drill
**Checklist:**
- [ ] Provision new infrastructure (staging environment)
- [ ] Restore from cloud backup (S3)
- [ ] Verify all 9 schemas restored
- [ ] Run full test suite (unit + integration + E2E)
- [ ] Measure RTO (actual time to restore)
- [ ] Measure RPO (data loss amount)
- [ ] Document findings and improvements
- [ ] Update DR playbook
**Success Criteria:**
- RTO < 4 hours
- RPO < 15 minutes
- All tests passing
- Zero critical data loss
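Measuring the actual RTO during the drill is just wall-clock timing around the restore steps. A sketch, timed around a placeholder command instead of the real restore:

```shell
#!/bin/bash
# Time the restore phase of a DR drill and compare against the 4-hour RTO budget.
set -euo pipefail

RTO_BUDGET_SECONDS=$(( 4 * 3600 ))

START=$(date +%s)
# Placeholder for the real drill steps, e.g.:
#   ./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
sleep 1
END=$(date +%s)

ELAPSED=$(( END - START ))
echo "Measured RTO: ${ELAPSED}s (budget: ${RTO_BUDGET_SECONDS}s)"

if [ "$ELAPSED" -le "$RTO_BUDGET_SECONDS" ]; then
  echo "RTO within budget"
else
  echo "RTO EXCEEDED" >&2
  exit 1
fi
```

Recording the measured value in each drill's findings gives the trend line that the quarterly checklist asks for.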
---
## 10. REFERENCES
**Internal Documentation:**
- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Monitoring & Observability](./MONITORING-OBSERVABILITY.md)
- [Database Schemas](../02-modelado/database-design/schemas/)
**External Resources:**
- [PostgreSQL Backup Documentation](https://www.postgresql.org/docs/16/backup.html)
- [PostgreSQL PITR](https://www.postgresql.org/docs/16/continuous-archiving.html)
- [AWS RDS Backup Best Practices](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html)
---
**Document:** BACKUP-RECOVERY.md
**Version:** 1.0
**Last Updated:** 2025-11-24