# BACKUP & RECOVERY - ERP Generic
**Last updated:** 2025-11-24
**Owner:** DevOps Team / DBA Team
**Status:** ✅ Production-Ready
---
## TABLE OF CONTENTS
1. [Overview](#1-overview)
2. [Backup Strategy](#2-backup-strategy)
3. [Backup Scripts](#3-backup-scripts)
4. [Multi-Tenant Backup Isolation](#4-multi-tenant-backup-isolation)
5. [Retention Policy](#5-retention-policy)
6. [Recovery Procedures](#6-recovery-procedures)
7. [Point-in-Time Recovery (PITR)](#7-point-in-time-recovery-pitr)
8. [Disaster Recovery Playbook](#8-disaster-recovery-playbook)
9. [Backup Testing](#9-backup-testing)
10. [References](#10-references)
---
## 1. OVERVIEW
### 1.1 Backup Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ PostgreSQL 16 Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ auth │ │ core │ │ financial│ │ inventory│ │
│ │ schema │ │ schema │ │ schema │ │ schema │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ purchase │ │ sales │ │ analytics│ │ projects │ │
│ │ schema │ │ schema │ │ schema │ │ schema │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │
│ │ system │ │
│ │ schema │ │
│ └────┬─────┘ │
└───────┼───────────────────────────────────────────────────┘
↓ (Automated backup every 4 hours)
┌─────────────────────────────────────────────────────────────┐
│ Local Backup Storage (/backups) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Full Backup │ │ Incremental │ │ Per-Schema │ │
│ │ (Daily) │ │ (4 hours) │ │ Backups │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼──────────────────┼──────────────────┼─────────────┘
│ │ │
│ (Sync every hour)│ │
↓ ↓ ↓
┌─────────────────────────────────────────────────────────────┐
│ Cloud Storage (S3 / Azure Blob / GCS) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Versioning Enabled | Lifecycle Rules | Encryption │ │
│ │ Retention: 7d + 4w + 12m │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ WAL (Write-Ahead Logs) Archive │
│ Continuous archiving for Point-in-Time Recovery (PITR) │
│ Retention: 7 days │
└─────────────────────────────────────────────────────────────┘
```
### 1.2 Backup Objectives
**RTO (Recovery Time Objective):** 4 hours
- Maximum time to restore service after failure
**RPO (Recovery Point Objective):** 15 minutes
- Maximum data loss acceptable (via WAL archiving)
**Data Durability:** 99.999999999% (11 nines)
- Multi-region cloud storage replication
**Backup Types:**
1. **Full Backup:** Complete database snapshot (daily at 2:00 AM)
2. **Incremental Backup:** Changes since last full backup (every 4 hours)
3. **WAL Archive:** Continuous archiving for PITR (every 60 seconds)
4. **Per-Schema Backup:** Individual schema backups for multi-tenant isolation (daily)
---
## 2. BACKUP STRATEGY
### 2.1 Backup Schedule
| Backup Type | Frequency | Retention | Size (Est.) | Duration |
|-------------|-----------|-----------|-------------|----------|
| **Full Backup** | Daily (2:00 AM) | 7 days local + 12 months cloud | 50-100 GB | 30-45 min |
| **Incremental Backup** | Every 4 hours | 7 days local | 5-10 GB | 5-10 min |
| **WAL Archive** | Continuous (every 16 MB) | 7 days | 100-200 GB/week | Real-time |
| **Per-Schema Backup** | Daily (3:00 AM) | 7 days local + 4 weeks cloud | 5-15 GB each | 5-10 min each |
| **Config Backup** | On change + daily | 30 days | <100 MB | <1 min |
### 2.2 Backup Storage Locations
**Primary Storage (Local):**
- **Path:** `/backups/postgres/`
- **Filesystem:** XFS (optimized for large files)
- **Capacity:** 500 GB minimum
- **RAID:** RAID 10 (performance + redundancy)
**Secondary Storage (Cloud):**
- **Provider:** AWS S3 / Azure Blob Storage / Google Cloud Storage
- **Bucket:** `erp-generic-backups-prod`
- **Region:** Multi-region replication (e.g., us-east-1 + us-west-2)
- **Encryption:** AES-256 at rest
- **Versioning:** Enabled (noncurrent versions pruned via lifecycle rules)
- **Lifecycle Rules:**
- Move to Glacier after 30 days
- Delete after 1 year (except annual backups)
**Tertiary Storage (Offsite):**
- **Type:** Tape backup / Cold storage
- **Frequency:** Monthly
- **Retention:** 7 years (compliance requirement)
### 2.3 Backup Validation
**Automated Validation (Daily):**
```bash
# 1. Verify backup file integrity
md5sum /backups/postgres/full_20251124_020000.dump > /backups/postgres/full_20251124_020000.dump.md5
# 2. Test restore to staging environment (weekly)
pg_restore --dbname=erp_generic_staging --clean --if-exists /backups/postgres/full_20251124_020000.dump
# 3. Run smoke tests on restored database
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM auth.users;"
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM core.partners;"
# 4. Compare approximate row counts with production (n_live_tup is an estimate — use as a sanity check only)
diff <(psql -d erp_generic -tAc "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1, 2") \
     <(psql -d erp_generic_staging -tAc "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1, 2")
```
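The checksum round-trip in steps 1 and 9 can be exercised locally before wiring it into cron. A minimal sketch, using a throwaway file in place of a real dump:

```shell
#!/bin/bash
# Demonstrates the md5 write-then-verify cycle used for backup validation.
set -euo pipefail

WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT

# Stand-in for full_YYYYMMDD_HHMMSS.dump
DUMP="$WORKDIR/full_20251124_020000.dump"
printf 'fake dump payload' > "$DUMP"

# Write the checksum next to the dump, exactly as the backup script does
md5sum "$DUMP" > "${DUMP}.md5"

# Later (or after downloading from S3): verify before restoring
if md5sum -c "${DUMP}.md5" >/dev/null 2>&1; then
  echo "integrity OK"
else
  echo "integrity FAILED" >&2
  exit 1
fi
```

Because `md5sum` records the absolute path, `md5sum -c` works from any working directory as long as the dump has not moved.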
**Manual Validation (Monthly):**
- Full disaster recovery drill
- Restore to isolated environment
- Verify business-critical data
- Test application functionality
- Document findings in post-mortem
---
## 3. BACKUP SCRIPTS
### 3.1 Full Backup Script
**File:** `scripts/backup-postgres.sh`
```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Full Backup Script
# Performs full database backup with multi-schema support
# =====================================================
set -euo pipefail
# Configuration
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
RETENTION_DAYS=7
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
# Logging
LOG_FILE="/var/log/erp-generic/backup.log"
exec > >(tee -a "$LOG_FILE")
exec 2>&1
echo "===== PostgreSQL Backup Started at $(date) ====="
# Create backup directory if not exists
mkdir -p "$BACKUP_DIR"
# 1. Full Database Backup
echo "1. Creating full database backup..."
FULL_BACKUP_FILE="${BACKUP_DIR}/full_${TIMESTAMP}.dump"
# Use the command itself as the condition: under `set -e`, a separate `$?`
# check would never run after a failure
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -v -d "$DB_NAME" -f "$FULL_BACKUP_FILE"; then
  echo "✓ Full backup created: $FULL_BACKUP_FILE"
  FILE_SIZE=$(du -h "$FULL_BACKUP_FILE" | cut -f1)
  echo "  Size: $FILE_SIZE"
else
  echo "✗ Full backup failed!"
  exit 1
fi
# 2. Generate MD5 checksum
echo "2. Generating checksum..."
md5sum "$FULL_BACKUP_FILE" > "${FULL_BACKUP_FILE}.md5"
echo "✓ Checksum saved: ${FULL_BACKUP_FILE}.md5"
# 3. Per-Schema Backups (Multi-Tenant Isolation)
echo "3. Creating per-schema backups..."
SCHEMAS=("auth" "core" "financial" "inventory" "purchase" "sales" "analytics" "projects" "system")
for schema in "${SCHEMAS[@]}"; do
SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
echo " Backing up schema: $schema"
# if-condition form keeps the loop going on a single-schema failure even with `set -e`
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -n "$schema" -d "$DB_NAME" -f "$SCHEMA_BACKUP_FILE"; then
  SCHEMA_SIZE=$(du -h "$SCHEMA_BACKUP_FILE" | cut -f1)
  echo "  ✓ $schema backup created ($SCHEMA_SIZE)"
else
  echo "  ✗ $schema backup failed!"
fi
done
# 4. Backup Database Roles and Permissions
echo "4. Backing up database roles..."
ROLES_BACKUP_FILE="${BACKUP_DIR}/roles_${TIMESTAMP}.sql"
pg_dumpall -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" --roles-only -f "$ROLES_BACKUP_FILE"
echo "✓ Roles backup created: $ROLES_BACKUP_FILE"
# 5. Backup PostgreSQL Configuration
echo "5. Backing up PostgreSQL configuration..."
CONFIG_BACKUP_DIR="${BACKUP_DIR}/config_${TIMESTAMP}"
mkdir -p "$CONFIG_BACKUP_DIR"
# Copy config files (if accessible)
if [ -f /etc/postgresql/16/main/postgresql.conf ]; then
cp /etc/postgresql/16/main/postgresql.conf "$CONFIG_BACKUP_DIR/"
cp /etc/postgresql/16/main/pg_hba.conf "$CONFIG_BACKUP_DIR/"
echo "✓ Config files backed up"
fi
# 6. Backup WAL Archive Status
echo "6. Recording WAL archive status..."
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -tAc "SELECT * FROM pg_stat_archiver;" > "${BACKUP_DIR}/wal_status_${TIMESTAMP}.txt"
# 7. Upload to Cloud Storage (S3)
if [ -n "${AWS_S3_BUCKET:-}" ]; then
echo "7. Uploading to S3..."
aws s3 cp "$FULL_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/" --storage-class STANDARD_IA
aws s3 cp "${FULL_BACKUP_FILE}.md5" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/"
# Upload per-schema backups
for schema in "${SCHEMAS[@]}"; do
SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
if [ -f "$SCHEMA_BACKUP_FILE" ]; then
aws s3 cp "$SCHEMA_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/schemas/" --storage-class STANDARD_IA
fi
done
echo "✓ Backup uploaded to S3"
fi
# 8. Cleanup Old Backups (Local)
echo "8. Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.sql" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.md5" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type d -name "config_*" -mtime +$RETENTION_DAYS -exec rm -rf {} + 2>/dev/null || true
echo "✓ Old backups cleaned up"
# 9. Verify Backup Integrity
echo "9. Verifying backup integrity..."
if md5sum -c "${FULL_BACKUP_FILE}.md5"; then
  echo "✓ Backup integrity verified"
else
  echo "✗ Backup integrity check failed!"
  exit 1
fi
# 10. Send Notification
echo "10. Sending backup notification..."
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)
# Slack notification (optional)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"✅ PostgreSQL backup completed successfully\n• Database: $DB_NAME\n• Size: $FILE_SIZE\n• Timestamp: $TIMESTAMP\n• Total backup dir size: $BACKUP_SIZE\"}"
fi
echo "===== PostgreSQL Backup Completed at $(date) ====="
echo "Total backup size: $BACKUP_SIZE"
echo "Backup location: $BACKUP_DIR"
# Exit successfully
exit 0
```
### 3.2 Incremental Backup Script
**File:** `scripts/backup-postgres-incremental.sh`
```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL "Incremental" Backup Script
# NOTE: plain pg_dump cannot produce true incrementals; this takes a
# lightweight full logical dump between daily fulls (WAL archiving
# provides the real incremental/PITR capability)
# =====================================================
set -euo pipefail
BACKUP_DIR="/backups/postgres/incremental"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
mkdir -p "$BACKUP_DIR"
echo "===== Incremental Backup Started at $(date) ====="
# Get last full backup timestamp (informational; tolerate a missing backup under pipefail)
LAST_FULL_BACKUP=$(ls -t /backups/postgres/full_*.dump 2>/dev/null | head -1 | grep -oP '\d{8}_\d{6}' || echo "none")
echo "Last full backup: $LAST_FULL_BACKUP"
# Incremental backup: Export only changed data (simplified approach)
# In production, consider using WAL-based incremental backups or pg_basebackup
INCREMENTAL_FILE="${BACKUP_DIR}/incremental_${TIMESTAMP}.dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$DB_NAME" -f "$INCREMENTAL_FILE"
echo "✓ Incremental backup created: $INCREMENTAL_FILE"
echo "===== Incremental Backup Completed at $(date) ====="
```
### 3.3 WAL Archiving Configuration
**File:** `postgresql.conf` (WAL archiving section)
```ini
# WAL Settings for Point-in-Time Recovery
wal_level = replica
archive_mode = on
# archive_command runs in a shell spawned by the server: AWS_S3_BUCKET must be set
# in the server's environment, and a failed S3 upload fails the whole command
# (the WAL segment is then retried, so archiving stalls while S3 is unreachable)
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f && aws s3 cp /backups/wal/%f s3://${AWS_S3_BUCKET}/wal/'
archive_timeout = 60s
max_wal_senders = 3
wal_keep_size = 1GB
```
### 3.4 Cron Schedule
**File:** `/etc/cron.d/erp-backup`
```cron
# ERP Generic Backup Schedule
# Full backup daily at 2:00 AM
0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh >> /var/log/erp-generic/backup.log 2>&1
# Incremental backup every 4 hours
0 */4 * * * root /opt/erp-generic/scripts/backup-postgres-incremental.sh >> /var/log/erp-generic/backup.log 2>&1
# Verify backups daily at 4:00 AM
0 4 * * * root /opt/erp-generic/scripts/verify-backup.sh >> /var/log/erp-generic/backup.log 2>&1
# Cleanup old WAL files daily at 5:00 AM
0 5 * * * root find /backups/wal -type f -mtime +7 -delete
# Weekly full disaster recovery test (Sundays at 3:00 AM)
0 3 * * 0 root /opt/erp-generic/scripts/test-restore.sh >> /var/log/erp-generic/backup-test.log 2>&1
```
---
## 4. MULTI-TENANT BACKUP ISOLATION
### 4.1 Per-Tenant Backup Strategy
**Why Per-Schema Backups?**
- Restore individual tenant without affecting others
- Compliance: GDPR right to erasure (delete tenant data)
- Tenant migration to dedicated instance
- Faster restore times for single tenant issues
**Backup Structure:**
```
/backups/postgres/
├── full_20251124_020000.dump # All schemas
├── auth_20251124_020000.dump # Auth schema only
├── core_20251124_020000.dump # Core schema only
├── financial_20251124_020000.dump # Financial schema only
├── inventory_20251124_020000.dump # Inventory schema only
├── purchase_20251124_020000.dump # Purchase schema only
├── sales_20251124_020000.dump # Sales schema only
├── analytics_20251124_020000.dump # Analytics schema only
├── projects_20251124_020000.dump # Projects schema only
└── system_20251124_020000.dump # System schema only
```
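Given this layout, restore tooling usually needs the newest dump for one schema. A small sketch of that selection, demonstrated on temporary files so it runs anywhere:

```shell
#!/bin/bash
# Pick the newest dump for one schema from a directory laid out as above.
set -euo pipefail

BACKUP_DIR=$(mktemp -d)          # stand-in for /backups/postgres
trap 'rm -rf "$BACKUP_DIR"' EXIT

# Simulate two backup generations for the financial schema
touch -d '2 days ago' "$BACKUP_DIR/financial_20251122_020000.dump"
touch                 "$BACKUP_DIR/financial_20251124_020000.dump"

SCHEMA="financial"
# ls -t sorts by modification time, newest first
LATEST=$(ls -t "$BACKUP_DIR/${SCHEMA}_"*.dump | head -1)
echo "Newest $SCHEMA backup: $(basename "$LATEST")"
```

The same one-liner, pointed at `/backups/postgres`, feeds the schema-level restore commands in section 6.2.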
### 4.2 Restore Single Tenant
```bash
#!/bin/bash
# Restore single tenant (schema isolation)
TENANT_ID="tenant-abc"
SCHEMA_NAME="financial" # Example: restore financial schema only
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"
echo "Restoring schema $SCHEMA_NAME for tenant $TENANT_ID..."
# Option 1: Drop and restore the entire schema
# (WARNING: in a shared-schema setup this replaces data for ALL tenants in the schema)
psql -h postgres -U erp_user -d erp_generic -c "DROP SCHEMA IF EXISTS $SCHEMA_NAME CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA_NAME --clean --if-exists "$BACKUP_FILE"
# Option 2: Restore into a scratch database, then copy only this tenant's rows
# (pg_restore cannot remap schema names, so restore the schema as-is elsewhere)
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE erp_restore_scratch;"
pg_restore -h postgres -U erp_user -d erp_restore_scratch -n $SCHEMA_NAME "$BACKUP_FILE"
# Copy this tenant's rows across databases with COPY over a pipe
psql -h postgres -U erp_user -d erp_restore_scratch \
  -c "COPY (SELECT * FROM $SCHEMA_NAME.accounts WHERE tenant_id = '$TENANT_ID') TO STDOUT" \
  | psql -h postgres -U erp_user -d erp_generic -c "COPY $SCHEMA_NAME.accounts FROM STDIN"
# Drop the scratch database
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE erp_restore_scratch;"
echo "✓ Tenant $TENANT_ID restored from $SCHEMA_NAME schema"
```
---
## 5. RETENTION POLICY
### 5.1 Retention Rules
**Local Storage (Fast Recovery):**
- **Full Backups:** 7 days
- **Incremental Backups:** 7 days
- **WAL Archives:** 7 days
- **Config Backups:** 30 days
**Cloud Storage (Long-term Retention):**
- **Daily Backups:** 30 days (Standard storage)
- **Weekly Backups:** 12 weeks (Standard-IA storage)
- **Monthly Backups:** 12 months (Glacier storage)
- **Annual Backups:** 7 years (Glacier Deep Archive) - Compliance
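The local rules above are enforced with the same `find -mtime` expiry the backup script uses in its cleanup step. A sketch of how that expiry behaves, on synthetic files with back-dated timestamps:

```shell
#!/bin/bash
# Demonstrates the 7-day local expiry on synthetic files.
set -euo pipefail

DIR=$(mktemp -d)
trap 'rm -rf "$DIR"' EXIT

touch -d '10 days ago' "$DIR/full_old.dump"    # past retention -> deleted
touch -d '2 days ago'  "$DIR/full_recent.dump" # inside retention -> kept

RETENTION_DAYS=7
# -mtime +7 matches files strictly older than 7 whole days
find "$DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete

ls "$DIR"
```

Note that `-mtime +N` rounds down to whole 24-hour periods, so a file exactly 7 days old survives until the next day's run.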
**Lifecycle Policy (AWS S3 Example):**
```json
{
"Rules": [
{
"Id": "TransitionToIA",
"Status": "Enabled",
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
}
]
},
{
"Id": "TransitionToGlacier",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "GLACIER"
}
]
},
{
"Id": "DeleteOldBackups",
"Status": "Enabled",
"Expiration": {
"Days": 365
},
"Filter": {
"Prefix": "postgres/daily/"
}
}
]
}
```
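Applying the policy is a single `aws s3api put-bucket-lifecycle-configuration` call. A sketch that syntax-checks the JSON locally first and prints the apply command (the apply step is only echoed here, since it needs real AWS credentials; the bucket name is the one from section 2.2):

```shell
#!/bin/bash
# Validate a lifecycle policy locally, then show the command that applies it.
set -euo pipefail

POLICY_FILE=$(mktemp --suffix=.json)
trap 'rm -f "$POLICY_FILE"' EXIT

cat > "$POLICY_FILE" <<'EOF'
{"Rules":[{"Id":"TransitionToIA","Status":"Enabled",
"Filter":{"Prefix":"postgres/"},
"Transitions":[{"Days":30,"StorageClass":"STANDARD_IA"}]}]}
EOF

# Cheap syntax check before touching the bucket
python3 -m json.tool "$POLICY_FILE" >/dev/null && echo "policy JSON OK"

# Run this for real once credentials are configured:
echo aws s3api put-bucket-lifecycle-configuration \
  --bucket erp-generic-backups-prod \
  --lifecycle-configuration "file://$POLICY_FILE"
```

Validating first matters because S3 rejects the whole configuration (including rules that were previously in effect) when any single rule is malformed.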
### 5.2 Compliance Requirements
**GDPR (EU):**
- Right to erasure: Ability to delete tenant data
- Backup encryption: AES-256
- Access logging: Who accessed what backup
**SOC 2:**
- Backup testing: Monthly restore drills
- Access controls: Role-based access to backups
- Audit trail: Log all backup/restore operations
**Industry-Specific:**
- Healthcare (HIPAA): 6 years retention
- Financial (SOX): 7 years retention
- Government: Variable (check local regulations)
---
## 6. RECOVERY PROCEDURES
### 6.1 Full Database Restore
**File:** `scripts/restore-postgres.sh`
```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Restore Script
# Restores database from backup file
# =====================================================
set -euo pipefail
# Usage
if [ $# -lt 1 ]; then
echo "Usage: $0 <backup_file> [--target=<database>] [--no-prompt]"
echo "Example: $0 /backups/postgres/full_20251124_020000.dump --target=erp_generic_staging"
exit 1
fi
BACKUP_FILE="$1"
TARGET_DB="erp_generic"
NO_PROMPT=false
# Parse optional flags (start at $2 so a flag is never mistaken for the target DB)
for arg in "${@:2}"; do
  case $arg in
    --target=*)
      TARGET_DB="${arg#*=}"
      ;;
    --no-prompt)
      NO_PROMPT=true
      ;;
    *)
      echo "Unknown option: $arg"
      exit 1
      ;;
  esac
done
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
echo "===== PostgreSQL Restore Started at $(date) ====="
echo "Backup file: $BACKUP_FILE"
echo "Target database: $TARGET_DB"
# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
echo "✗ Backup file not found: $BACKUP_FILE"
exit 1
fi
# Verify checksum if exists
if [ -f "${BACKUP_FILE}.md5" ]; then
echo "Verifying backup integrity..."
if ! md5sum -c "${BACKUP_FILE}.md5"; then
  echo "✗ Backup integrity check failed!"
  exit 1
fi
echo "✓ Backup integrity verified"
fi
# Safety prompt (unless --no-prompt)
if [ "$NO_PROMPT" = false ]; then
echo ""
echo "⚠️ WARNING: This will OVERWRITE all data in database '$TARGET_DB'"
echo "⚠️ Make sure you have a recent backup before proceeding!"
echo ""
read -p "Are you sure you want to continue? (yes/no): " -r
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
echo "Restore cancelled by user."
exit 0
fi
fi
# Create safety backup of target database (if not staging/dev)
if [[ ! "$TARGET_DB" =~ (staging|dev|test) ]]; then
echo "Creating safety backup of $TARGET_DB before restore..."
SAFETY_BACKUP="/backups/postgres/safety_${TARGET_DB}_$(date +%Y%m%d_%H%M%S).dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$TARGET_DB" -f "$SAFETY_BACKUP"
echo "✓ Safety backup created: $SAFETY_BACKUP"
fi
# Terminate active connections to target database
echo "Terminating active connections to $TARGET_DB..."
psql -h "$DB_HOST" -U "$DB_USER" -d postgres <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$TARGET_DB' AND pid <> pg_backend_pid();
SQL
# Restore database
echo "Restoring database from $BACKUP_FILE..."
if pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" --clean --if-exists --verbose "$BACKUP_FILE"; then
  echo "✓ Database restored successfully"
else
  echo "✗ Restore failed!"
  exit 1
fi
# Verify restore
echo "Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
USER_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM auth.users;" 2>/dev/null || echo "N/A")
echo "Tables restored: $TABLE_COUNT"
echo "Users in auth.users: $USER_COUNT"
# Rebuild statistics
echo "Rebuilding database statistics..."
psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -c "ANALYZE;"
echo "✓ Statistics rebuilt"
echo "===== PostgreSQL Restore Completed at $(date) ====="
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run smoke tests: npm run test:smoke"
echo "3. Check logs for errors"
echo "4. Notify team that restore is complete"
exit 0
```
### 6.2 Schema-Level Restore
```bash
# Restore single schema
SCHEMA="financial"
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA --clean --if-exists "$BACKUP_FILE"
```
### 6.3 Table-Level Restore
```bash
# Restore single table (pg_restore's -t takes an unqualified table name,
# so pass the schema separately with -n)
SCHEMA="auth"
TABLE="users"
BACKUP_FILE="/backups/postgres/full_20251124_020000.dump"
# --data-only appends rows; TRUNCATE the table first if you need an exact replacement
pg_restore -h postgres -U erp_user -d erp_generic -n "$SCHEMA" -t "$TABLE" --data-only "$BACKUP_FILE"
```
---
## 7. POINT-IN-TIME RECOVERY (PITR)
### 7.1 PITR Process
**Use Case:** Restore database to specific point in time (e.g., before accidental DELETE query)
**Requirements:**
- A physical base backup taken with `pg_basebackup` (a logical `pg_dump` file cannot be replayed for PITR)
- WAL archives covering the span from that base backup to the target time
**Steps:**
```bash
#!/bin/bash
# Point-in-Time Recovery (PITR)
# Prerequisite: a PHYSICAL base backup taken earlier with pg_basebackup —
# a logical pg_dump file cannot be replayed against WAL archives.
set -euo pipefail
TARGET_TIME="2025-11-24 14:30:00"
BASE_BACKUP_DIR="/backups/basebackup/latest"   # plain-format pg_basebackup output (adjust to your path)
WAL_ARCHIVE_DIR="/backups/wal"
DATA_DIR="/data/postgres"
echo "===== Point-in-Time Recovery to $TARGET_TIME ====="
# 1. Stop PostgreSQL
docker-compose stop postgres
# 2. Set aside the current data directory (safety)
mv "$DATA_DIR" "${DATA_DIR}_backup_$(date +%Y%m%d_%H%M%S)"
# 3. Restore the base backup into a fresh data directory
mkdir -p "$DATA_DIR"
cp -a "$BASE_BACKUP_DIR"/. "$DATA_DIR"/
# 4. Configure recovery — PostgreSQL 12+ replaced recovery.conf with
#    recovery.signal plus ordinary configuration parameters
cat >> "$DATA_DIR/postgresql.auto.conf" <<EOF
restore_command = 'cp ${WAL_ARCHIVE_DIR}/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch "$DATA_DIR/recovery.signal"
# 5. Start PostgreSQL (enters recovery mode and replays WAL to the target time)
docker-compose start postgres
# 6. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep -i recovery
echo "Recovery complete! Database restored to $TARGET_TIME"
```
### 7.2 PITR Best Practices
- **Test PITR quarterly** in non-production environment
- **Document recovery times** (usually 30 min - 2 hours depending on WAL size)
- **Automate WAL archiving** to S3/cloud storage
- **Monitor WAL archive lag** (should be <5 minutes)
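The lag check in the last bullet is easy to script. A sketch that computes lag from a `last_archived_time` value — shown here with a fixed sample timestamp so it runs without a live server; in production, pull the value from `pg_stat_archiver` as in the commented query:

```shell
#!/bin/bash
# Compute WAL archive lag and compare against the 5-minute budget.
set -euo pipefail

# In production, fetch the timestamp from the server:
#   LAST_ARCHIVED=$(psql -h postgres -U erp_user -d erp_generic -tAc \
#     "SELECT last_archived_time FROM pg_stat_archiver;")
# Fixed sample here so the sketch runs anywhere:
LAST_ARCHIVED=$(date -d '-2 minutes' '+%Y-%m-%d %H:%M:%S')

NOW_EPOCH=$(date +%s)
ARCHIVED_EPOCH=$(date -d "$LAST_ARCHIVED" +%s)
LAG_SECONDS=$(( NOW_EPOCH - ARCHIVED_EPOCH ))

if [ "$LAG_SECONDS" -gt 300 ]; then
  echo "ALERT: WAL archive lag is ${LAG_SECONDS}s (budget: 300s)"
else
  echo "OK: WAL archive lag is ${LAG_SECONDS}s"
fi
```

Wired into cron or a Prometheus textfile exporter, this turns the "<5 minutes" bullet into an actual alert instead of a manual check.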
---
## 8. DISASTER RECOVERY PLAYBOOK
### 8.1 Disaster Scenarios
**Scenario 1: Complete Data Center Failure**
**Symptoms:**
- All servers unreachable
- Network connectivity lost
- Physical infrastructure damaged
**Recovery Steps:**
```bash
# 1. Provision new infrastructure (AWS, Azure, GCP)
terraform apply -var-file=disaster-recovery.tfvars
# 2. Download the latest full backups from cloud storage
# (backups live under postgres/<TIMESTAMP>/, so the include pattern needs a leading wildcard)
aws s3 sync s3://erp-generic-backups-prod/postgres/ /backups/postgres/ --exclude "*" --include "*full_*"
# 3. Deploy Docker containers
docker-compose -f docker-compose.prod.yml up -d postgres redis
# 4. Restore database
LATEST_BACKUP=$(ls -t /backups/postgres/full_*.dump | head -1)
./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
# 5. Restore WAL archives for PITR
aws s3 sync s3://erp-generic-backups-prod/wal/ /backups/wal/
# 6. Deploy application
docker-compose -f docker-compose.prod.yml up -d backend frontend nginx
# 7. Update DNS (point to new infrastructure)
# Manual step or use Route53/CloudFlare API
# 8. Verify functionality
./scripts/health-check.sh
npm run test:smoke
# 9. Notify stakeholders
# Send email/Slack notification
```
**Estimated RTO:** 4 hours
**Estimated RPO:** 15 minutes (last WAL archive)
---
**Scenario 2: Accidental Data Deletion**
**Symptoms:**
- Critical data deleted by user error
- Example: a `DELETE FROM sales.orders` run without a `WHERE` clause
**Recovery Steps:**
```bash
# 1. Identify deletion time from the audit trail
psql -h postgres -U erp_user -d erp_generic -c "SELECT * FROM system.audit_logs WHERE event = 'DELETE' AND table_name = 'sales.orders' ORDER BY created_at DESC LIMIT 10;"
# 2. Use PITR to restore to before deletion
./scripts/pitr-restore.sh --target-time="2025-11-24 14:25:00"
# 3. Export deleted records from restored database (client-side \COPY writes the file locally)
psql -h restored-db -U erp_user -d erp_generic -c "\COPY (SELECT * FROM sales.orders WHERE created_at >= '2025-11-24 14:00:00') TO '/tmp/recovered_orders.csv' CSV HEADER;"
# 4. Import recovered records to production
psql -h postgres -U erp_user -d erp_generic -c "\COPY sales.orders FROM '/tmp/recovered_orders.csv' CSV HEADER;"
echo "✓ Deleted records recovered"
```
**Estimated RTO:** 1 hour
**Estimated RPO:** 0 (no data loss if caught quickly)
---
**Scenario 3: Database Corruption**
**Symptoms:**
- PostgreSQL fails to start
- "corrupt page" errors in logs
- Data inconsistencies
**Recovery Steps:**
```bash
# 1. Assess the damage first (pg_amcheck ships with PostgreSQL 14+)
docker-compose exec postgres pg_amcheck -d erp_generic || true
# 2. Restore from the most recent known-good backup — this is the safe path.
#    pg_resetwal is NOT a repair tool: it discards WAL and can itself lose
#    committed data, so treat it strictly as a last resort when no backup is usable.
./scripts/restore-postgres.sh /backups/postgres/full_20251124_020000.dump
# 3. Run VACUUM and ANALYZE
psql -h postgres -U erp_user -d erp_generic -c "VACUUM FULL; ANALYZE;"
# 4. Rebuild indexes
psql -h postgres -U erp_user -d erp_generic -c "REINDEX DATABASE erp_generic;"
```
---
## 9. BACKUP TESTING
### 9.1 Monthly Restore Test
**File:** `scripts/test-restore.sh`
```bash
#!/bin/bash
# Monthly backup restore test
BACKUP_FILE=$(ls -t /backups/postgres/full_*.dump | head -1)
TEST_DB="erp_generic_restore_test"
echo "===== Backup Restore Test ====="
echo "Backup: $BACKUP_FILE"
# 1. Drop test database if exists
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"
# 2. Create test database
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE $TEST_DB;"
# 3. Restore backup
pg_restore -h postgres -U erp_user -d $TEST_DB --clean --if-exists "$BACKUP_FILE"
# 4. Run smoke tests
psql -h postgres -U erp_user -d $TEST_DB <<SQL
-- Verify table counts
SELECT COUNT(*) AS user_count FROM auth.users;
SELECT COUNT(*) AS partner_count FROM core.partners;
SELECT COUNT(*) AS order_count FROM sales.orders;
-- Verify data integrity
SELECT 'OK' AS status WHERE (
SELECT COUNT(*) FROM auth.users
) > 0;
SQL
# 5. Cleanup
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE $TEST_DB;"
echo "✓ Backup restore test passed"
```
### 9.2 Quarterly DR Drill
**Checklist:**
- [ ] Provision new infrastructure (staging environment)
- [ ] Restore from cloud backup (S3)
- [ ] Verify all 9 schemas restored
- [ ] Run full test suite (unit + integration + E2E)
- [ ] Measure RTO (actual time to restore)
- [ ] Measure RPO (data loss amount)
- [ ] Document findings and improvements
- [ ] Update DR playbook
**Success Criteria:**
- RTO < 4 hours
- RPO < 15 minutes
- All tests passing
- Zero critical data loss
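Measuring the actual RTO during the drill is just wall-clock timing around the restore steps. A sketch, timed around a placeholder command instead of the real restore:

```shell
#!/bin/bash
# Time the restore phase of a DR drill and compare against the 4-hour RTO budget.
set -euo pipefail

RTO_BUDGET_SECONDS=$(( 4 * 3600 ))

START=$(date +%s)
# Placeholder for the real drill steps, e.g.:
#   ./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
sleep 1
END=$(date +%s)

ELAPSED=$(( END - START ))
echo "Measured RTO: ${ELAPSED}s (budget: ${RTO_BUDGET_SECONDS}s)"

if [ "$ELAPSED" -le "$RTO_BUDGET_SECONDS" ]; then
  echo "RTO within budget"
else
  echo "RTO EXCEEDED" >&2
  exit 1
fi
```

Recording the measured value in each drill's findings gives the trend line that the quarterly checklist asks for.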
---
## 10. REFERENCES
**Internal Documentation:**
- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Monitoring & Observability](./MONITORING-OBSERVABILITY.md)
- [Database Schemas](../02-modelado/database-design/schemas/)
**External Resources:**
- [PostgreSQL Backup Documentation](https://www.postgresql.org/docs/16/backup.html)
- [PostgreSQL PITR](https://www.postgresql.org/docs/16/continuous-archiving.html)
- [AWS RDS Backup Best Practices](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html)
---
**Document:** BACKUP-RECOVERY.md
**Version:** 1.0
**Last Updated:** 2025-11-24