BACKUP & RECOVERY - ERP Generic
Last updated: 2025-11-24 | Owner: DevOps Team / DBA Team | Status: ✅ Production-Ready
TABLE OF CONTENTS
- Overview
- Backup Strategy
- Backup Scripts
- Multi-Tenant Backup Isolation
- Retention Policy
- Recovery Procedures
- Point-in-Time Recovery (PITR)
- Disaster Recovery Playbook
- Backup Testing
- References
1. OVERVIEW
1.1 Backup Architecture
┌─────────────────────────────────────────────────────────────┐
│ PostgreSQL 16 Database │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ auth │ │ core │ │ financial│ │ inventory│ │
│ │ schema │ │ schema │ │ schema │ │ schema │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ purchase │ │ sales │ │ analytics│ │ projects │ │
│ │ schema │ │ schema │ │ schema │ │ schema │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │
│ │ system │ │
│ │ schema │ │
│ └────┬─────┘ │
└───────┼───────────────────────────────────────────────────┘
│
↓ (Automated backup every 4 hours)
┌─────────────────────────────────────────────────────────────┐
│ Local Backup Storage (/backups) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Full Backup │ │ Incremental │ │ Per-Schema │ │
│ │ (Daily) │ │ (4 hours) │ │ Backups │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼──────────────────┼──────────────────┼─────────────┘
│ │ │
│ (Sync every hour)│ │
↓ ↓ ↓
┌─────────────────────────────────────────────────────────────┐
│ Cloud Storage (S3 / Azure Blob / GCS) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Versioning Enabled | Lifecycle Rules | Encryption │ │
│ │ Retention: 7d + 4w + 12m │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ WAL (Write-Ahead Logs) Archive │
│ Continuous archiving for Point-in-Time Recovery (PITR) │
│ Retention: 7 days │
└─────────────────────────────────────────────────────────────┘
1.2 Backup Objectives
RTO (Recovery Time Objective): 4 hours
- Maximum time to restore service after failure
RPO (Recovery Point Objective): 15 minutes
- Maximum data loss acceptable (via WAL archiving)
Data Durability: 99.999999999% (11 nines)
- Multi-region cloud storage replication
Backup Types:
- Full Backup: Complete database snapshot (daily at 2:00 AM)
- Incremental Backup: Changes since last full backup (every 4 hours)
- WAL Archive: Continuous archiving for PITR (every 60 seconds)
- Per-Schema Backup: Individual schema backups for multi-tenant isolation (daily)
2. BACKUP STRATEGY
2.1 Backup Schedule
| Backup Type | Frequency | Retention | Size (Est.) | Duration |
|---|---|---|---|---|
| Full Backup | Daily (2:00 AM) | 7 days local + 12 months cloud | 50-100 GB | 30-45 min |
| Incremental Backup | Every 4 hours | 7 days local | 5-10 GB | 5-10 min |
| WAL Archive | Continuous (per 16 MB segment, or every 60 s via archive_timeout) | 7 days | 100-200 GB/week | Real-time |
| Per-Schema Backup | Daily (3:00 AM) | 7 days local + 4 weeks cloud | 5-15 GB each | 5-10 min each |
| Config Backup | On change + daily | 30 days | <100 MB | <1 min |
2.2 Backup Storage Locations
Primary Storage (Local):
- Path: /backups/postgres/
- Filesystem: XFS (optimized for large files)
- Capacity: 500 GB minimum
- RAID: RAID 10 (performance + redundancy)
Secondary Storage (Cloud):
- Provider: AWS S3 / Azure Blob Storage / Google Cloud Storage
- Bucket: erp-generic-backups-prod
- Region: Multi-region replication (e.g., us-east-1 + us-west-2)
- Encryption: AES-256 at rest
- Versioning: Enabled (noncurrent versions pruned via lifecycle rules)
- Lifecycle Rules:
- Move to Glacier after 30 days
- Delete after 1 year (except annual backups)
Tertiary Storage (Offsite):
- Type: Tape backup / Cold storage
- Frequency: Monthly
- Retention: 7 years (compliance requirement)
2.3 Backup Validation
Automated Validation (Daily):
# 1. Verify backup file integrity
md5sum /backups/postgres/full_20251124_020000.dump > /backups/postgres/full_20251124_020000.dump.md5
# 2. Test restore to staging environment (weekly)
pg_restore --dbname=erp_generic_staging --clean --if-exists /backups/postgres/full_20251124_020000.dump
# 3. Run smoke tests on restored database
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM auth.users;"
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM core.partners;"
# 4. Compare row counts with production
diff <(psql -d erp_generic -tAc "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY schemaname, relname") \
<(psql -d erp_generic_staging -tAc "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY schemaname, relname")
Manual Validation (Monthly):
- Full disaster recovery drill
- Restore to isolated environment
- Verify business-critical data
- Test application functionality
- Document findings in post-mortem
3. BACKUP SCRIPTS
3.1 Full Backup Script
File: scripts/backup-postgres.sh
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Full Backup Script
# Performs full database backup with multi-schema support
# =====================================================
set -euo pipefail
# Configuration
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
RETENTION_DAYS=7
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
# Logging
LOG_FILE="/var/log/erp-generic/backup.log"
exec > >(tee -a "$LOG_FILE")
exec 2>&1
echo "===== PostgreSQL Backup Started at $(date) ====="
# Create backup directory if not exists
mkdir -p "$BACKUP_DIR"
# 1. Full Database Backup
echo "1. Creating full database backup..."
FULL_BACKUP_FILE="${BACKUP_DIR}/full_${TIMESTAMP}.dump"
# Run pg_dump inside the if-condition: with set -e, a bare command followed
# by a $? check would abort the script before the error branch could run
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -v -d "$DB_NAME" -f "$FULL_BACKUP_FILE"; then
echo "✓ Full backup created: $FULL_BACKUP_FILE"
FILE_SIZE=$(du -h "$FULL_BACKUP_FILE" | cut -f1)
echo " Size: $FILE_SIZE"
else
echo "✗ Full backup failed!"
exit 1
fi
# 2. Generate MD5 checksum
echo "2. Generating checksum..."
md5sum "$FULL_BACKUP_FILE" > "${FULL_BACKUP_FILE}.md5"
echo "✓ Checksum saved: ${FULL_BACKUP_FILE}.md5"
# 3. Per-Schema Backups (Multi-Tenant Isolation)
echo "3. Creating per-schema backups..."
SCHEMAS=("auth" "core" "financial" "inventory" "purchase" "sales" "analytics" "projects" "system")
for schema in "${SCHEMAS[@]}"; do
SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
echo " Backing up schema: $schema"
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -n "$schema" -d "$DB_NAME" -f "$SCHEMA_BACKUP_FILE"; then
SCHEMA_SIZE=$(du -h "$SCHEMA_BACKUP_FILE" | cut -f1)
echo " ✓ $schema backup created ($SCHEMA_SIZE)"
else
echo " ✗ $schema backup failed!"
fi
done
# 4. Backup Database Roles and Permissions
echo "4. Backing up database roles..."
ROLES_BACKUP_FILE="${BACKUP_DIR}/roles_${TIMESTAMP}.sql"
pg_dumpall -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" --roles-only -f "$ROLES_BACKUP_FILE"
echo "✓ Roles backup created: $ROLES_BACKUP_FILE"
# 5. Backup PostgreSQL Configuration
echo "5. Backing up PostgreSQL configuration..."
CONFIG_BACKUP_DIR="${BACKUP_DIR}/config_${TIMESTAMP}"
mkdir -p "$CONFIG_BACKUP_DIR"
# Copy config files (if accessible)
if [ -f /etc/postgresql/16/main/postgresql.conf ]; then
cp /etc/postgresql/16/main/postgresql.conf "$CONFIG_BACKUP_DIR/"
cp /etc/postgresql/16/main/pg_hba.conf "$CONFIG_BACKUP_DIR/"
echo "✓ Config files backed up"
fi
# 6. Backup WAL Archive Status
echo "6. Recording WAL archive status..."
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -tAc "SELECT * FROM pg_stat_archiver;" > "${BACKUP_DIR}/wal_status_${TIMESTAMP}.txt"
# 7. Upload to Cloud Storage (S3)
if [ -n "${AWS_S3_BUCKET:-}" ]; then
echo "7. Uploading to S3..."
aws s3 cp "$FULL_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/" --storage-class STANDARD_IA
aws s3 cp "${FULL_BACKUP_FILE}.md5" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/"
# Upload per-schema backups
for schema in "${SCHEMAS[@]}"; do
SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
if [ -f "$SCHEMA_BACKUP_FILE" ]; then
aws s3 cp "$SCHEMA_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/schemas/" --storage-class STANDARD_IA
fi
done
echo "✓ Backup uploaded to S3"
fi
# 8. Cleanup Old Backups (Local)
echo "8. Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.sql" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.md5" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type d -name "config_*" -mtime +$RETENTION_DAYS -exec rm -rf {} + 2>/dev/null || true
echo "✓ Old backups cleaned up"
# 9. Verify Backup Integrity
echo "9. Verifying backup integrity..."
if md5sum -c "${FULL_BACKUP_FILE}.md5"; then
echo "✓ Backup integrity verified"
else
echo "✗ Backup integrity check failed!"
exit 1
fi
# 10. Send Notification
echo "10. Sending backup notification..."
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)
# Slack notification (optional)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
curl -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"✅ PostgreSQL backup completed successfully\n• Database: $DB_NAME\n• Size: $FILE_SIZE\n• Timestamp: $TIMESTAMP\n• Total backup dir size: $BACKUP_SIZE\"}"
fi
echo "===== PostgreSQL Backup Completed at $(date) ====="
echo "Total backup size: $BACKUP_SIZE"
echo "Backup location: $BACKUP_DIR"
# Exit successfully
exit 0
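A manual invocation for testing might look like this (credential and bucket values are placeholders; the script reads the same environment variables it defaults from):
POSTGRES_PASSWORD='***' AWS_S3_BUCKET=erp-generic-backups-prod \
/opt/erp-generic/scripts/backup-postgres.sh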
3.2 Incremental Backup Script
File: scripts/backup-postgres-incremental.sh
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Incremental Backup Script
# Simplified incremental: takes a full logical dump; true increments
# are provided by the WAL archive (see 3.3)
# =====================================================
set -euo pipefail
BACKUP_DIR="/backups/postgres/incremental"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
mkdir -p "$BACKUP_DIR"
echo "===== Incremental Backup Started at $(date) ====="
# Get last full backup timestamp
LAST_FULL_BACKUP=$(ls -t /backups/postgres/full_*.dump | head -1 | grep -oP '\d{8}_\d{6}')
echo "Last full backup: $LAST_FULL_BACKUP"
# Incremental backup: Export only changed data (simplified approach)
# In production, consider using WAL-based incremental backups or pg_basebackup
INCREMENTAL_FILE="${BACKUP_DIR}/incremental_${TIMESTAMP}.dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$DB_NAME" -f "$INCREMENTAL_FILE"
echo "✓ Incremental backup created: $INCREMENTAL_FILE"
echo "===== Incremental Backup Completed at $(date) ====="
3.3 WAL Archiving Configuration
File: postgresql.conf (WAL archiving section)
# WAL Settings for Point-in-Time Recovery
wal_level = replica
archive_mode = on
# Note: ${AWS_S3_BUCKET} is expanded by the shell running archive_command,
# so it must be set in the PostgreSQL server's environment
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f && aws s3 cp /backups/wal/%f s3://${AWS_S3_BUCKET}/wal/'
archive_timeout = 60s
max_wal_senders = 3
wal_keep_size = 1GB
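To confirm archiving works end to end after changing these settings, force a segment switch and watch the archive directory (a quick sanity check, not part of the shipped scripts):
psql -h postgres -U erp_user -d erp_generic -c "SELECT pg_switch_wal();"
ls -lt /backups/wal | head   # the switched segment should appear within seconds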
3.4 Cron Schedule
File: /etc/cron.d/erp-backup
# ERP Generic Backup Schedule
# Full backup daily at 2:00 AM
0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh >> /var/log/erp-generic/backup.log 2>&1
# Incremental backup every 4 hours
0 */4 * * * root /opt/erp-generic/scripts/backup-postgres-incremental.sh >> /var/log/erp-generic/backup.log 2>&1
# Verify backups daily at 4:00 AM
0 4 * * * root /opt/erp-generic/scripts/verify-backup.sh >> /var/log/erp-generic/backup.log 2>&1
# Cleanup old WAL files daily at 5:00 AM
0 5 * * * root find /backups/wal -type f -mtime +7 -delete
# Weekly full disaster recovery test (Sundays at 3:00 AM)
0 3 * * 0 root /opt/erp-generic/scripts/test-restore.sh >> /var/log/erp-generic/backup-test.log 2>&1
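A freshness check complements the schedule above. This sketch (not one of the shipped scripts) flags a missed run; the alert line could be wired to the same Slack webhook used in backup-postgres.sh:
# Alert if the newest full backup is older than 26 hours (1560 minutes)
RECENT=$(find /backups/postgres -name 'full_*.dump' -mmin -1560 | head -1)
[ -n "$RECENT" ] || echo "ALERT: no full backup in the last 26 hours"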
4. MULTI-TENANT BACKUP ISOLATION
4.1 Per-Tenant Backup Strategy
Why Per-Schema Backups?
- Restore individual tenant without affecting others
- Compliance: GDPR right to erasure (delete tenant data)
- Tenant migration to dedicated instance
- Faster restore times for single tenant issues
Backup Structure:
/backups/postgres/
├── full_20251124_020000.dump # All schemas
├── auth_20251124_020000.dump # Auth schema only
├── core_20251124_020000.dump # Core schema only
├── financial_20251124_020000.dump # Financial schema only
├── inventory_20251124_020000.dump # Inventory schema only
├── purchase_20251124_020000.dump # Purchase schema only
├── sales_20251124_020000.dump # Sales schema only
├── analytics_20251124_020000.dump # Analytics schema only
├── projects_20251124_020000.dump # Projects schema only
└── system_20251124_020000.dump # System schema only
4.2 Restore Single Tenant
#!/bin/bash
# Restore single tenant (schema isolation)
TENANT_ID="tenant-abc"
SCHEMA_NAME="financial" # Example: restore financial schema only
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"
echo "Restoring schema $SCHEMA_NAME for tenant $TENANT_ID..."
# Option 1: Drop and restore entire schema
psql -h postgres -U erp_user -d erp_generic -c "DROP SCHEMA IF EXISTS $SCHEMA_NAME CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA_NAME --clean --if-exists "$BACKUP_FILE"
# Option 2: Restore into a scratch database, then copy tenant-specific
# rows back (pg_restore cannot restore a schema under a new name)
createdb -h postgres -U erp_user erp_restore_scratch
pg_restore -h postgres -U erp_user -d erp_restore_scratch -n "$SCHEMA_NAME" "$BACKUP_FILE"
# Copy tenant-specific data back via a CSV round-trip
psql -h postgres -U erp_user -d erp_restore_scratch -c "\COPY (SELECT * FROM $SCHEMA_NAME.accounts WHERE tenant_id = '$TENANT_ID') TO '/tmp/${TENANT_ID}_accounts.csv' CSV"
psql -h postgres -U erp_user -d erp_generic -c "\COPY $SCHEMA_NAME.accounts FROM '/tmp/${TENANT_ID}_accounts.csv' CSV"
# Drop scratch database
dropdb -h postgres -U erp_user erp_restore_scratch
echo "✓ Tenant $TENANT_ID restored from $SCHEMA_NAME schema"
5. RETENTION POLICY
5.1 Retention Rules
Local Storage (Fast Recovery):
- Full Backups: 7 days
- Incremental Backups: 7 days
- WAL Archives: 7 days
- Config Backups: 30 days
Cloud Storage (Long-term Retention):
- Daily Backups: 30 days (Standard storage)
- Weekly Backups: 12 weeks (Standard-IA storage)
- Monthly Backups: 12 months (Glacier storage)
- Annual Backups: 7 years (Glacier Deep Archive) - Compliance
Lifecycle Policy (AWS S3 Example):
{
"Rules": [
{
"Id": "TransitionToIA",
"Status": "Enabled",
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
}
]
},
{
"Id": "TransitionToGlacier",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "GLACIER"
}
]
},
{
"Id": "DeleteOldBackups",
"Status": "Enabled",
"Expiration": {
"Days": 365
},
"Filter": {
"Prefix": "postgres/daily/"
}
}
]
}
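Saved as lifecycle.json, the policy above can be applied with the AWS CLI (bucket name per section 2.2):
aws s3api put-bucket-lifecycle-configuration \
--bucket erp-generic-backups-prod \
--lifecycle-configuration file://lifecycle.json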
5.2 Compliance Requirements
GDPR (EU):
- Right to erasure: Ability to delete tenant data
- Backup encryption: AES-256
- Access logging: Who accessed what backup
SOC 2:
- Backup testing: Monthly restore drills
- Access controls: Role-based access to backups
- Audit trail: Log all backup/restore operations
Industry-Specific:
- Healthcare (HIPAA): 6 years retention
- Financial (SOX): 7 years retention
- Government: Variable (check local regulations)
6. RECOVERY PROCEDURES
6.1 Full Database Restore
File: scripts/restore-postgres.sh
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Restore Script
# Restores database from backup file
# =====================================================
set -euo pipefail
# Usage
if [ $# -lt 1 ]; then
echo "Usage: $0 <backup_file> [--target=<database>] [--no-prompt]"
echo "Example: $0 /backups/postgres/full_20251124_020000.dump --target=erp_generic_staging"
exit 1
fi
BACKUP_FILE="$1"
TARGET_DB="erp_generic"
NO_PROMPT=false
# Parse optional flags (everything after the positional backup file;
# a positional $2 would otherwise swallow --no-prompt as the target DB)
for arg in "${@:2}"; do
case $arg in
--target=*)
TARGET_DB="${arg#*=}"
;;
--no-prompt)
NO_PROMPT=true
;;
esac
done
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD
echo "===== PostgreSQL Restore Started at $(date) ====="
echo "Backup file: $BACKUP_FILE"
echo "Target database: $TARGET_DB"
# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
echo "✗ Backup file not found: $BACKUP_FILE"
exit 1
fi
# Verify checksum if exists
if [ -f "${BACKUP_FILE}.md5" ]; then
echo "Verifying backup integrity..."
if ! md5sum -c "${BACKUP_FILE}.md5"; then
echo "✗ Backup integrity check failed!"
exit 1
fi
echo "✓ Backup integrity verified"
fi
# Safety prompt (unless --no-prompt)
if [ "$NO_PROMPT" = false ]; then
echo ""
echo "⚠️ WARNING: This will OVERWRITE all data in database '$TARGET_DB'"
echo "⚠️ Make sure you have a recent backup before proceeding!"
echo ""
read -p "Are you sure you want to continue? (yes/no): " -r
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
echo "Restore cancelled by user."
exit 0
fi
fi
# Create safety backup of target database (if not staging/dev)
if [[ ! "$TARGET_DB" =~ (staging|dev|test) ]]; then
echo "Creating safety backup of $TARGET_DB before restore..."
SAFETY_BACKUP="/backups/postgres/safety_${TARGET_DB}_$(date +%Y%m%d_%H%M%S).dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$TARGET_DB" -f "$SAFETY_BACKUP"
echo "✓ Safety backup created: $SAFETY_BACKUP"
fi
# Terminate active connections to target database
echo "Terminating active connections to $TARGET_DB..."
psql -h "$DB_HOST" -U "$DB_USER" -d postgres <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$TARGET_DB' AND pid <> pg_backend_pid();
SQL
# Restore database
echo "Restoring database from $BACKUP_FILE..."
if pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" --clean --if-exists --verbose "$BACKUP_FILE"; then
echo "✓ Database restored successfully"
else
echo "✗ Restore failed!"
exit 1
fi
# Verify restore
echo "Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
USER_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM auth.users;" 2>/dev/null || echo "N/A")
echo "Tables restored: $TABLE_COUNT"
echo "Users in auth.users: $USER_COUNT"
# Rebuild statistics
echo "Rebuilding database statistics..."
psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -c "ANALYZE;"
echo "✓ Statistics rebuilt"
echo "===== PostgreSQL Restore Completed at $(date) ====="
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run smoke tests: npm run test:smoke"
echo "3. Check logs for errors"
echo "4. Notify team that restore is complete"
exit 0
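A typical staging refresh using this script (paths illustrative):
LATEST=$(ls -t /backups/postgres/full_*.dump | head -1)
./scripts/restore-postgres.sh "$LATEST" --target=erp_generic_staging --no-prompt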
6.2 Schema-Level Restore
# Restore single schema
SCHEMA="financial"
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA --clean --if-exists "$BACKUP_FILE"
6.3 Table-Level Restore
# Restore single table (pg_restore's -t takes a bare table name;
# pair it with -n to select the schema)
SCHEMA="auth"
TABLE="users"
BACKUP_FILE="/backups/postgres/full_20251124_020000.dump"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA -t $TABLE --data-only "$BACKUP_FILE"
7. POINT-IN-TIME RECOVERY (PITR)
7.1 PITR Process
Use Case: Restore database to specific point in time (e.g., before accidental DELETE query)
Requirements:
- Physical base backup (taken with pg_basebackup; a logical pg_dump file cannot be used for PITR)
- WAL archives from base backup to target time
Steps:
#!/bin/bash
# Point-in-Time Recovery (PITR)
TARGET_TIME="2025-11-24 14:30:00"
BASE_BACKUP="/backups/basebackup/base_20251124/base.tar.gz"  # physical base backup from pg_basebackup (path illustrative)
WAL_ARCHIVE_DIR="/backups/wal"
echo "===== Point-in-Time Recovery to $TARGET_TIME ====="
# 1. Stop PostgreSQL
docker-compose stop postgres
# 2. Backup current data (safety)
mv /data/postgres /data/postgres_backup_$(date +%Y%m%d_%H%M%S)
# 3. Restore the physical base backup
mkdir -p /data/postgres
tar -xzf "$BASE_BACKUP" -C /data/postgres
# 4. Configure recovery (PostgreSQL 12+ removed recovery.conf;
#    use postgresql.auto.conf plus a recovery.signal file)
cat >> /data/postgres/postgresql.auto.conf <<EOF
restore_command = 'cp ${WAL_ARCHIVE_DIR}/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch /data/postgres/recovery.signal
# 5. Start PostgreSQL (will enter recovery mode)
docker-compose start postgres
# 6. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep -i recovery
echo "Recovery complete! Database restored to $TARGET_TIME"
7.2 PITR Best Practices
- Test PITR quarterly in non-production environment
- Document recovery times (usually 30 min - 2 hours depending on WAL size)
- Automate WAL archiving to S3/cloud storage
- Monitor WAL archive lag (should be <5 minutes)
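For the archive-lag check in the last bullet, pg_stat_archiver exposes everything needed; a minimal monitoring query:
SELECT archived_count,
last_archived_wal,
now() - last_archived_time AS time_since_last_archive,
failed_count,
last_failed_wal
FROM pg_stat_archiver;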
8. DISASTER RECOVERY PLAYBOOK
8.1 Disaster Scenarios
Scenario 1: Complete Data Center Failure
Symptoms:
- All servers unreachable
- Network connectivity lost
- Physical infrastructure damaged
Recovery Steps:
# 1. Provision new infrastructure (AWS, Azure, GCP)
terraform apply -var-file=disaster-recovery.tfvars
# 2. Download latest backup from cloud storage
aws s3 sync s3://erp-generic-backups-prod/postgres/ /backups/postgres/ --exclude "*" --include "full_*"
# 3. Deploy Docker containers
docker-compose -f docker-compose.prod.yml up -d postgres redis
# 4. Restore database
LATEST_BACKUP=$(ls -t /backups/postgres/full_*.dump | head -1)
./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
# 5. Restore WAL archives for PITR
aws s3 sync s3://erp-generic-backups-prod/wal/ /backups/wal/
# 6. Deploy application
docker-compose -f docker-compose.prod.yml up -d backend frontend nginx
# 7. Update DNS (point to new infrastructure)
# Manual step or use Route53/CloudFlare API
# 8. Verify functionality
./scripts/health-check.sh
npm run test:smoke
# 9. Notify stakeholders
# Send email/Slack notification
Estimated RTO: 4 hours Estimated RPO: 15 minutes (last WAL archive)
Scenario 2: Accidental Data Deletion
Symptoms:
- Critical data deleted by user error
- "DELETE FROM sales.orders WHERE..." without WHERE clause
Recovery Steps:
# 1. Identify deletion time
SELECT * FROM system.audit_logs WHERE event = 'DELETE' AND table_name = 'sales.orders' ORDER BY created_at DESC LIMIT 10;
# 2. Use PITR to restore to before deletion
./scripts/pitr-restore.sh --target-time="2025-11-24 14:25:00"
# 3. Export deleted records from restored database
psql -h restored-db -U erp_user -d erp_generic -c "COPY (SELECT * FROM sales.orders WHERE created_at >= '2025-11-24 14:00:00') TO '/tmp/recovered_orders.csv' CSV HEADER;"
# 4. Import recovered records to production
psql -h postgres -U erp_user -d erp_generic -c "\COPY sales.orders FROM '/tmp/recovered_orders.csv' CSV HEADER;"
echo "✓ Deleted records recovered"
Estimated RTO: 1 hour Estimated RPO: 0 (no data loss if caught quickly)
Scenario 3: Database Corruption
Symptoms:
- PostgreSQL fails to start
- "corrupt page" errors in logs
- Data inconsistencies
Recovery Steps:
# 1. Last resort only: pg_resetwal discards WAL and can lose committed
#    transactions; prefer restoring from backup. The server must be stopped.
docker-compose stop postgres
docker-compose run --rm -u postgres postgres pg_resetwal /var/lib/postgresql/data
# 2. If repair fails, restore from backup
./scripts/restore-postgres.sh /backups/postgres/full_20251124_020000.dump
# 3. Run VACUUM and ANALYZE
psql -h postgres -U erp_user -d erp_generic -c "VACUUM FULL; ANALYZE;"
# 4. Rebuild indexes
psql -h postgres -U erp_user -d erp_generic -c "REINDEX DATABASE erp_generic;"
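Before concluding the cluster is corrupt, the amcheck extension bundled with PostgreSQL can validate B-tree indexes; a minimal sketch (the index name is illustrative):
psql -h postgres -U erp_user -d erp_generic <<SQL
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('auth.users_pkey'::regclass);
SQL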
9. BACKUP TESTING
9.1 Monthly Restore Test
File: scripts/test-restore.sh
#!/bin/bash
# Backup restore test (scheduled via cron; see 3.4)
set -euo pipefail
BACKUP_FILE=$(ls -t /backups/postgres/full_*.dump | head -1)
TEST_DB="erp_generic_restore_test"
echo "===== Backup Restore Test ====="
echo "Backup: $BACKUP_FILE"
# 1. Drop test database if exists
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"
# 2. Create test database
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE $TEST_DB;"
# 3. Restore backup
pg_restore -h postgres -U erp_user -d $TEST_DB --clean --if-exists "$BACKUP_FILE"
# 4. Run smoke tests
psql -h postgres -U erp_user -d $TEST_DB <<SQL
-- Verify table counts
SELECT COUNT(*) AS user_count FROM auth.users;
SELECT COUNT(*) AS partner_count FROM core.partners;
SELECT COUNT(*) AS order_count FROM sales.orders;
-- Verify data integrity
SELECT 'OK' AS status WHERE (
SELECT COUNT(*) FROM auth.users
) > 0;
SQL
# 5. Cleanup
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE $TEST_DB;"
echo "✓ Backup restore test passed"
9.2 Quarterly DR Drill
Checklist:
- Provision new infrastructure (staging environment)
- Restore from cloud backup (S3)
- Verify all 9 schemas restored
- Run full test suite (unit + integration + E2E)
- Measure RTO (actual time to restore)
- Measure RPO (data loss amount)
- Document findings and improvements
- Update DR playbook
Success Criteria:
- RTO < 4 hours
- RPO < 15 minutes
- All tests passing
- Zero critical data loss
10. REFERENCES
Document: BACKUP-RECOVERY.md | Version: 1.0 | Last Updated: 2025-11-24