erp-core/docs/07-devops/BACKUP-RECOVERY.md

BACKUP & RECOVERY - ERP Generic

Last updated: 2025-11-24 | Owner: DevOps Team / DBA Team | Status: Production-Ready


TABLE OF CONTENTS

  1. Overview
  2. Backup Strategy
  3. Backup Scripts
  4. Multi-Tenant Backup Isolation
  5. Retention Policy
  6. Recovery Procedures
  7. Point-in-Time Recovery (PITR)
  8. Disaster Recovery Playbook
  9. Backup Testing
  10. References

1. OVERVIEW

1.1 Backup Architecture

┌─────────────────────────────────────────────────────────────┐
│                   PostgreSQL 16 Database                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │  auth    │  │  core    │  │ financial│  │ inventory│   │
│  │  schema  │  │  schema  │  │  schema  │  │  schema  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ purchase │  │  sales   │  │ analytics│  │ projects │   │
│  │  schema  │  │  schema  │  │  schema  │  │  schema  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │           │
│  ┌──────────┐                                               │
│  │  system  │                                               │
│  │  schema  │                                               │
│  └────┬─────┘                                               │
└───────┼───────────────────────────────────────────────────┘
        │
        ↓ (Automated backup every 4 hours)
┌─────────────────────────────────────────────────────────────┐
│              Local Backup Storage (/backups)                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Full Backup  │  │  Incremental │  │   Per-Schema │     │
│  │   (Daily)    │  │  (4 hours)   │  │    Backups   │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼──────────────────┼──────────────────┼─────────────┘
          │                  │                  │
          │ (Sync every hour)│                  │
          ↓                  ↓                  ↓
┌─────────────────────────────────────────────────────────────┐
│           Cloud Storage (S3 / Azure Blob / GCS)             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Versioning Enabled | Lifecycle Rules | Encryption   │  │
│  │  Retention: 7d + 4w + 12m                            │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              WAL (Write-Ahead Logs) Archive                 │
│  Continuous archiving for Point-in-Time Recovery (PITR)    │
│  Retention: 7 days                                          │
└─────────────────────────────────────────────────────────────┘

1.2 Backup Objectives

RTO (Recovery Time Objective): 4 hours

  • Maximum time to restore service after failure

RPO (Recovery Point Objective): 15 minutes

  • Maximum data loss acceptable (via WAL archiving)

Data Durability: 99.999999999% (11 nines)

  • Multi-region cloud storage replication

Backup Types:

  1. Full Backup: Complete database snapshot (daily at 2:00 AM)
  2. Incremental Backup: Changes since last full backup (every 4 hours)
  3. WAL Archive: Continuous archiving for PITR (every 60 seconds)
  4. Per-Schema Backup: Individual schema backups for multi-tenant isolation (daily)

2. BACKUP STRATEGY

2.1 Backup Schedule

| Backup Type | Frequency | Retention | Size (Est.) | Duration |
|---|---|---|---|---|
| Full Backup | Daily (2:00 AM) | 7 days local + 12 months cloud | 50-100 GB | 30-45 min |
| Incremental Backup | Every 4 hours | 7 days local | 5-10 GB | 5-10 min |
| WAL Archive | Continuous (every 16 MB) | 7 days | 100-200 GB/week | Real-time |
| Per-Schema Backup | Daily (3:00 AM) | 7 days local + 4 weeks cloud | 5-15 GB each | 5-10 min each |
| Config Backup | On change + daily | 30 days | <100 MB | <1 min |

2.2 Backup Storage Locations

Primary Storage (Local):

  • Path: /backups/postgres/
  • Filesystem: XFS (optimized for large files)
  • Capacity: 500 GB minimum
  • RAID: RAID 10 (performance + redundancy)

Secondary Storage (Cloud):

  • Provider: AWS S3 / Azure Blob Storage / Google Cloud Storage
  • Bucket: erp-generic-backups-prod
  • Region: Multi-region replication (e.g., us-east-1 + us-west-2)
  • Encryption: AES-256 at rest
  • Versioning: Enabled (30 versions max)
  • Lifecycle Rules:
    • Move to Glacier after 30 days
    • Delete after 1 year (except annual backups)

Tertiary Storage (Offsite):

  • Type: Tape backup / Cold storage
  • Frequency: Monthly
  • Retention: 7 years (compliance requirement)

2.3 Backup Validation

Automated Validation (Daily):

# 1. Record a checksum at backup time (verified later with `md5sum -c`)
md5sum /backups/postgres/full_20251124_020000.dump > /backups/postgres/full_20251124_020000.dump.md5

# 2. Test restore to staging environment (weekly)
pg_restore --dbname=erp_generic_staging --clean --if-exists /backups/postgres/full_20251124_020000.dump

# 3. Run smoke tests on restored database
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM auth.users;"
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM core.partners;"

# 4. Compare row counts with production (n_live_tup is an estimate from the
#    statistics collector, so expect minor differences; qualify table names
#    with the schema to avoid collisions)
diff <(psql -d erp_generic -tAc "SELECT schemaname||'.'||tablename, n_live_tup FROM pg_stat_user_tables ORDER BY 1") \
     <(psql -d erp_generic_staging -tAc "SELECT schemaname||'.'||tablename, n_live_tup FROM pg_stat_user_tables ORDER BY 1")

Manual Validation (Monthly):

  • Full disaster recovery drill
  • Restore to isolated environment
  • Verify business-critical data
  • Test application functionality
  • Document findings in post-mortem

3. BACKUP SCRIPTS

3.1 Full Backup Script

File: scripts/backup-postgres.sh

#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Full Backup Script
# Performs full database backup with multi-schema support
# =====================================================

set -euo pipefail

# Configuration
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
RETENTION_DAYS=7
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

# Logging
LOG_FILE="/var/log/erp-generic/backup.log"
exec > >(tee -a "$LOG_FILE")
exec 2>&1

echo "===== PostgreSQL Backup Started at $(date) ====="

# Create backup directory if not exists
mkdir -p "$BACKUP_DIR"

# 1. Full Database Backup
echo "1. Creating full database backup..."
FULL_BACKUP_FILE="${BACKUP_DIR}/full_${TIMESTAMP}.dump"
# Under `set -e` a bare `$?` check is unreachable on failure, so run the
# dump as the `if` condition itself
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -v -d "$DB_NAME" -f "$FULL_BACKUP_FILE"; then
    echo "✓ Full backup created: $FULL_BACKUP_FILE"
    FILE_SIZE=$(du -h "$FULL_BACKUP_FILE" | cut -f1)
    echo "  Size: $FILE_SIZE"
else
    echo "✗ Full backup failed!"
    exit 1
fi

# 2. Generate MD5 checksum
echo "2. Generating checksum..."
md5sum "$FULL_BACKUP_FILE" > "${FULL_BACKUP_FILE}.md5"
echo "✓ Checksum saved: ${FULL_BACKUP_FILE}.md5"

# 3. Per-Schema Backups (Multi-Tenant Isolation)
echo "3. Creating per-schema backups..."
SCHEMAS=("auth" "core" "financial" "inventory" "purchase" "sales" "analytics" "projects" "system")

for schema in "${SCHEMAS[@]}"; do
    SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
    echo "  Backing up schema: $schema"

    # Run as the `if` condition so one failed schema doesn't abort the loop
    # under `set -e`
    if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -n "$schema" -d "$DB_NAME" -f "$SCHEMA_BACKUP_FILE"; then
        SCHEMA_SIZE=$(du -h "$SCHEMA_BACKUP_FILE" | cut -f1)
        echo "  ✓ $schema backup created ($SCHEMA_SIZE)"
    else
        echo "  ✗ $schema backup failed!"
    fi
done

# 4. Backup Database Roles and Permissions
echo "4. Backing up database roles..."
ROLES_BACKUP_FILE="${BACKUP_DIR}/roles_${TIMESTAMP}.sql"
pg_dumpall -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" --roles-only -f "$ROLES_BACKUP_FILE"
echo "✓ Roles backup created: $ROLES_BACKUP_FILE"

# 5. Backup PostgreSQL Configuration
echo "5. Backing up PostgreSQL configuration..."
CONFIG_BACKUP_DIR="${BACKUP_DIR}/config_${TIMESTAMP}"
mkdir -p "$CONFIG_BACKUP_DIR"

# Copy config files (if accessible)
if [ -f /etc/postgresql/16/main/postgresql.conf ]; then
    cp /etc/postgresql/16/main/postgresql.conf "$CONFIG_BACKUP_DIR/"
    cp /etc/postgresql/16/main/pg_hba.conf "$CONFIG_BACKUP_DIR/"
    echo "✓ Config files backed up"
fi

# 6. Backup WAL Archive Status
echo "6. Recording WAL archive status..."
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -tAc "SELECT * FROM pg_stat_archiver;" > "${BACKUP_DIR}/wal_status_${TIMESTAMP}.txt"

# 7. Upload to Cloud Storage (S3)
if [ -n "${AWS_S3_BUCKET:-}" ]; then
    echo "7. Uploading to S3..."
    aws s3 cp "$FULL_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/" --storage-class STANDARD_IA
    aws s3 cp "${FULL_BACKUP_FILE}.md5" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/"

    # Upload per-schema backups
    for schema in "${SCHEMAS[@]}"; do
        SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
        if [ -f "$SCHEMA_BACKUP_FILE" ]; then
            aws s3 cp "$SCHEMA_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/schemas/" --storage-class STANDARD_IA
        fi
    done

    echo "✓ Backup uploaded to S3"
fi

# 8. Cleanup Old Backups (Local)
echo "8. Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.sql" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type f -name "*.md5" -mtime +$RETENTION_DAYS -delete
find "$BACKUP_DIR" -type d -name "config_*" -mtime +$RETENTION_DAYS -exec rm -rf {} + 2>/dev/null || true
echo "✓ Old backups cleaned up"

# 9. Verify Backup Integrity
echo "9. Verifying backup integrity..."
if md5sum -c "${FULL_BACKUP_FILE}.md5"; then
    echo "✓ Backup integrity verified"
else
    echo "✗ Backup integrity check failed!"
    exit 1
fi

# 10. Send Notification
echo "10. Sending backup notification..."
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)

# Slack notification (optional)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    curl -X POST "$SLACK_WEBHOOK_URL" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"✅ PostgreSQL backup completed successfully\n• Database: $DB_NAME\n• Size: $FILE_SIZE\n• Timestamp: $TIMESTAMP\n• Total backup dir size: $BACKUP_SIZE\"}"
fi

echo "===== PostgreSQL Backup Completed at $(date) ====="
echo "Total backup size: $BACKUP_SIZE"
echo "Backup location: $BACKUP_DIR"

# Exit successfully
exit 0

3.2 Incremental Backup Script

File: scripts/backup-postgres-incremental.sh

#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Incremental Backup Script
# NOTE: pg_dump cannot produce true incrementals; this takes a full logical
# dump on the incremental schedule. For real incrementals, rely on WAL
# archiving (section 3.3) or pg_basebackup.
# =====================================================

set -euo pipefail

BACKUP_DIR="/backups/postgres/incremental"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

mkdir -p "$BACKUP_DIR"

echo "===== Incremental Backup Started at $(date) ====="

# Get last full backup timestamp
LAST_FULL_BACKUP=$(ls -t /backups/postgres/full_*.dump | head -1 | grep -oP '\d{8}_\d{6}')
echo "Last full backup: $LAST_FULL_BACKUP"

# Incremental backup: Export only changed data (simplified approach)
# In production, consider using WAL-based incremental backups or pg_basebackup

INCREMENTAL_FILE="${BACKUP_DIR}/incremental_${TIMESTAMP}.dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$DB_NAME" -f "$INCREMENTAL_FILE"

echo "✓ Incremental backup created: $INCREMENTAL_FILE"
echo "===== Incremental Backup Completed at $(date) ====="

3.3 WAL Archiving Configuration

File: postgresql.conf (WAL archiving section)

# WAL Settings for Point-in-Time Recovery
wal_level = replica
archive_mode = on
# NOTE: postgresql.conf does not expand environment variables such as
# ${AWS_S3_BUCKET}; delegate the cloud copy to a wrapper script instead
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
archive_timeout = 60s
max_wal_senders = 3
wal_keep_size = 1GB
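PostgreSQL runs archive_command via the shell, but postgresql.conf itself does not expand environment variables, so cloud-copy logic is commonly delegated to a wrapper script. A minimal sketch, assuming a hypothetical path like /opt/erp-generic/scripts/archive-wal.sh invoked as `archive_command = '/opt/erp-generic/scripts/archive-wal.sh %p %f'`:

```shell
# Hypothetical wrapper for archive_command; PostgreSQL passes %p (path to the
# WAL segment) and %f (segment file name) as the two arguments.
archive_wal() {
    local wal_path="$1"    # %p: server-relative path to the WAL segment
    local wal_file="$2"    # %f: segment file name
    local archive="${WAL_ARCHIVE_DIR:-/backups/wal}"

    # archive_command must refuse to overwrite an already-archived segment
    [ -f "${archive}/${wal_file}" ] && return 0

    cp "$wal_path" "${archive}/${wal_file}" || return 1

    # Returning non-zero here would make PostgreSQL retry this segment later
    if [ -n "${AWS_S3_BUCKET:-}" ]; then
        aws s3 cp "${archive}/${wal_file}" "s3://${AWS_S3_BUCKET}/wal/"
    fi
}
```

A stricter setup would fail the whole command when the S3 copy fails, so PostgreSQL keeps retrying until the segment is safely off-host.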

3.4 Cron Schedule

File: /etc/cron.d/erp-backup

# ERP Generic Backup Schedule

# Full backup daily at 2:00 AM
0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh >> /var/log/erp-generic/backup.log 2>&1

# Incremental backup every 4 hours
0 */4 * * * root /opt/erp-generic/scripts/backup-postgres-incremental.sh >> /var/log/erp-generic/backup.log 2>&1

# Verify backups daily at 4:00 AM
0 4 * * * root /opt/erp-generic/scripts/verify-backup.sh >> /var/log/erp-generic/backup.log 2>&1

# Cleanup old WAL files daily at 5:00 AM
0 5 * * * root find /backups/wal -type f -mtime +7 -delete

# Weekly full disaster recovery test (Sundays at 3:00 AM)
0 3 * * 0 root /opt/erp-generic/scripts/test-restore.sh >> /var/log/erp-generic/backup-test.log 2>&1
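The verify-backup.sh referenced in the schedule above is not shown in this document; a minimal sketch of what it might check, with paths and thresholds as assumptions:

```shell
#!/bin/bash
# Hypothetical sketch of scripts/verify-backup.sh: checks that the newest
# full backup exists, is fresh, and matches its recorded checksum.

BACKUP_DIR="${BACKUP_DIR:-/backups/postgres}"
MAX_AGE_HOURS="${MAX_AGE_HOURS:-26}"   # daily schedule plus some slack

# Newest full backup by modification time
latest_backup() {
    ls -t "$BACKUP_DIR"/full_*.dump 2>/dev/null | head -1
}

verify_latest() {
    local file age_sec
    file=$(latest_backup)
    if [ -z "$file" ]; then
        echo "✗ No full backup found in $BACKUP_DIR"
        return 1
    fi

    # Freshness: the newest backup must be younger than MAX_AGE_HOURS
    age_sec=$(( $(date +%s) - $(stat -c %Y "$file") ))
    if [ "$age_sec" -gt $(( MAX_AGE_HOURS * 3600 )) ]; then
        echo "✗ Latest backup is stale: $file"
        return 1
    fi

    # Integrity: checksum must match the .md5 written at backup time
    if ! md5sum -c "${file}.md5" >/dev/null 2>&1; then
        echo "✗ Checksum mismatch: $file"
        return 1
    fi

    echo "✓ Backup verified: $file"
}
```

A fuller version could also run `pg_restore --list` on the archive to confirm it is readable without touching a database.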

4. MULTI-TENANT BACKUP ISOLATION

4.1 Per-Tenant Backup Strategy

Why Per-Schema Backups?

  • Restore individual tenant without affecting others
  • Compliance: GDPR right to erasure (delete tenant data)
  • Tenant migration to dedicated instance
  • Faster restore times for single tenant issues
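To size these per-schema backups (and sanity-check the estimates in section 2.1), the current on-disk footprint per schema can be queried; a sketch:

```sql
-- Approximate on-disk size per schema (indexes and TOAST included); the
-- logical dump will be smaller, but this gives a useful upper bound for
-- backup size and duration.
SELECT schemaname,
       pg_size_pretty(SUM(pg_total_relation_size(
           quote_ident(schemaname) || '.' || quote_ident(tablename)
       ))) AS total_size
FROM pg_tables
WHERE schemaname IN ('auth', 'core', 'financial', 'inventory',
                     'purchase', 'sales', 'analytics', 'projects', 'system')
GROUP BY schemaname
ORDER BY schemaname;
```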

Backup Structure:

/backups/postgres/
├── full_20251124_020000.dump              # All schemas
├── auth_20251124_020000.dump              # Auth schema only
├── core_20251124_020000.dump              # Core schema only
├── financial_20251124_020000.dump         # Financial schema only
├── inventory_20251124_020000.dump         # Inventory schema only
├── purchase_20251124_020000.dump          # Purchase schema only
├── sales_20251124_020000.dump             # Sales schema only
├── analytics_20251124_020000.dump         # Analytics schema only
├── projects_20251124_020000.dump          # Projects schema only
└── system_20251124_020000.dump            # System schema only

4.2 Restore Single Tenant

#!/bin/bash
# Restore single tenant (schema isolation)

TENANT_ID="tenant-abc"
SCHEMA_NAME="financial"  # Example: restore financial schema only
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"

echo "Restoring schema $SCHEMA_NAME for tenant $TENANT_ID..."

# Option 1: Drop and restore entire schema
# WARNING: schemas here are shared modules, so this replaces the data of ALL
# tenants in the schema, not just $TENANT_ID's
psql -h postgres -U erp_user -d erp_generic -c "DROP SCHEMA IF EXISTS $SCHEMA_NAME CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA_NAME "$BACKUP_FILE"

# Option 2: Restore into a scratch database, then copy only this tenant's rows
# (pg_restore cannot rename a schema on restore, hence the scratch database)
createdb -h postgres -U erp_user erp_restore_scratch
pg_restore -h postgres -U erp_user -d erp_restore_scratch -n $SCHEMA_NAME "$BACKUP_FILE"

# Copy tenant-specific data back into production (client-side \copy)
psql -h postgres -U erp_user -d erp_restore_scratch \
    -c "\copy (SELECT * FROM $SCHEMA_NAME.accounts WHERE tenant_id = '$TENANT_ID') TO '/tmp/tenant_accounts.csv' CSV"
psql -h postgres -U erp_user -d erp_generic \
    -c "\copy $SCHEMA_NAME.accounts FROM '/tmp/tenant_accounts.csv' CSV"

# Drop the scratch database
dropdb -h postgres -U erp_user erp_restore_scratch

echo "✓ Tenant $TENANT_ID restored from $SCHEMA_NAME schema"

5. RETENTION POLICY

5.1 Retention Rules

Local Storage (Fast Recovery):

  • Full Backups: 7 days
  • Incremental Backups: 7 days
  • WAL Archives: 7 days
  • Config Backups: 30 days

Cloud Storage (Long-term Retention):

  • Daily Backups: 30 days (Standard storage)
  • Weekly Backups: 12 weeks (Standard-IA storage)
  • Monthly Backups: 12 months (Glacier storage)
  • Annual Backups: 7 years (Glacier Deep Archive) - Compliance

Lifecycle Policy (AWS S3 Example):

{
  "Rules": [
    {
      "Id": "TransitionToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        }
      ]
    },
    {
      "Id": "TransitionToGlacier",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "Id": "DeleteOldBackups",
      "Status": "Enabled",
      "Expiration": {
        "Days": 365
      },
      "Filter": {
        "Prefix": "postgres/daily/"
      }
    }
  ]
}
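Lifecycle rules only transition or expire objects within a prefix; populating the weekly and monthly tiers described in 5.1 from the daily uploads needs a small promotion job. A sketch — the prefix layout and tier names are assumptions, not part of this repo:

```shell
# Decide which retention tiers a backup taken on a given date belongs to,
# matching the 5.1 policy: Sundays also feed the weekly tier, the 1st of
# the month also feeds the monthly tier.
tiers_for_date() {
    local d="$1"
    local tiers="daily"
    [ "$(date -d "$d" +%u)" -eq 7 ] && tiers="$tiers weekly"
    [ "$(date -d "$d" +%d)" = "01" ] && tiers="$tiers monthly"
    echo "$tiers"
}

# A nightly job could then copy the newest dump into each extra tier, e.g.:
#   for tier in $(tiers_for_date "$(date +%F)"); do
#       aws s3 cp "s3://${AWS_S3_BUCKET}/postgres/daily/${FILE}" \
#                 "s3://${AWS_S3_BUCKET}/postgres/${tier}/${FILE}"
#   done
```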

5.2 Compliance Requirements

GDPR (EU):

  • Right to erasure: Ability to delete tenant data
  • Backup encryption: AES-256
  • Access logging: Who accessed what backup

SOC 2:

  • Backup testing: Monthly restore drills
  • Access controls: Role-based access to backups
  • Audit trail: Log all backup/restore operations

Industry-Specific:

  • Healthcare (HIPAA): 6 years retention
  • Financial (SOX): 7 years retention
  • Government: Variable (check local regulations)

6. RECOVERY PROCEDURES

6.1 Full Database Restore

File: scripts/restore-postgres.sh

#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Restore Script
# Restores database from backup file
# =====================================================

set -euo pipefail

# Usage
if [ $# -lt 1 ]; then
    echo "Usage: $0 <backup_file> [--target=<database>] [--no-prompt]"
    echo "Example: $0 /backups/postgres/full_20251124_020000.dump --target=erp_generic_staging"
    exit 1
fi

BACKUP_FILE="$1"
TARGET_DB="erp_generic"
NO_PROMPT=false

# Parse optional flags (everything after the backup file)
for arg in "${@:2}"; do
    case $arg in
        --target=*)
            TARGET_DB="${arg#*=}"
            ;;
        --no-prompt)
            NO_PROMPT=true
            ;;
    esac
done

DB_HOST="${POSTGRES_HOST:-postgres}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

echo "===== PostgreSQL Restore Started at $(date) ====="
echo "Backup file: $BACKUP_FILE"
echo "Target database: $TARGET_DB"

# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
    echo "✗ Backup file not found: $BACKUP_FILE"
    exit 1
fi

# Verify checksum if exists
if [ -f "${BACKUP_FILE}.md5" ]; then
    echo "Verifying backup integrity..."
    if ! md5sum -c "${BACKUP_FILE}.md5"; then
        echo "✗ Backup integrity check failed!"
        exit 1
    fi
    echo "✓ Backup integrity verified"
fi

# Safety prompt (unless --no-prompt)
if [ "$NO_PROMPT" = false ]; then
    echo ""
    echo "⚠️  WARNING: This will OVERWRITE all data in database '$TARGET_DB'"
    echo "⚠️  Make sure you have a recent backup before proceeding!"
    echo ""
    read -p "Are you sure you want to continue? (yes/no): " -r
    if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
        echo "Restore cancelled by user."
        exit 0
    fi
fi

# Create safety backup of target database (if not staging/dev)
if [[ ! "$TARGET_DB" =~ (staging|dev|test) ]]; then
    echo "Creating safety backup of $TARGET_DB before restore..."
    SAFETY_BACKUP="/backups/postgres/safety_${TARGET_DB}_$(date +%Y%m%d_%H%M%S).dump"
    pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$TARGET_DB" -f "$SAFETY_BACKUP"
    echo "✓ Safety backup created: $SAFETY_BACKUP"
fi

# Terminate active connections to target database
echo "Terminating active connections to $TARGET_DB..."
psql -h "$DB_HOST" -U "$DB_USER" -d postgres <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$TARGET_DB' AND pid <> pg_backend_pid();
SQL

# Restore database
echo "Restoring database from $BACKUP_FILE..."
if pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" --clean --if-exists --verbose "$BACKUP_FILE"; then
    echo "✓ Database restored successfully"
else
    echo "✗ Restore failed!"
    exit 1
fi

# Verify restore
echo "Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
USER_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM auth.users;" 2>/dev/null || echo "N/A")

echo "Tables restored: $TABLE_COUNT"
echo "Users in auth.users: $USER_COUNT"

# Rebuild statistics
echo "Rebuilding database statistics..."
psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -c "ANALYZE;"
echo "✓ Statistics rebuilt"

echo "===== PostgreSQL Restore Completed at $(date) ====="
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run smoke tests: npm run test:smoke"
echo "3. Check logs for errors"
echo "4. Notify team that restore is complete"

exit 0

6.2 Schema-Level Restore

# Restore single schema
SCHEMA="financial"
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"

pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA --clean --if-exists "$BACKUP_FILE"

6.3 Table-Level Restore

# Restore single table (pg_restore's -t takes the bare table name; pass the
# schema with -n). Note: --data-only APPENDS rows to the existing table, so
# consider restoring into a scratch database first if duplicates are possible.
SCHEMA="auth"
TABLE="users"
BACKUP_FILE="/backups/postgres/full_20251124_020000.dump"

pg_restore -h postgres -U erp_user -d erp_generic -n $SCHEMA -t $TABLE --data-only "$BACKUP_FILE"

7. POINT-IN-TIME RECOVERY (PITR)

7.1 PITR Process

Use Case: Restore database to specific point in time (e.g., before accidental DELETE query)

Requirements:

  • Physical base backup taken with pg_basebackup (a pg_dump archive cannot be used for PITR)
  • WAL archives covering the span from the base backup to the target time

Steps:

#!/bin/bash
# Point-in-Time Recovery (PITR) - PostgreSQL 16
# Requires a PHYSICAL base backup (pg_basebackup) plus the WAL archive
# covering the span from that backup to the target time.

TARGET_TIME="2025-11-24 14:30:00"
BASE_BACKUP_DIR="/backups/basebackup/latest"   # taken earlier with pg_basebackup
WAL_ARCHIVE_DIR="/backups/wal"

echo "===== Point-in-Time Recovery to $TARGET_TIME ====="

# 1. Stop PostgreSQL
docker-compose stop postgres

# 2. Move current data aside (safety)
mv /data/postgres /data/postgres_backup_$(date +%Y%m%d_%H%M%S)

# 3. Restore the base backup into the data directory
mkdir -p /data/postgres
cp -a "$BASE_BACKUP_DIR"/. /data/postgres/

# 4. Configure recovery (recovery.conf was removed in PostgreSQL 12;
#    use postgresql.conf settings plus a recovery.signal file)
cat >> /data/postgres/postgresql.conf <<EOF
restore_command = 'cp ${WAL_ARCHIVE_DIR}/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch /data/postgres/recovery.signal

# 5. Start PostgreSQL (enters targeted recovery, then promotes)
docker-compose start postgres

# 6. Monitor recovery progress (Ctrl+C once recovery completes)
tail -f /var/log/postgresql/postgresql-*.log | grep -i recovery

echo "Recovery complete! Database restored to $TARGET_TIME"

7.2 PITR Best Practices

  • Test PITR quarterly in non-production environment
  • Document recovery times (usually 30 min - 2 hours depending on WAL size)
  • Automate WAL archiving to S3/cloud storage
  • Monitor WAL archive lag (should be <5 minutes)
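The archive-lag guideline above can be checked from pg_stat_archiver; a monitoring sketch, with hosts and the 300-second threshold as illustrative defaults:

```shell
# Seconds since the last WAL segment was archived; a growing failed_count in
# pg_stat_archiver is another sign of archive trouble.
wal_archive_lag_seconds() {
    psql -h "${POSTGRES_HOST:-postgres}" -U "${POSTGRES_USER:-erp_user}" \
         -d "${POSTGRES_DB:-erp_generic}" -tAc \
         "SELECT COALESCE(EXTRACT(EPOCH FROM now() - last_archived_time), 1e9)::int
          FROM pg_stat_archiver;"
}

# Compare a lag value (seconds) against the <5 minute guideline
check_wal_lag() {
    local lag="$1"
    local threshold="${2:-300}"
    if [ "$lag" -gt "$threshold" ]; then
        echo "ALERT: WAL archive lag ${lag}s exceeds ${threshold}s"
        return 1
    fi
    echo "OK: WAL archive lag ${lag}s"
}

# Typical use: check_wal_lag "$(wal_archive_lag_seconds)"
```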

8. DISASTER RECOVERY PLAYBOOK

8.1 Disaster Scenarios

Scenario 1: Complete Data Center Failure

Symptoms:

  • All servers unreachable
  • Network connectivity lost
  • Physical infrastructure damaged

Recovery Steps:

# 1. Provision new infrastructure (AWS, Azure, GCP)
terraform apply -var-file=disaster-recovery.tfvars

# 2. Download latest backup from cloud storage
aws s3 sync s3://erp-generic-backups-prod/postgres/ /backups/postgres/ --exclude "*" --include "*full_*"

# 3. Deploy Docker containers
docker-compose -f docker-compose.prod.yml up -d postgres redis

# 4. Restore database (dumps are stored under per-timestamp prefixes, so
#    search recursively; the timestamp in the filename sorts chronologically)
LATEST_BACKUP=$(find /backups/postgres -type f -name "full_*.dump" | sort | tail -1)
./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt

# 5. Restore WAL archives for PITR
aws s3 sync s3://erp-generic-backups-prod/wal/ /backups/wal/

# 6. Deploy application
docker-compose -f docker-compose.prod.yml up -d backend frontend nginx

# 7. Update DNS (point to new infrastructure)
# Manual step or use Route53/CloudFlare API

# 8. Verify functionality
./scripts/health-check.sh
npm run test:smoke

# 9. Notify stakeholders
# Send email/Slack notification

Estimated RTO: 4 hours | Estimated RPO: 15 minutes (last WAL archive)


Scenario 2: Accidental Data Deletion

Symptoms:

  • Critical data deleted by user error
  • "DELETE FROM sales.orders WHERE..." without WHERE clause

Recovery Steps:

# 1. Identify deletion time from the audit log
psql -h postgres -U erp_user -d erp_generic -c "SELECT * FROM system.audit_logs WHERE event = 'DELETE' AND table_name = 'sales.orders' ORDER BY created_at DESC LIMIT 10;"

# 2. Use PITR to restore to before deletion
./scripts/pitr-restore.sh --target-time="2025-11-24 14:25:00"

# 3. Export deleted records from the restored copy (client-side \copy, so the
#    CSV lands on the machine running psql, not on the database server)
psql -h restored-db -U erp_user -d erp_generic -c "\copy (SELECT * FROM sales.orders WHERE created_at >= '2025-11-24 14:00:00') TO '/tmp/recovered_orders.csv' CSV HEADER"

# 4. Import recovered records into production
psql -h postgres -U erp_user -d erp_generic -c "\copy sales.orders FROM '/tmp/recovered_orders.csv' CSV HEADER"

echo "✓ Deleted records recovered"

Estimated RTO: 1 hour | Estimated RPO: 0 (no data loss if caught quickly)


Scenario 3: Database Corruption

Symptoms:

  • PostgreSQL fails to start
  • "corrupt page" errors in logs
  • Data inconsistencies

Recovery Steps:

# 1. Assess the extent of the corruption first (read-only check; pg_amcheck
#    ships with PostgreSQL 14+)
docker-compose exec postgres pg_amcheck -U erp_user -d erp_generic --install-missing

# 2. LAST RESORT only: pg_resetwal discards WAL and can itself cause data
#    loss; run it only with the server stopped and when no usable backup exists
docker-compose exec postgres pg_resetwal /var/lib/postgresql/data

# 3. Preferred path: restore from backup
./scripts/restore-postgres.sh /backups/postgres/full_20251124_020000.dump

# 4. Run VACUUM and ANALYZE
psql -h postgres -U erp_user -d erp_generic -c "VACUUM FULL; ANALYZE;"

# 5. Rebuild indexes
psql -h postgres -U erp_user -d erp_generic -c "REINDEX DATABASE erp_generic;"

9. BACKUP TESTING

9.1 Monthly Restore Test

File: scripts/test-restore.sh

#!/bin/bash
# Monthly backup restore test

BACKUP_FILE=$(ls -t /backups/postgres/full_*.dump | head -1)
TEST_DB="erp_generic_restore_test"

echo "===== Backup Restore Test ====="
echo "Backup: $BACKUP_FILE"

# 1. Drop test database if exists
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"

# 2. Create test database
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE $TEST_DB;"

# 3. Restore backup
pg_restore -h postgres -U erp_user -d $TEST_DB --clean --if-exists "$BACKUP_FILE"

# 4. Run smoke tests
psql -h postgres -U erp_user -d $TEST_DB <<SQL
-- Verify table counts
SELECT COUNT(*) AS user_count FROM auth.users;
SELECT COUNT(*) AS partner_count FROM core.partners;
SELECT COUNT(*) AS order_count FROM sales.orders;

-- Verify data integrity
SELECT 'OK' AS status WHERE (
    SELECT COUNT(*) FROM auth.users
) > 0;
SQL

# 5. Cleanup
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE $TEST_DB;"

echo "✓ Backup restore test passed"

9.2 Quarterly DR Drill

Checklist:

  • Provision new infrastructure (staging environment)
  • Restore from cloud backup (S3)
  • Verify all 9 schemas restored
  • Run full test suite (unit + integration + E2E)
  • Measure RTO (actual time to restore)
  • Measure RPO (data loss amount)
  • Document findings and improvements
  • Update DR playbook

Success Criteria:

  • RTO < 4 hours
  • RPO < 15 minutes
  • All tests passing
  • Zero critical data loss
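Measuring the actual RTO during a drill is easiest with a small timing wrapper around the restore step; a sketch:

```shell
# Time any command and report the elapsed wall-clock time as the drill's RTO.
measure_rto() {
    local start end elapsed
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    elapsed=$(( end - start ))
    echo "RTO: $(( elapsed / 60 )) min $(( elapsed % 60 )) s"
}

# During a drill:
#   measure_rto ./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt
```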

10. REFERENCES

Internal Documentation:

External Resources:


Document: BACKUP-RECOVERY.md | Version: 1.0 | Last updated: 2025-11-24