# BACKUP & RECOVERY - ERP Generic

**Last updated:** 2025-11-24
**Owner:** DevOps Team / DBA Team
**Status:** ✅ Production-Ready

---

## TABLE OF CONTENTS

1. [Overview](#1-overview)
2. [Backup Strategy](#2-backup-strategy)
3. [Backup Scripts](#3-backup-scripts)
4. [Multi-Tenant Backup Isolation](#4-multi-tenant-backup-isolation)
5. [Retention Policy](#5-retention-policy)
6. [Recovery Procedures](#6-recovery-procedures)
7. [Point-in-Time Recovery (PITR)](#7-point-in-time-recovery-pitr)
8. [Disaster Recovery Playbook](#8-disaster-recovery-playbook)
9. [Backup Testing](#9-backup-testing)
10. [References](#10-references)

---

## 1. OVERVIEW

### 1.1 Backup Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  PostgreSQL 16 Database                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│  │   auth   │ │   core   │ │ financial│ │ inventory│        │
│  │  schema  │ │  schema  │ │  schema  │ │  schema  │        │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘        │
│       │            │            │            │              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│  │ purchase │ │  sales   │ │ analytics│ │ projects │        │
│  │  schema  │ │  schema  │ │  schema  │ │  schema  │        │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘        │
│       │            │            │            │              │
│  ┌──────────┐                                               │
│  │  system  │                                               │
│  │  schema  │                                               │
│  └────┬─────┘                                               │
└───────┼─────────────────────────────────────────────────────┘
        │
        ↓ (Automated backup every 4 hours)
┌─────────────────────────────────────────────────────────────┐
│              Local Backup Storage (/backups)                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐         │
│  │ Full Backup  │ │ Incremental  │ │ Per-Schema   │         │
│  │ (Daily)      │ │ (4 hours)    │ │ Backups      │         │
│  └──────┬───────┘ └──────┬───────┘ └──────┬───────┘         │
└─────────┼────────────────┼────────────────┼─────────────────┘
          │                │                │
          │ (Sync every hour)               │
          ↓                ↓                ↓
┌─────────────────────────────────────────────────────────────┐
│          Cloud Storage (S3 / Azure Blob / GCS)              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Versioning Enabled | Lifecycle Rules | Encryption    │   │
│  │ Retention: 7d + 4w + 12m                             │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              WAL (Write-Ahead Logs) Archive                 │
│  Continuous archiving for Point-in-Time Recovery (PITR)     │
│  Retention: 7 days                                          │
└─────────────────────────────────────────────────────────────┘
```

### 1.2 Backup Objectives

**RTO (Recovery Time Objective):** 4 hours
- Maximum time to restore service after a failure

**RPO (Recovery Point Objective):** 15 minutes
- Maximum acceptable data loss (via WAL archiving)

**Data Durability:** 99.999999999% (11 nines)
- Multi-region cloud storage replication

**Backup Types:**
1. **Full Backup:** Complete database snapshot (daily at 2:00 AM)
2. **Incremental Backup:** Changes since the last full backup (every 4 hours)
3. **WAL Archive:** Continuous archiving for PITR (every 60 seconds via `archive_timeout`)
4. **Per-Schema Backup:** Individual schema backups for multi-tenant isolation (daily)
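The 15-minute RPO holds only while WAL archiving keeps up, so the age of the newest archived segment is the number to watch. A minimal sketch of that check, run here against a throwaway `mktemp` fixture instead of the real `/backups/wal` (GNU `stat`/`ls` assumed):

```shell
# Simulated RPO check: the age of the newest archived WAL segment
# approximates worst-case data loss. The directory here is a fixture.
WAL_DIR=$(mktemp -d)
touch "$WAL_DIR/000000010000000000000042"   # fake, freshly archived segment
newest=$(ls -t "$WAL_DIR" | head -1)
now=$(date +%s)
mtime=$(stat -c %Y "$WAL_DIR/$newest")
age_min=$(( (now - mtime) / 60 ))
if [ "$age_min" -le 15 ]; then status="RPO OK"; else status="RPO BREACH"; fi
echo "newest segment: $newest (${age_min} min old) -> $status"
rm -rf "$WAL_DIR"
```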

---

## 2. BACKUP STRATEGY

### 2.1 Backup Schedule

| Backup Type | Frequency | Retention | Size (Est.) | Duration |
|-------------|-----------|-----------|-------------|----------|
| **Full Backup** | Daily (2:00 AM) | 7 days local + 12 months cloud | 50-100 GB | 30-45 min |
| **Incremental Backup** | Every 4 hours | 7 days local | 5-10 GB | 5-10 min |
| **WAL Archive** | Continuous (every 16 MB) | 7 days | 100-200 GB/week | Real-time |
| **Per-Schema Backup** | Daily (3:00 AM) | 7 days local + 4 weeks cloud | 5-15 GB each | 5-10 min each |
| **Config Backup** | On change + daily | 30 days | <100 MB | <1 min |

### 2.2 Backup Storage Locations

**Primary Storage (Local):**
- **Path:** `/backups/postgres/`
- **Filesystem:** XFS (optimized for large files)
- **Capacity:** 500 GB minimum
- **RAID:** RAID 10 (performance + redundancy)

**Secondary Storage (Cloud):**
- **Provider:** AWS S3 / Azure Blob Storage / Google Cloud Storage
- **Bucket:** `erp-generic-backups-prod`
- **Region:** Multi-region replication (e.g., us-east-1 + us-west-2)
- **Encryption:** AES-256 at rest
- **Versioning:** Enabled (30 versions max)
- **Lifecycle Rules:**
  - Move to Glacier after 30 days
  - Delete after 1 year (except annual backups)

**Tertiary Storage (Offsite):**
- **Type:** Tape backup / Cold storage
- **Frequency:** Monthly
- **Retention:** 7 years (compliance requirement)

### 2.3 Backup Validation

**Automated Validation (Daily):**
```bash
# 1. Record a checksum for the backup file
md5sum /backups/postgres/full_20251124_020000.dump > /backups/postgres/full_20251124_020000.dump.md5

# 2. Test restore to staging environment (weekly)
pg_restore --dbname=erp_generic_staging --clean --if-exists /backups/postgres/full_20251124_020000.dump

# 3. Run smoke tests on restored database
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM auth.users;"
psql -d erp_generic_staging -c "SELECT COUNT(*) FROM core.partners;"

# 4. Compare row counts with production
# (pg_stat_user_tables exposes relname, not tablename; n_live_tup is an
# estimate, so run ANALYZE on both sides first for comparable numbers)
diff <(psql -d erp_generic -tAc "SELECT schemaname || '.' || relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1") \
     <(psql -d erp_generic_staging -tAc "SELECT schemaname || '.' || relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1")
```

**Manual Validation (Monthly):**
- Full disaster recovery drill
- Restore to an isolated environment
- Verify business-critical data
- Test application functionality
- Document findings in a post-mortem

---

## 3. BACKUP SCRIPTS

### 3.1 Full Backup Script

**File:** `scripts/backup-postgres.sh`

```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Full Backup Script
# Performs full database backup with multi-schema support
# =====================================================

set -euo pipefail

# Configuration
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
RETENTION_DAYS=7
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

# Logging
LOG_FILE="/var/log/erp-generic/backup.log"
exec > >(tee -a "$LOG_FILE")
exec 2>&1

echo "===== PostgreSQL Backup Started at $(date) ====="

# Create backup directory if it does not exist
mkdir -p "$BACKUP_DIR"

# 1. Full Database Backup
# Note: under `set -e`, a bare `[ $? -eq 0 ]` check after a command is never
# reached on failure, so the command itself is used as the if-condition.
echo "1. Creating full database backup..."
FULL_BACKUP_FILE="${BACKUP_DIR}/full_${TIMESTAMP}.dump"
if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -v -d "$DB_NAME" -f "$FULL_BACKUP_FILE"; then
  echo "✓ Full backup created: $FULL_BACKUP_FILE"
  FILE_SIZE=$(du -h "$FULL_BACKUP_FILE" | cut -f1)
  echo "  Size: $FILE_SIZE"
else
  echo "✗ Full backup failed!"
  exit 1
fi

# 2. Generate MD5 checksum
echo "2. Generating checksum..."
md5sum "$FULL_BACKUP_FILE" > "${FULL_BACKUP_FILE}.md5"
echo "✓ Checksum saved: ${FULL_BACKUP_FILE}.md5"

# 3. Per-Schema Backups (Multi-Tenant Isolation)
echo "3. Creating per-schema backups..."
SCHEMAS=("auth" "core" "financial" "inventory" "purchase" "sales" "analytics" "projects" "system")

for schema in "${SCHEMAS[@]}"; do
  SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
  echo "  Backing up schema: $schema"

  # A single failed schema is logged but does not abort the whole run
  if pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -Fc -n "$schema" -d "$DB_NAME" -f "$SCHEMA_BACKUP_FILE"; then
    SCHEMA_SIZE=$(du -h "$SCHEMA_BACKUP_FILE" | cut -f1)
    echo "  ✓ $schema backup created ($SCHEMA_SIZE)"
  else
    echo "  ✗ $schema backup failed!"
  fi
done

# 4. Backup Database Roles and Permissions
echo "4. Backing up database roles..."
ROLES_BACKUP_FILE="${BACKUP_DIR}/roles_${TIMESTAMP}.sql"
pg_dumpall -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" --roles-only -f "$ROLES_BACKUP_FILE"
echo "✓ Roles backup created: $ROLES_BACKUP_FILE"

# 5. Backup PostgreSQL Configuration
echo "5. Backing up PostgreSQL configuration..."
CONFIG_BACKUP_DIR="${BACKUP_DIR}/config_${TIMESTAMP}"
mkdir -p "$CONFIG_BACKUP_DIR"

# Copy config files (if accessible)
if [ -f /etc/postgresql/16/main/postgresql.conf ]; then
  cp /etc/postgresql/16/main/postgresql.conf "$CONFIG_BACKUP_DIR/"
  cp /etc/postgresql/16/main/pg_hba.conf "$CONFIG_BACKUP_DIR/"
  echo "✓ Config files backed up"
fi

# 6. Backup WAL Archive Status
echo "6. Recording WAL archive status..."
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -tAc "SELECT * FROM pg_stat_archiver;" > "${BACKUP_DIR}/wal_status_${TIMESTAMP}.txt"

# 7. Upload to Cloud Storage (S3)
if [ -n "${AWS_S3_BUCKET:-}" ]; then
  echo "7. Uploading to S3..."
  aws s3 cp "$FULL_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/" --storage-class STANDARD_IA
  aws s3 cp "${FULL_BACKUP_FILE}.md5" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/"

  # Upload per-schema backups
  for schema in "${SCHEMAS[@]}"; do
    SCHEMA_BACKUP_FILE="${BACKUP_DIR}/${schema}_${TIMESTAMP}.dump"
    if [ -f "$SCHEMA_BACKUP_FILE" ]; then
      aws s3 cp "$SCHEMA_BACKUP_FILE" "s3://${AWS_S3_BUCKET}/postgres/${TIMESTAMP}/schemas/" --storage-class STANDARD_IA
    fi
  done

  echo "✓ Backup uploaded to S3"
fi

# 8. Cleanup Old Backups (Local)
echo "8. Cleaning up old backups (older than $RETENTION_DAYS days)..."
find "$BACKUP_DIR" -type f -name "*.dump" -mtime +"$RETENTION_DAYS" -delete
find "$BACKUP_DIR" -type f -name "*.sql" -mtime +"$RETENTION_DAYS" -delete
find "$BACKUP_DIR" -type f -name "*.md5" -mtime +"$RETENTION_DAYS" -delete
find "$BACKUP_DIR" -type d -name "config_*" -mtime +"$RETENTION_DAYS" -exec rm -rf {} + 2>/dev/null || true
echo "✓ Old backups cleaned up"

# 9. Verify Backup Integrity
echo "9. Verifying backup integrity..."
if md5sum -c "${FULL_BACKUP_FILE}.md5"; then
  echo "✓ Backup integrity verified"
else
  echo "✗ Backup integrity check failed!"
  exit 1
fi

# 10. Send Notification
echo "10. Sending backup notification..."
BACKUP_SIZE=$(du -sh "$BACKUP_DIR" | cut -f1)

# Slack notification (optional)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
  curl -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"✅ PostgreSQL backup completed successfully\n• Database: $DB_NAME\n• Size: $FILE_SIZE\n• Timestamp: $TIMESTAMP\n• Total backup dir size: $BACKUP_SIZE\"}"
fi

echo "===== PostgreSQL Backup Completed at $(date) ====="
echo "Total backup size: $BACKUP_SIZE"
echo "Backup location: $BACKUP_DIR"

exit 0
```

### 3.2 Incremental Backup Script

**File:** `scripts/backup-postgres-incremental.sh`

```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Incremental Backup Script
# NOTE: pg_dump cannot produce true incrementals; this
# takes a lightweight full logical dump every 4 hours.
# For real incrementals, rely on WAL archiving (section
# 3.3) or a physical-backup tool such as pgBackRest.
# =====================================================

set -euo pipefail

BACKUP_DIR="/backups/postgres/incremental"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
DB_HOST="${POSTGRES_HOST:-postgres}"
DB_NAME="${POSTGRES_DB:-erp_generic}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

mkdir -p "$BACKUP_DIR"

echo "===== Incremental Backup Started at $(date) ====="

# Record the timestamp of the last full backup (informational only;
# the fallback keeps `set -euo pipefail` from aborting when none exists)
LAST_FULL_BACKUP=$(ls -t /backups/postgres/full_*.dump 2>/dev/null | head -1 | grep -oP '\d{8}_\d{6}' || echo "none")
echo "Last full backup: $LAST_FULL_BACKUP"

INCREMENTAL_FILE="${BACKUP_DIR}/incremental_${TIMESTAMP}.dump"
pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$DB_NAME" -f "$INCREMENTAL_FILE"

echo "✓ Incremental backup created: $INCREMENTAL_FILE"
echo "===== Incremental Backup Completed at $(date) ====="
```
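The script above recovers the timestamp of the newest full dump by pattern-matching its filename. That extraction can be exercised on its own against a fixture name (GNU `grep -P` assumed):

```shell
# Extract the YYYYMMDD_HHMMSS timestamp embedded in a backup filename
fname="full_20251124_020000.dump"
ts=$(echo "$fname" | grep -oP '\d{8}_\d{6}')
echo "timestamp: $ts"
```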

### 3.3 WAL Archiving Configuration

**File:** `postgresql.conf` (WAL archiving section)

```ini
# WAL Settings for Point-in-Time Recovery
wal_level = replica
archive_mode = on
# ${AWS_S3_BUCKET} is expanded by the shell that runs archive_command, so it
# must be present in the PostgreSQL server's environment. A slow
# archive_command delays WAL recycling; consider archiving locally here and
# syncing to S3 asynchronously instead.
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f && aws s3 cp /backups/wal/%f s3://${AWS_S3_BUCKET}/wal/'
archive_timeout = 60s
max_wal_senders = 3
wal_keep_size = 1GB
```

### 3.4 Cron Schedule

**File:** `/etc/cron.d/erp-backup`

```cron
# ERP Generic Backup Schedule

# Full backup daily at 2:00 AM
0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh >> /var/log/erp-generic/backup.log 2>&1

# Incremental backup every 4 hours
0 */4 * * * root /opt/erp-generic/scripts/backup-postgres-incremental.sh >> /var/log/erp-generic/backup.log 2>&1

# Verify backups daily at 4:00 AM
0 4 * * * root /opt/erp-generic/scripts/verify-backup.sh >> /var/log/erp-generic/backup.log 2>&1

# Cleanup old WAL files daily at 5:00 AM
0 5 * * * root find /backups/wal -type f -mtime +7 -delete

# Weekly restore test (Sundays at 3:00 AM)
0 3 * * 0 root /opt/erp-generic/scripts/test-restore.sh >> /var/log/erp-generic/backup-test.log 2>&1
```
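Unlike user crontabs, `/etc/cron.d` entries carry a user field (here `root`) between the schedule and the command; dropping it when copying a line from `crontab -e` silently breaks the job. A quick check of that sixth column:

```shell
# /etc/cron.d format: minute hour dom month dow USER command...
entry='0 2 * * * root /opt/erp-generic/scripts/backup-postgres.sh'
user_field=$(echo "$entry" | awk '{print $6}')
echo "schedule runs as: $user_field"
```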

---

## 4. MULTI-TENANT BACKUP ISOLATION

### 4.1 Per-Tenant Backup Strategy

**Why Per-Schema Backups?**
- Restore an individual tenant without affecting others
- Compliance: GDPR right to erasure (delete tenant data)
- Tenant migration to a dedicated instance
- Faster restore times for single-tenant issues

**Backup Structure:**
```
/backups/postgres/
├── full_20251124_020000.dump       # All schemas
├── auth_20251124_020000.dump       # Auth schema only
├── core_20251124_020000.dump       # Core schema only
├── financial_20251124_020000.dump  # Financial schema only
├── inventory_20251124_020000.dump  # Inventory schema only
├── purchase_20251124_020000.dump   # Purchase schema only
├── sales_20251124_020000.dump      # Sales schema only
├── analytics_20251124_020000.dump  # Analytics schema only
├── projects_20251124_020000.dump   # Projects schema only
└── system_20251124_020000.dump     # System schema only
```

### 4.2 Restore Single Tenant

```bash
#!/bin/bash
# Restore single tenant (schema isolation)

TENANT_ID="tenant-abc"
SCHEMA_NAME="financial"   # Example: restore financial schema only
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"

echo "Restoring schema $SCHEMA_NAME for tenant $TENANT_ID..."

# Option 1: Drop and restore the entire schema
# (WARNING: affects ALL tenants whose rows live in this schema)
psql -h postgres -U erp_user -d erp_generic -c "DROP SCHEMA IF EXISTS $SCHEMA_NAME CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n "$SCHEMA_NAME" --clean --if-exists "$BACKUP_FILE"

# Option 2: Restore into a scratch database, then copy only this tenant's rows.
# pg_restore cannot rename a schema on the fly, so a separate database is used;
# the scratch database name is arbitrary.
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE erp_restore_scratch;"
pg_restore -h postgres -U erp_user -d erp_restore_scratch -n "$SCHEMA_NAME" "$BACKUP_FILE"

# Copy tenant-specific data across databases (client-side \COPY)
psql -h postgres -U erp_user -d erp_restore_scratch -c \
  "\COPY (SELECT * FROM $SCHEMA_NAME.accounts WHERE tenant_id = '$TENANT_ID') TO '/tmp/tenant_accounts.csv' CSV"
psql -h postgres -U erp_user -d erp_generic -c \
  "\COPY $SCHEMA_NAME.accounts FROM '/tmp/tenant_accounts.csv' CSV"

# Drop the scratch database
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE erp_restore_scratch;"

echo "✓ Tenant $TENANT_ID restored from $SCHEMA_NAME schema"
```

---

## 5. RETENTION POLICY

### 5.1 Retention Rules

**Local Storage (Fast Recovery):**
- **Full Backups:** 7 days
- **Incremental Backups:** 7 days
- **WAL Archives:** 7 days
- **Config Backups:** 30 days
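These local windows all come down to the same `find -mtime +N -delete` idiom used in the backup script. Its cutoff behavior can be sanity-checked on fixture files in a temp directory (GNU `touch -d` assumed):

```shell
# Simulate the 7-day rule: an 8-day-old dump is deleted, a fresh one survives
demo=$(mktemp -d)
touch -d "8 days ago" "$demo/full_old.dump"
touch "$demo/full_new.dump"
find "$demo" -type f -name "*.dump" -mtime +7 -delete
kept=$(ls "$demo")
echo "kept: $kept"
rm -rf "$demo"
```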

**Cloud Storage (Long-term Retention):**
- **Daily Backups:** 30 days (Standard storage)
- **Weekly Backups:** 12 weeks (Standard-IA storage)
- **Monthly Backups:** 12 months (Glacier storage)
- **Annual Backups:** 7 years (Glacier Deep Archive) - Compliance

**Lifecycle Policy (AWS S3 Example):**
```json
{
  "Rules": [
    {
      "Id": "TransitionToIA",
      "Status": "Enabled",
      "Filter": { "Prefix": "postgres/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ]
    },
    {
      "Id": "TransitionToGlacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "postgres/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    },
    {
      "Id": "DeleteOldBackups",
      "Status": "Enabled",
      "Filter": { "Prefix": "postgres/daily/" },
      "Expiration": { "Days": 365 }
    }
  ]
}
```

### 5.2 Compliance Requirements

**GDPR (EU):**
- Right to erasure: ability to delete tenant data
- Backup encryption: AES-256
- Access logging: who accessed which backup

**SOC 2:**
- Backup testing: monthly restore drills
- Access controls: role-based access to backups
- Audit trail: log all backup/restore operations

**Industry-Specific:**
- Healthcare (HIPAA): 6 years retention
- Financial (SOX): 7 years retention
- Government: variable (check local regulations)

---

## 6. RECOVERY PROCEDURES

### 6.1 Full Database Restore

**File:** `scripts/restore-postgres.sh`

```bash
#!/bin/bash
# =====================================================
# ERP GENERIC - PostgreSQL Restore Script
# Restores database from backup file
# =====================================================

set -euo pipefail

# Usage
if [ $# -lt 1 ]; then
  echo "Usage: $0 <backup_file> [--target=<database>] [--no-prompt]"
  echo "Example: $0 /backups/postgres/full_20251124_020000.dump --target=erp_generic_staging"
  exit 1
fi

BACKUP_FILE="$1"
TARGET_DB="erp_generic"
NO_PROMPT=false

# Parse optional arguments (skipping $1, the backup file;
# the target database defaults to erp_generic)
for arg in "${@:2}"; do
  case $arg in
    --target=*)
      TARGET_DB="${arg#*=}"
      ;;
    --no-prompt)
      NO_PROMPT=true
      ;;
  esac
done

DB_HOST="${POSTGRES_HOST:-postgres}"
DB_USER="${POSTGRES_USER:-erp_user}"
PGPASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD

echo "===== PostgreSQL Restore Started at $(date) ====="
echo "Backup file: $BACKUP_FILE"
echo "Target database: $TARGET_DB"

# Verify backup file exists
if [ ! -f "$BACKUP_FILE" ]; then
  echo "✗ Backup file not found: $BACKUP_FILE"
  exit 1
fi

# Verify checksum if present
if [ -f "${BACKUP_FILE}.md5" ]; then
  echo "Verifying backup integrity..."
  if ! md5sum -c "${BACKUP_FILE}.md5"; then
    echo "✗ Backup integrity check failed!"
    exit 1
  fi
  echo "✓ Backup integrity verified"
fi

# Safety prompt (unless --no-prompt)
if [ "$NO_PROMPT" = false ]; then
  echo ""
  echo "⚠️  WARNING: This will OVERWRITE all data in database '$TARGET_DB'"
  echo "⚠️  Make sure you have a recent backup before proceeding!"
  echo ""
  read -p "Are you sure you want to continue? (yes/no): " -r
  if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
    echo "Restore cancelled by user."
    exit 0
  fi
fi

# Create safety backup of target database (if not staging/dev)
if [[ ! "$TARGET_DB" =~ (staging|dev|test) ]]; then
  echo "Creating safety backup of $TARGET_DB before restore..."
  SAFETY_BACKUP="/backups/postgres/safety_${TARGET_DB}_$(date +%Y%m%d_%H%M%S).dump"
  pg_dump -h "$DB_HOST" -U "$DB_USER" -Fc -d "$TARGET_DB" -f "$SAFETY_BACKUP"
  echo "✓ Safety backup created: $SAFETY_BACKUP"
fi

# Terminate active connections to target database
echo "Terminating active connections to $TARGET_DB..."
psql -h "$DB_HOST" -U "$DB_USER" -d postgres <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$TARGET_DB' AND pid <> pg_backend_pid();
SQL

# Restore database (the command doubles as the if-condition; see section 3.1)
echo "Restoring database from $BACKUP_FILE..."
if pg_restore -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" --clean --if-exists --verbose "$BACKUP_FILE"; then
  echo "✓ Database restored successfully"
else
  echo "✗ Restore failed!"
  exit 1
fi

# Verify restore
echo "Verifying restore..."
TABLE_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');")
USER_COUNT=$(psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -tAc "SELECT COUNT(*) FROM auth.users;" 2>/dev/null || echo "N/A")

echo "Tables restored: $TABLE_COUNT"
echo "Users in auth.users: $USER_COUNT"

# Rebuild statistics
echo "Rebuilding database statistics..."
psql -h "$DB_HOST" -U "$DB_USER" -d "$TARGET_DB" -c "ANALYZE;"
echo "✓ Statistics rebuilt"

echo "===== PostgreSQL Restore Completed at $(date) ====="
echo ""
echo "Next steps:"
echo "1. Verify application functionality"
echo "2. Run smoke tests: npm run test:smoke"
echo "3. Check logs for errors"
echo "4. Notify team that restore is complete"

exit 0
```

### 6.2 Schema-Level Restore

```bash
# Restore single schema
SCHEMA="financial"
BACKUP_FILE="/backups/postgres/financial_20251124_020000.dump"

pg_restore -h postgres -U erp_user -d erp_generic -n "$SCHEMA" --clean --if-exists "$BACKUP_FILE"
```

### 6.3 Table-Level Restore

```bash
# Restore single table
# (pg_restore's -t takes the bare table name; the schema goes in -n)
SCHEMA="auth"
TABLE="users"
BACKUP_FILE="/backups/postgres/full_20251124_020000.dump"

# --data-only appends rows: truncate the table first to avoid duplicates
psql -h postgres -U erp_user -d erp_generic -c "TRUNCATE $SCHEMA.$TABLE CASCADE;"
pg_restore -h postgres -U erp_user -d erp_generic -n "$SCHEMA" -t "$TABLE" --data-only "$BACKUP_FILE"
```

---

## 7. POINT-IN-TIME RECOVERY (PITR)

### 7.1 PITR Process

**Use Case:** Restore the database to a specific point in time (e.g., just before an accidental DELETE)

**Requirements:**
- Base backup (a physical backup, e.g., one previously taken with `pg_basebackup`)
- WAL archives covering the interval from the base backup to the target time

**Steps:**

```bash
#!/bin/bash
# Point-in-Time Recovery (PITR)
# Note: recovery.conf was removed in PostgreSQL 12. On PostgreSQL 16,
# recovery settings go into postgresql.auto.conf plus a recovery.signal
# marker file in the data directory.

TARGET_TIME="2025-11-24 14:30:00"
BASE_BACKUP="/backups/postgres/base_backup.tar.gz"   # physical base backup archive
WAL_ARCHIVE_DIR="/backups/wal"

echo "===== Point-in-Time Recovery to $TARGET_TIME ====="

# 1. Stop PostgreSQL
docker-compose stop postgres

# 2. Move the current data directory aside (safety)
mv /data/postgres /data/postgres_backup_$(date +%Y%m%d_%H%M%S)

# 3. Restore the base backup into a fresh data directory
mkdir -p /data/postgres
tar -xzf "$BASE_BACKUP" -C /data/postgres

# 4. Configure recovery
cat >> /data/postgres/postgresql.auto.conf <<EOF
restore_command = 'cp ${WAL_ARCHIVE_DIR}/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch /data/postgres/recovery.signal

# 5. Start PostgreSQL (it will replay WAL up to the target time)
docker-compose start postgres

# 6. Monitor recovery
tail -f /var/log/postgresql/postgresql-*.log | grep -i recovery

echo "Recovery complete! Database restored to $TARGET_TIME"
```

### 7.2 PITR Best Practices

- **Test PITR quarterly** in a non-production environment
- **Document recovery times** (usually 30 min - 2 hours depending on WAL volume)
- **Automate WAL archiving** to S3/cloud storage
- **Monitor WAL archive lag** (should be <5 minutes)

---

## 8. DISASTER RECOVERY PLAYBOOK

### 8.1 Disaster Scenarios

**Scenario 1: Complete Data Center Failure**

**Symptoms:**
- All servers unreachable
- Network connectivity lost
- Physical infrastructure damaged

**Recovery Steps:**

```bash
# 1. Provision new infrastructure (AWS, Azure, GCP)
terraform apply -var-file=disaster-recovery.tfvars

# 2. Download latest backup from cloud storage
# (backups live under dated prefixes, so match full_* anywhere in the key)
aws s3 sync s3://erp-generic-backups-prod/postgres/ /backups/postgres/ --exclude "*" --include "*full_*"

# 3. Deploy Docker containers
docker-compose -f docker-compose.prod.yml up -d postgres redis

# 4. Restore database (sync preserves the TIMESTAMP/ subdirectory layout)
LATEST_BACKUP=$(ls -t /backups/postgres/*/full_*.dump | head -1)
./scripts/restore-postgres.sh "$LATEST_BACKUP" --no-prompt

# 5. Restore WAL archives for PITR
aws s3 sync s3://erp-generic-backups-prod/wal/ /backups/wal/

# 6. Deploy application
docker-compose -f docker-compose.prod.yml up -d backend frontend nginx

# 7. Update DNS (point to new infrastructure)
# Manual step or use Route53/CloudFlare API

# 8. Verify functionality
./scripts/health-check.sh
npm run test:smoke

# 9. Notify stakeholders
# Send email/Slack notification
```

**Estimated RTO:** 4 hours
**Estimated RPO:** 15 minutes (last WAL archive)

---

**Scenario 2: Accidental Data Deletion**

**Symptoms:**
- Critical data deleted by user error
- e.g., `DELETE FROM sales.orders` executed without a WHERE clause

**Recovery Steps:**

```bash
# 1. Identify the deletion time from the audit log
psql -h postgres -U erp_user -d erp_generic -c \
  "SELECT * FROM system.audit_logs WHERE event = 'DELETE' AND table_name = 'sales.orders' ORDER BY created_at DESC LIMIT 10;"

# 2. Use PITR to restore a copy to just before the deletion
./scripts/pitr-restore.sh --target-time="2025-11-24 14:25:00"

# 3. Export the deleted records from the restored database
# (\COPY writes to the client's filesystem, so the file is usable in step 4)
psql -h restored-db -U erp_user -d erp_generic -c "\COPY (SELECT * FROM sales.orders WHERE created_at >= '2025-11-24 14:00:00') TO '/tmp/recovered_orders.csv' CSV HEADER;"

# 4. Import the recovered records into production
psql -h postgres -U erp_user -d erp_generic -c "\COPY sales.orders FROM '/tmp/recovered_orders.csv' CSV HEADER;"

echo "✓ Deleted records recovered"
```

**Estimated RTO:** 1 hour
**Estimated RPO:** 0 (no data loss if caught quickly)
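Before step 4 re-imports the export, it is cheap to confirm it actually captured rows. A standalone sketch of that check on a fixture CSV (the real file would be `/tmp/recovered_orders.csv`):

```shell
# Count data rows in a recovered CSV export (header excluded)
csv=$(mktemp)
printf 'id,total\n1001,150.00\n1002,99.50\n' > "$csv"   # stand-in for the export
rows=$(( $(wc -l < "$csv") - 1 ))                       # subtract the header line
echo "recovered rows: $rows"
rm -f "$csv"
```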

---

**Scenario 3: Database Corruption**

**Symptoms:**
- PostgreSQL fails to start
- "corrupt page" errors in logs
- Data inconsistencies

**Recovery Steps:**

```bash
# 1. LAST RESORT ONLY: pg_resetwal discards WAL and can silently lose or
#    corrupt data. Prefer restoring from backup; only run this to bring a
#    corrupt cluster up long enough to dump what remains.
docker-compose exec postgres pg_resetwal /var/lib/postgresql/data

# 2. Preferred path: restore from backup
./scripts/restore-postgres.sh /backups/postgres/full_20251124_020000.dump

# 3. Run VACUUM and ANALYZE
psql -h postgres -U erp_user -d erp_generic -c "VACUUM FULL; ANALYZE;"

# 4. Rebuild indexes
psql -h postgres -U erp_user -d erp_generic -c "REINDEX DATABASE erp_generic;"
```

---

## 9. BACKUP TESTING

### 9.1 Periodic Restore Test

**File:** `scripts/test-restore.sh`

```bash
#!/bin/bash
# Automated backup restore test (scheduled weekly via cron; see section 3.4)

set -euo pipefail

BACKUP_FILE=$(ls -t /backups/postgres/full_*.dump | head -1)
TEST_DB="erp_generic_restore_test"

echo "===== Backup Restore Test ====="
echo "Backup: $BACKUP_FILE"

# 1. Drop test database if it exists
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"

# 2. Create test database
psql -h postgres -U erp_user -d postgres -c "CREATE DATABASE $TEST_DB;"

# 3. Restore backup
pg_restore -h postgres -U erp_user -d "$TEST_DB" --clean --if-exists "$BACKUP_FILE"

# 4. Run smoke tests
psql -h postgres -U erp_user -d "$TEST_DB" <<SQL
-- Verify table counts
SELECT COUNT(*) AS user_count FROM auth.users;
SELECT COUNT(*) AS partner_count FROM core.partners;
SELECT COUNT(*) AS order_count FROM sales.orders;

-- Verify data integrity
SELECT 'OK' AS status WHERE (
  SELECT COUNT(*) FROM auth.users
) > 0;
SQL

# 5. Cleanup
psql -h postgres -U erp_user -d postgres -c "DROP DATABASE $TEST_DB;"

echo "✓ Backup restore test passed"
```

### 9.2 Quarterly DR Drill

**Checklist:**
- [ ] Provision new infrastructure (staging environment)
- [ ] Restore from cloud backup (S3)
- [ ] Verify all 9 schemas restored
- [ ] Run full test suite (unit + integration + E2E)
- [ ] Measure RTO (actual time to restore)
- [ ] Measure RPO (amount of data lost)
- [ ] Document findings and improvements
- [ ] Update DR playbook

**Success Criteria:**
- RTO < 4 hours
- RPO < 15 minutes
- All tests passing
- Zero critical data loss
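Measuring RTO during a drill needs nothing more than wrapping the restore commands in a timer; a trivial sketch (the `sleep` stands in for the actual provision/restore/verify steps):

```shell
# Record elapsed wall-clock time for the drill
drill_start=$(date +%s)
sleep 1                     # placeholder for: provision, restore, verify
rto_sec=$(( $(date +%s) - drill_start ))
echo "measured RTO: ${rto_sec}s"
```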

---

## 10. REFERENCES

**Internal Documentation:**
- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Monitoring & Observability](./MONITORING-OBSERVABILITY.md)
- [Database Schemas](../02-modelado/database-design/schemas/)

**External Resources:**
- [PostgreSQL Backup Documentation](https://www.postgresql.org/docs/16/backup.html)
- [PostgreSQL PITR](https://www.postgresql.org/docs/16/continuous-archiving.html)
- [AWS RDS Backup Best Practices](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html)

---

**Document:** BACKUP-RECOVERY.md
**Version:** 1.0
**Last updated:** 2025-11-24