# DevOps Documentation - ERP Generic
**Last updated:** 2025-11-24
**Owner:** DevOps Team
**Status:** ✅ Production-Ready
---
## 1. OVERVIEW
This folder contains all the DevOps documentation needed to deploy, monitor, maintain, and secure ERP Generic in production environments.
ERP Generic is a modular system with:
- **14 modules** (MGN-001 to MGN-014)
- **Stack:** NestJS 10 + Prisma 5 + PostgreSQL 16 + Redis 7 + React 18 + TypeScript 5
- **Multi-tenancy:** Schema-level isolation + Row-Level Security (RLS)
- **9 PostgreSQL schemas:** auth, core, financial, inventory, purchase, sales, analytics, projects, system
- **Architecture:** Microservices-ready, Cloud-native, Container-based
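
The nine schemas listed above have to exist before migrations run. The following is a minimal sketch of an idempotent initializer, not the project's actual init script: the function name `generate_schema_sql` and the `DATABASE_URL` variable are assumptions made for illustration.

```bash
#!/usr/bin/env bash

# The nine schema names listed in the overview.
SCHEMAS=(auth core financial inventory purchase sales analytics projects system)

# Print one idempotent CREATE SCHEMA statement per schema.
generate_schema_sql() {
  local schema
  for schema in "${SCHEMAS[@]}"; do
    printf 'CREATE SCHEMA IF NOT EXISTS %s;\n' "$schema"
  done
}

# Usage (assumes DATABASE_URL points at the target database):
#   generate_schema_sql | psql "$DATABASE_URL" -v ON_ERROR_STOP=1
```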
---
## 2. MAIN DOCUMENTS
### 2.1 [DEPLOYMENT-GUIDE.md](./DEPLOYMENT-GUIDE.md)
**Purpose:** Complete deployment guide for all environments.
**Contents:**
- Docker setup completo (Dockerfile + docker-compose.yml)
- PostgreSQL 16 initialization (9 schemas)
- Redis configuration
- Environment variables management
- Multi-environment deployment strategy (Dev, QA, Staging, Production)
- Zero-downtime deployment (Blue-green)
- Rollback procedures
**Audience:** DevOps Engineers, SREs, Infrastructure Team
**Implementation time:** 4-6 hours the first time, 15-30 min for subsequent deployments
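
As an illustration of environment variable management, a deployment can refuse to start when required settings are missing. This is a sketch under assumptions: the variable names in `REQUIRED_VARS` are hypothetical placeholders and should be replaced with the keys actually defined in `.env.example`.

```bash
#!/usr/bin/env bash

# Hypothetical list of required variables; replace with the keys in .env.example.
REQUIRED_VARS=(DATABASE_URL REDIS_URL JWT_SECRET)

# Print the names of required variables that are unset or empty.
# Returns 0 when everything is set, 1 otherwise.
missing_env_vars() {
  local name missing=0
  for name in "${REQUIRED_VARS[@]}"; do
    # ${!name} is bash indirect expansion: the value of the variable named $name.
    if [ -z "${!name:-}" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: load .env into the environment, then abort on any gap.
#   set -a; . ./.env; set +a
#   missing_env_vars || { echo "fix the variables listed above" >&2; exit 1; }
```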
---
### 2.2 [MONITORING-OBSERVABILITY.md](./MONITORING-OBSERVABILITY.md)
**Purpose:** Complete monitoring and observability strategy.
**Contents:**
- Prometheus setup (metrics collection)
- Grafana dashboards (Application, Database, Business)
- Alert rules (CPU, memory, DB connections, error rate)
- Logging strategy (Winston + ELK/Loki)
- Application Performance Monitoring (APM)
- Health check endpoints
- Distributed tracing (OpenTelemetry)
**Audience:** DevOps Engineers, SREs, On-call Engineers
**Implementation time:** 6-8 hours for initial setup
---
### 2.3 [BACKUP-RECOVERY.md](./BACKUP-RECOVERY.md)
**Purpose:** Backup and disaster recovery procedures.
**Contents:**
- Backup strategy (Full + incremental)
- Automated backup scripts (PostgreSQL multi-schema)
- Multi-tenant backup isolation
- Retention policies (7 days + 4 weeks + 12 months)
- Point-in-Time Recovery (PITR)
- Disaster recovery playbook (RTO 4h, RPO 15min)
- Backup testing procedures
**Audience:** DevOps Engineers, DBAs, Security Team
**Implementation time:** 4-5 hours for initial setup, plus monthly testing
---
### 2.4 [SECURITY-HARDENING.md](./SECURITY-HARDENING.md)
**Purpose:** Complete security hardening of the system.
**Contents:**
- OWASP Top 10 mitigations
- Rate limiting configuration
- JWT security (rotation, expiration, refresh tokens)
- SQL injection prevention
- XSS/CSRF protection
- CORS configuration
- Security headers (Helmet.js)
- Secrets management (Vault/AWS Secrets Manager)
- SSL/TLS certificate management
**Audience:** Security Team, DevOps Engineers, Backend Developers
**Implementation time:** 8-10 hours for a complete implementation
---
### 2.5 [CI-CD-PIPELINE.md](./CI-CD-PIPELINE.md)
**Purpose:** Complete continuous integration and deployment pipeline.
**Contents:**
- GitHub Actions workflows (CI, CD-QA, CD-Production)
- Automated testing integration (Jest + Vitest + Playwright)
- Code quality gates (SonarQube)
- Security scanning (Snyk + OWASP Dependency Check)
- Docker build & push to registry
- Automated deployment (QA auto, Production manual approval)
- Rollback automation
- Notifications (Slack/Discord)
**Audience:** DevOps Engineers, Tech Lead, Development Team
**Implementation time:** 10-12 hours for initial setup
---
## 3. SCRIPTS
### 3.1 [scripts/backup-postgres.sh](./scripts/backup-postgres.sh)
Automated PostgreSQL backup script with multi-tenant support.
**Features:**
- Full backup + per-schema backups
- Automatic compression
- Retention policy (7 days)
- Optional upload to S3/Cloud Storage
- Logging and notifications
**Execution:** Daily cron job at 2:00 AM
```bash
0 2 * * * /opt/erp-generic/scripts/backup-postgres.sh
```
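
The core of such a backup script can be sketched as follows. This is an illustrative outline, not the contents of `backup-postgres.sh`: the function names, the per-schema file naming, and the backup directory are assumptions. It relies on `pg_dump --format=custom`, which compresses by default, and on `find -mtime` for the 7-day retention.

```bash
#!/usr/bin/env bash

SCHEMAS=(auth core financial inventory purchase sales analytics projects system)

# Build a file name such as full_20251124_020000.dump.
backup_name() {
  local scope="$1" stamp="$2"
  printf '%s_%s.dump\n' "$scope" "$stamp"
}

# Print the pg_dump commands for one full backup plus one per schema.
# Printing instead of executing keeps the sketch safe to run anywhere;
# pipe the output to `bash` to execute it.
plan_backup() {
  local db="$1" dir="$2" stamp="$3" schema
  echo "pg_dump --format=custom --file=$dir/$(backup_name full "$stamp") $db"
  for schema in "${SCHEMAS[@]}"; do
    echo "pg_dump --format=custom --schema=$schema --file=$dir/$(backup_name "$schema" "$stamp") $db"
  done
}

# Delete dump files last modified more than DAYS days ago (default 7).
# find rounds file age down to whole days before comparing.
prune_backups() {
  local dir="$1" days="${2:-7}"
  find "$dir" -maxdepth 1 -type f -name '*.dump' -mtime +"$days" -delete
}

# Usage (database name and directory are placeholders):
#   plan_backup erp /var/backups/erp "$(date +%Y%m%d_%H%M%S)" | bash
#   prune_backups /var/backups/erp 7
```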
---
### 3.2 [scripts/restore-postgres.sh](./scripts/restore-postgres.sh)
Backup restore script with validation and verification.
**Features:**
- Full or per-schema restore
- Integrity validation before restoring
- Safety backup (creates a snapshot before restoring)
- Dry-run mode for testing
- Detailed logging
**Execution:** Manual (disaster recovery)
```bash
./restore-postgres.sh --backup=full_20251124_020000.dump --target=staging
```
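
The option handling and dry-run behaviour described above might look like the sketch below. It is not the actual `restore-postgres.sh`: the `--dry-run` flag name, the function names, and passing the database URL as an argument are assumptions, and the pre-restore safety snapshot is omitted for brevity.

```bash
#!/usr/bin/env bash

# Parse --backup=FILE, --target=ENV and --dry-run into global variables.
parse_restore_args() {
  BACKUP_FILE="" TARGET_ENV="" DRY_RUN=0
  local arg
  for arg in "$@"; do
    case "$arg" in
      --backup=*) BACKUP_FILE="${arg#--backup=}" ;;
      --target=*) TARGET_ENV="${arg#--target=}" ;;
      --dry-run)  DRY_RUN=1 ;;
      *) echo "unknown option: $arg" >&2; return 2 ;;
    esac
  done
  if [ -z "$BACKUP_FILE" ] || [ -z "$TARGET_ENV" ]; then
    echo "usage: restore-postgres.sh --backup=FILE --target=ENV [--dry-run]" >&2
    return 2
  fi
}

# Validate the archive, then restore it (or only print the command in dry-run mode).
restore_backup() {
  local db_url="$1"
  if [ "$DRY_RUN" -eq 0 ]; then
    # pg_restore --list reads the archive's table of contents without touching
    # any database, so it fails if the file is not a readable archive.
    pg_restore --list "$BACKUP_FILE" >/dev/null || return 1
  fi
  local cmd=(pg_restore --clean --if-exists --dbname="$db_url" "$BACKUP_FILE")
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "[dry-run] ${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage (mapping TARGET_ENV to a connection URL is left to the real script):
#   parse_restore_args "$@" && restore_backup "$STAGING_DATABASE_URL"
```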
---
### 3.3 [scripts/health-check.sh](./scripts/health-check.sh)
Complete system health check script.
**Features:**
- Checks the backend API (/health)
- Checks PostgreSQL (connection + queries)
- Checks Redis (connection + ping)
- Checks the frontend (HTTP 200)
- Exit codes for monitoring
- Structured logging
**Execution:** Cron every 5 minutes; also used by Kubernetes liveness/readiness probes
```bash
*/5 * * * * /opt/erp-generic/scripts/health-check.sh
```
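
One way to structure these checks so that every component is reported and the exit code reflects the overall result is sketched below. It is an assumption-laden outline rather than the real `health-check.sh`: the hosts, ports, URLs, and the `key=value` log format are placeholders.

```bash
#!/usr/bin/env bash

FAILED=0

# Run one named check; log the result and count failures instead of aborting,
# so a single run reports the state of every component.
run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "status=ok check=$name"
  else
    echo "status=fail check=$name"
    FAILED=$((FAILED + 1))
  fi
}

# Hosts, ports and URLs below are placeholders, not values from this project.
health_check() {
  FAILED=0
  run_check backend  curl -fsS --max-time 5 "${API_URL:-http://localhost:3000}/health"
  run_check postgres pg_isready -h "${PGHOST:-localhost}" -p "${PGPORT:-5432}"
  run_check redis    redis-cli -h "${REDIS_HOST:-localhost}" ping
  run_check frontend curl -fsS --max-time 5 "${FRONTEND_URL:-http://localhost:8080}"
  # Return status 0 = all healthy, 1 = at least one check failed.
  [ "$FAILED" -eq 0 ]
}

# Usage: call health_check as the last command of the script so that cron
# wrappers and probes see its status as the script's exit code.
```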
---
## 4. QUICK START
### Primer Deployment (Fresh Install)
```bash
# 1. Clone repository
git clone https://github.com/company/erp-generic.git
cd erp-generic
# 2. Configure environment variables
cp .env.example .env
# Edit .env with real values
# 3. Start services with Docker Compose
docker-compose up -d
# 4. Run database migrations
docker-compose exec backend npm run prisma:migrate:deploy
# 5. Seed initial data
docker-compose exec backend npm run seed:initial
# 6. Verify health
./scripts/health-check.sh
```
**Total time:** 15-20 minutes
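
Containers started in step 3 may need a few seconds before they accept connections, so the final health check can fail on a fresh install even when nothing is wrong. A small retry helper, sketched here with a hypothetical function name `retry`, makes that step more tolerant.

```bash
#!/usr/bin/env bash

# Retry a command until it succeeds or the attempt limit is reached.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: give the stack about a minute to pass the health check.
#   retry 12 5 ./scripts/health-check.sh
```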
---
### Update Deployment (Existing System)
```bash
# 1. Pull latest changes
git pull origin main
# 2. Backup database (safety)
./scripts/backup-postgres.sh
# 3. Build new images
docker-compose build
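# 4. Run migrations from the newly built image (exec would use the old running container)
docker-compose run --rm backend npm run prisma:migrate:deploy
# 5. Recreate services one at a time; with a single container per service
#    expect a brief restart (see DEPLOYMENT-GUIDE.md for blue-green)
docker-compose up -d --no-deps backend
docker-compose up -d --no-deps frontend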
# 6. Verify health
./scripts/health-check.sh
# 7. Run smoke tests
npm run test:smoke
```
**Total time:** 5-10 minutes
---
## 5. ENVIRONMENTS
| Environment | URL | Deploy Method | Database | Purpose |
|----------|-----|---------------|----------|---------|
| **Development** | http://localhost:3000 | Manual (local) | PostgreSQL local | Local development |
| **CI/CD** | - | Auto (GitHub Actions) | PostgreSQL (TestContainers) | Automated testing |
| **QA** | https://qa.erp-generic.local | Auto (push to develop) | PostgreSQL (anonymized prod) | Manual QA testing |
| **Staging** | https://staging.erp-generic.com | Manual (approval) | PostgreSQL (prod clone) | Pre-release validation |
| **Production** | https://erp-generic.com | Manual (approval) | PostgreSQL (prod) | Live system |
---
## 6. SLA AND TARGETS
### 6.1 Availability Targets
- **Uptime:** 99.9% (at most 8.76 hours of downtime per year)
- **Planned Maintenance Window:** Saturdays 2:00-4:00 AM (48-hour advance notice)
- **Unplanned Downtime:** <30 min/month
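
The 8.76-hour figure follows directly from the uptime target: 0.1% of the 8760 hours in a 365-day year. The helper below, whose name is chosen for illustration only, reproduces that arithmetic for any target and period.

```bash
#!/usr/bin/env bash

# Allowed downtime in hours for an availability target (percent) over a period (hours).
downtime_budget_hours() {
  local availability_pct="$1" period_hours="$2"
  awk -v a="$availability_pct" -v h="$period_hours" \
    'BEGIN { printf "%.2f\n", (100 - a) / 100 * h }'
}

downtime_budget_hours 99.9 8760   # one 365-day year  → 8.76
downtime_budget_hours 99.9 720    # one 30-day month → 0.72
```

A 30-day month therefore allows 0.72 hours, or about 43 minutes, so the unplanned-downtime target of under 30 minutes per month is stricter than the 99.9% figure alone requires.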
### 6.2 Performance Targets
- **API Response Time:** p50 <100ms, p95 <300ms, p99 <500ms
- **Page Load Time:** p95 <2s (First Contentful Paint)
- **Database Query Time:** p95 <50ms
- **Throughput:** >1000 req/s @ peak load
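
To check measured latencies against the percentile targets above, a nearest-rank percentile is enough for a quick verification. This sketch assumes one latency value in milliseconds per line on standard input and an integer percentile; the function name and the input file in the usage line are hypothetical.

```bash
#!/usr/bin/env bash

# Nearest-rank percentile: reads one latency (ms) per line on stdin and
# prints the smallest sample such that at least P percent of samples are <= it.
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Usage: fail when p95 API latency reaches the 300 ms target.
#   [ "$(percentile 95 < api-latencies-ms.txt)" -lt 300 ]
```

Production dashboards would normally derive these values from Prometheus histograms; the script is only meant for spot checks on raw samples.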
### 6.3 Recovery Targets
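- **RTO (Recovery Time Objective):** 4 hours
- **RPO (Recovery Point Objective):** 15 minutes (relies on Point-in-Time Recovery; scheduled backups alone would only bound data loss to the backup interval)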
- **Backup Frequency:** Full daily (2:00 AM) + Incremental every 4 hours
- **Backup Retention:** 7 daily + 4 weekly + 12 monthly
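
The 7 + 4 + 12 retention scheme needs a rule for deciding which tier a given backup belongs to. The document does not state that rule, so the sketch below assumes one: backups taken on the 1st of a month count as monthly, Sunday backups as weekly, and all others as daily. It requires GNU `date`.

```bash
#!/usr/bin/env bash

# Classify a backup date (YYYY-MM-DD) into a retention tier.
# Assumed rule (not stated in this document): 1st of month = monthly,
# Sunday = weekly, anything else = daily. Requires GNU date for -d.
retention_tier() {
  local day="$1" dom dow
  dom="$(date -d "$day" +%d)"   # day of month, 01-31
  dow="$(date -d "$day" +%u)"   # ISO weekday, 1 = Monday ... 7 = Sunday
  if [ "$dom" = "01" ]; then
    echo monthly
  elif [ "$dow" = "7" ]; then
    echo weekly
  else
    echo daily
  fi
}

# How many backups of each tier to keep under the 7 + 4 + 12 policy.
retention_count() {
  case "$1" in
    daily)   echo 7 ;;
    weekly)  echo 4 ;;
    monthly) echo 12 ;;
    *) return 1 ;;
  esac
}

retention_tier 2025-11-24   # a Monday → daily
```

A pruning job would group existing backups by tier and keep the newest `retention_count` entries in each group.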
### 6.4 Security Targets
- **Critical Vulnerabilities:** Fix within 24 hours
- **High Vulnerabilities:** Fix within 7 days
- **Security Scans:** Daily (automated in CI/CD)
- **Penetration Testing:** Quarterly (external vendor)
- **Security Audits:** Bi-annual (compliance)
---
## 7. INCIDENT RESPONSE
### 7.1 Severity Levels
| Severity | Description | Response Time | Resolution Time |
|----------|-------------|---------------|-----------------|
| **P0 (Critical)** | System down, data loss | 15 min | 4 hours |
| **P1 (High)** | Major feature broken | 1 hour | 24 hours |
| **P2 (Medium)** | Minor feature broken | 4 hours | 72 hours |
| **P3 (Low)** | Cosmetic issue | 24 hours | Next sprint |
### 7.2 On-Call Rotation
- **Primary On-Call:** DevOps Engineer (24/7)
- **Secondary On-Call:** Backend Tech Lead
- **Escalation:** CTO
### 7.3 Incident Procedure
1. **Detection:** Alerts via Prometheus/Grafana → PagerDuty
2. **Acknowledge:** On-call engineer acknowledges within 15 min
3. **Assess:** Determine severity level
4. **Mitigate:** Apply immediate fix or rollback
5. **Communicate:** Update status page + notify stakeholders
6. **Resolve:** Permanent fix deployed
7. **Post-Mortem:** Document lessons learned (within 48 hours)
---
## 8. MAINTENANCE WINDOWS
### 8.1 Regular Maintenance
**Frequency:** Monthly (first Saturday of the month)
**Schedule:** 2:00-4:00 AM (server timezone)
**Notification:** 48 hours in advance via email + in-app banner
**Typical activities:**
- Database maintenance (VACUUM, ANALYZE, REINDEX)
- SSL certificate renewal
- OS security patches
- PostgreSQL minor version updates
- Log rotation and cleanup
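
The database maintenance item can be scripted with PostgreSQL's client utilities `vacuumdb` and `reindexdb`. The sketch below is illustrative only: the function names, the `DRY_RUN` switch, and the database name are assumptions, and connection settings are expected to come from the usual `PG*` environment variables.

```bash
#!/usr/bin/env bash

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" -eq 1 ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Monthly database maintenance for one database.
db_maintenance() {
  local db="$1"
  # VACUUM + ANALYZE in one pass: reclaims dead rows and refreshes planner statistics.
  run vacuumdb --analyze --dbname="$db"
  # Rebuild indexes without blocking writes (REINDEX CONCURRENTLY, PostgreSQL 12+).
  run reindexdb --concurrently --dbname="$db"
}
```

Running `DRY_RUN=1 db_maintenance erp` prints the two commands without executing them, which is useful for reviewing the plan before the maintenance window.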
### 8.2 Emergency Maintenance
**Criteria:** Critical security vulnerability (P0)
**Notification:** At least 2 hours in advance
**Approval:** CTO + Product Owner
---
## 9. CONTACT INFORMATION
### 9.1 Teams
**DevOps Team:**
- Email: devops@erp-generic.com
- Slack: #devops-team
- On-Call: +1-XXX-XXX-XXXX (PagerDuty)
**Security Team:**
- Email: security@erp-generic.com
- Slack: #security-alerts
- Incident: security-incident@erp-generic.com
**Database Team:**
- Email: dba@erp-generic.com
- Slack: #database-team
**Development Team:**
- Email: dev@erp-generic.com
- Slack: #development
### 9.2 Escalation Path
1. **L1:** On-Call DevOps Engineer
2. **L2:** Backend Tech Lead + DBA
3. **L3:** CTO + Infrastructure Manager
4. **L4:** CEO (only for business-critical incidents)
---
## 10. TOOLS AND ACCESS
### 10.1 Infrastructure
- **Cloud Provider:** AWS / Azure / GCP (TBD)
- **Container Registry:** Docker Hub / AWS ECR / GitHub Container Registry
- **CI/CD:** GitHub Actions
- **Secrets Management:** HashiCorp Vault / AWS Secrets Manager
### 10.2 Monitoring & Observability
- **APM:** Prometheus + Grafana
- **Logging:** Winston + ELK Stack (Elasticsearch + Logstash + Kibana) / Grafana Loki
- **Alerting:** Prometheus Alertmanager → PagerDuty
- **Uptime Monitoring:** UptimeRobot / Pingdom
- **Error Tracking:** Sentry
### 10.3 Security
- **SAST:** Snyk, SonarQube
- **DAST:** OWASP ZAP
- **Dependency Scanning:** Snyk, npm audit
- **Secret Scanning:** GitGuardian, TruffleHog
- **Penetration Testing:** External vendor (quarterly)
### 10.4 Collaboration
- **Project Management:** Jira
- **Documentation:** Confluence
- **Chat:** Slack / Microsoft Teams
- **Video:** Zoom / Google Meet
- **On-Call:** PagerDuty
---
## 11. COMPLIANCE & AUDITING
### 11.1 Standards
- **GDPR:** Data protection and privacy (EU)
- **CCPA:** California Consumer Privacy Act
- **SOC 2 Type II:** Security, availability, processing integrity (target)
- **ISO 27001:** Information security management (target)
### 11.2 Audit Logs
- **Database Audit:** pgaudit extension enabled
- **Application Audit:** Winston structured logging + ELK
- **Infrastructure Audit:** AWS CloudTrail / Azure Activity Log
- **Retention:** 1 year (compliance requirement)
### 11.3 Data Residency
- **Primary Region:** us-east-1 (Virginia) / eu-west-1 (Ireland) - TBD
- **Backup Region:** us-west-2 (Oregon) / eu-central-1 (Frankfurt) - TBD
- **Data Sovereignty:** EU data stays in EU (GDPR compliance)
---
## 12. CHANGELOG
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-11-24 | DevOps Architect | Complete initial documentation |
| | | | |
| | | | |
---
## 13. REFERENCES
**Related Documentation:**
- [Test Plans](../04-test-plans/MASTER-TEST-PLAN.md)
- [Architecture Decision Records](../adr/)
- [Database Schemas](../02-modelado/database-design/schemas/)
- [User Stories](../03-user-stories/)
**External References:**
- [Docker Documentation](https://docs.docker.com/)
- [PostgreSQL 16 Documentation](https://www.postgresql.org/docs/16/)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [OWASP Top 10](https://owasp.org/www-project-top-ten/)
- [12-Factor App Methodology](https://12factor.net/)
---
## 14. LICENSE AND COPYRIGHT
**Copyright © 2025 ERP Generic Team. All rights reserved.**
This documentation is confidential and intended solely for internal use by the ERP Generic development and operations teams.
**Classification:** Internal Use Only
**Retention:** Permanent (update with each release)
---
**Document:** README.md
**Location:** `/projects/erp-generic/docs/05-devops/`
**Next review:** 2025-12-24 (monthly)