# MONITORING & OBSERVABILITY - ERP Generic

**Last updated:** 2025-11-24
**Owner:** DevOps Team / SRE Team
**Status:** ✅ Production-Ready

---

## TABLE OF CONTENTS

1. [Overview](#1-overview)
2. [Observability Pillars](#2-observability-pillars)
3. [Prometheus Setup](#3-prometheus-setup)
4. [Grafana Dashboards](#4-grafana-dashboards)
5. [Alert Rules](#5-alert-rules)
6. [Logging Strategy](#6-logging-strategy)
7. [Application Performance Monitoring (APM)](#7-application-performance-monitoring-apm)
8. [Health Checks](#8-health-checks)
9. [Distributed Tracing](#9-distributed-tracing)
10. [On-Call & Incident Response](#10-on-call--incident-response)
11. [References](#11-references)

---

## 1. OVERVIEW

### 1.1 Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Application Layer                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│  │ Backend  │  │ Frontend │  │ Postgres │  │  Redis   │             │
│  │ (Metrics)│  │ (Metrics)│  │(Exporter)│  │(Exporter)│             │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘             │
│       │             │             │             │                   │
│       └─────────────┴─────────────┴─────────────┘                   │
│                            │                                        │
└────────────────────────────┼────────────────────────────────────────┘
                             │ (Scrape metrics every 15s)
                             ↓
┌─────────────────────────────────────────────────────────────────────┐
│                        Prometheus (TSDB)                            │
│  - Collects metrics from all targets                                │
│  - Evaluates alert rules                                            │
│  - Stores time-series data (15 days retention)                      │
└────────┬────────────────────────────────┬───────────────────────────┘
         │                                │
         │ (Query metrics)                │ (Send alerts)
         ↓                                ↓
┌─────────────────────┐        ┌──────────────────────┐
│      Grafana        │        │    Alertmanager      │
│  - Dashboards       │        │  - Route alerts      │
│  - Visualization    │        │  - Deduplication     │
│  - Alerting         │        │  - Silencing         │
└─────────────────────┘        └──────┬───────────────┘
                                      │
                  ┌───────────────────┼────────────────┐
                  ↓                   ↓                ↓
           ┌──────────┐        ┌──────────┐     ┌──────────┐
           │ PagerDuty│        │  Slack   │     │  Email   │
           │(On-call) │        │(#alerts) │     │ (Team)   │
           └──────────┘        └──────────┘     └──────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                        Logging Pipeline                             │
│                                                                     │
│  Application → Winston → ELK Stack / Loki                           │
│                                                                     │
│  ┌──────────┐      ┌──────────────┐      ┌──────────┐               │
│  │  Logs    │ ───→ │ Elasticsearch│ ───→ │  Kibana  │               │
│  │ (JSON)   │      │   or Loki    │      │ (Search) │               │
│  └──────────┘      └──────────────┘      └──────────┘               │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                      Distributed Tracing                            │
│                                                                     │
│  Application → OpenTelemetry → Jaeger / Tempo                       │
│  (Trace spans for requests across microservices)                    │
└─────────────────────────────────────────────────────────────────────┘
```

### 1.2 Observability Goals

**Why Observability?**

- **Proactive Monitoring:** Detect issues before users report them
- **Faster Debugging:** Reduce MTTD (Mean Time to Detect) from hours to minutes
- **Performance Optimization:** Identify bottlenecks and slow queries
- **Capacity Planning:** Predict when to scale resources
- **SLA Compliance:** Monitor uptime, response times, error rates

**Key Metrics (Google's Four Golden Signals):**

1. **Latency:** Request/response time (p50, p95, p99)
2. **Traffic:** Requests per second (throughput)
3. **Errors:** Error rate (5xx responses, exceptions)
4. **Saturation:** Resource utilization (CPU, memory, disk, DB connections)

**SLOs (Service Level Objectives):**

- **Availability:** 99.9% uptime (at most 8.76 hours of downtime/year)
- **Latency:** p95 API response < 300ms
- **Error Budget:** <0.1% error rate
- **Data Durability:** Zero data loss

---

## 2. OBSERVABILITY PILLARS

### 2.1 The Three Pillars

**1. Metrics (What is happening?)**
- Quantitative measurements over time
- Examples: CPU usage, request count, response time
- Tool: Prometheus + Grafana

**2. Logs (What happened?)**
- Discrete events with context
- Examples: "User X logged in", "Query took 2.5s"
- Tool: Winston + ELK Stack / Loki

**3. 
Traces (Why did it happen?)** - Request flow across services - Examples: API call → Database query → Redis cache → Response - Tool: OpenTelemetry + Jaeger ### 2.2 Correlation ``` Example: High p99 latency alert ├── Metrics: p99 latency = 3s (threshold: 500ms) │ └── Which endpoint? /api/products │ ├── Logs: Search for slow queries in /api/products │ └── Found: SELECT * FROM inventory.stock_movements (2.8s) │ └── Traces: Trace ID abc123 shows: ├── API handler: 50ms ├── Database query: 2800ms ← Bottleneck! └── Response serialization: 150ms Root cause: Missing index on inventory.stock_movements(product_id) Fix: CREATE INDEX idx_stock_movements_product_id ON inventory.stock_movements(product_id); ``` --- ## 3. PROMETHEUS SETUP ### 3.1 Prometheus Configuration **File:** `prometheus/prometheus.yml` ```yaml global: scrape_interval: 15s # Scrape targets every 15 seconds evaluation_interval: 15s # Evaluate rules every 15 seconds scrape_timeout: 10s external_labels: cluster: 'erp-generic-prod' environment: 'production' # Alertmanager configuration alerting: alertmanagers: - static_configs: - targets: - alertmanager:9093 timeout: 10s # Load alert rules rule_files: - '/etc/prometheus/alerts/application.yml' - '/etc/prometheus/alerts/infrastructure.yml' - '/etc/prometheus/alerts/database.yml' - '/etc/prometheus/alerts/business.yml' # Scrape configurations scrape_configs: # Backend API (NestJS with Prometheus middleware) - job_name: 'erp-backend' static_configs: - targets: ['backend:3000'] labels: service: 'backend' component: 'api' metrics_path: '/metrics' scrape_interval: 15s # PostgreSQL Exporter - job_name: 'postgres' static_configs: - targets: ['postgres-exporter:9187'] labels: service: 'database' component: 'postgres' scrape_interval: 30s # Redis Exporter - job_name: 'redis' static_configs: - targets: ['redis-exporter:9121'] labels: service: 'cache' component: 'redis' scrape_interval: 30s # Node Exporter (system metrics) - job_name: 'node' static_configs: - targets: 
['node-exporter:9100'] labels: service: 'infrastructure' component: 'host' scrape_interval: 15s # Frontend (Nginx metrics) - job_name: 'nginx' static_configs: - targets: ['nginx-exporter:9113'] labels: service: 'frontend' component: 'nginx' scrape_interval: 30s # Prometheus itself (meta-monitoring) - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] labels: service: 'monitoring' component: 'prometheus' ``` ### 3.2 Docker Compose for Monitoring Stack **File:** `docker-compose.monitoring.yml` ```yaml version: '3.9' services: prometheus: image: prom/prometheus:v2.47.0 container_name: erp-prometheus volumes: - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro - ./prometheus/alerts:/etc/prometheus/alerts:ro - prometheus_data:/prometheus command: - '--config.file=/etc/prometheus/prometheus.yml' - '--storage.tsdb.path=/prometheus' - '--storage.tsdb.retention.time=15d' - '--web.console.libraries=/usr/share/prometheus/console_libraries' - '--web.console.templates=/usr/share/prometheus/consoles' - '--web.enable-lifecycle' ports: - "9090:9090" networks: - monitoring restart: always alertmanager: image: prom/alertmanager:v0.26.0 container_name: erp-alertmanager volumes: - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro - alertmanager_data:/alertmanager command: - '--config.file=/etc/alertmanager/alertmanager.yml' - '--storage.path=/alertmanager' ports: - "9093:9093" networks: - monitoring restart: always grafana: image: grafana/grafana:10.1.0 container_name: erp-grafana environment: - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin} - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin} - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel - GF_SERVER_ROOT_URL=https://grafana.erp-generic.com - GF_SMTP_ENABLED=true - GF_SMTP_HOST=${SMTP_HOST}:${SMTP_PORT} - GF_SMTP_USER=${SMTP_USER} - GF_SMTP_PASSWORD=${SMTP_PASSWORD} volumes: - grafana_data:/var/lib/grafana - 
./grafana/provisioning:/etc/grafana/provisioning:ro - ./grafana/dashboards:/var/lib/grafana/dashboards:ro ports: - "3001:3000" networks: - monitoring depends_on: - prometheus restart: always postgres-exporter: image: prometheuscommunity/postgres-exporter:v0.14.0 container_name: erp-postgres-exporter environment: DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}?sslmode=disable" ports: - "9187:9187" networks: - monitoring - erp-network restart: always redis-exporter: image: oliver006/redis_exporter:v1.54.0 container_name: erp-redis-exporter environment: REDIS_ADDR: "redis:6379" REDIS_PASSWORD: ${REDIS_PASSWORD} ports: - "9121:9121" networks: - monitoring - erp-network restart: always node-exporter: image: prom/node-exporter:v1.6.1 container_name: erp-node-exporter command: - '--path.procfs=/host/proc' - '--path.sysfs=/host/sys' - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)' volumes: - /proc:/host/proc:ro - /sys:/host/sys:ro - /:/rootfs:ro ports: - "9100:9100" networks: - monitoring restart: always volumes: prometheus_data: alertmanager_data: grafana_data: networks: monitoring: name: erp-monitoring erp-network: external: true name: erp-network-internal ``` ### 3.3 Backend Metrics Instrumentation **File:** `backend/src/common/metrics/metrics.module.ts` ```typescript import { Module } from '@nestjs/common'; import { PrometheusModule } from '@willsoto/nestjs-prometheus'; import { MetricsService } from './metrics.service'; @Module({ imports: [ PrometheusModule.register({ path: '/metrics', defaultMetrics: { enabled: true, config: { prefix: 'erp_', }, }, }), ], providers: [MetricsService], exports: [MetricsService], }) export class MetricsModule {} ``` **File:** `backend/src/common/metrics/metrics.service.ts` ```typescript import { Injectable } from '@nestjs/common'; import { Counter, Histogram, Gauge, Registry } from 'prom-client'; @Injectable() export class MetricsService { private 
registry: Registry; // HTTP Metrics private httpRequestDuration: Histogram; private httpRequestTotal: Counter; private httpRequestErrors: Counter; // Database Metrics private dbQueryDuration: Histogram; private dbConnectionsActive: Gauge; private dbQueryErrors: Counter; // Business Metrics private salesOrdersCreated: Counter; private purchaseOrdersCreated: Counter; private invoicesGenerated: Counter; private inventoryMovements: Counter; // Cache Metrics private cacheHits: Counter; private cacheMisses: Counter; // Authentication Metrics private loginAttempts: Counter; private loginFailures: Counter; private activeUsers: Gauge; constructor() { this.registry = new Registry(); this.initializeMetrics(); } private initializeMetrics() { // HTTP Request Duration this.httpRequestDuration = new Histogram({ name: 'erp_http_request_duration_seconds', help: 'Duration of HTTP requests in seconds', labelNames: ['method', 'route', 'status_code'], buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5], }); // HTTP Request Total this.httpRequestTotal = new Counter({ name: 'erp_http_requests_total', help: 'Total number of HTTP requests', labelNames: ['method', 'route', 'status_code'], }); // HTTP Request Errors this.httpRequestErrors = new Counter({ name: 'erp_http_request_errors_total', help: 'Total number of HTTP request errors', labelNames: ['method', 'route', 'error_type'], }); // Database Query Duration this.dbQueryDuration = new Histogram({ name: 'erp_db_query_duration_seconds', help: 'Duration of database queries in seconds', labelNames: ['operation', 'table'], buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2], }); // Database Active Connections this.dbConnectionsActive = new Gauge({ name: 'erp_db_connections_active', help: 'Number of active database connections', }); // Database Query Errors this.dbQueryErrors = new Counter({ name: 'erp_db_query_errors_total', help: 'Total number of database query errors', labelNames: ['operation', 'error_type'], }); // Business Metrics - Sales 
Orders this.salesOrdersCreated = new Counter({ name: 'erp_sales_orders_created_total', help: 'Total number of sales orders created', labelNames: ['tenant_id', 'status'], }); // Business Metrics - Purchase Orders this.purchaseOrdersCreated = new Counter({ name: 'erp_purchase_orders_created_total', help: 'Total number of purchase orders created', labelNames: ['tenant_id', 'status'], }); // Business Metrics - Invoices this.invoicesGenerated = new Counter({ name: 'erp_invoices_generated_total', help: 'Total number of invoices generated', labelNames: ['tenant_id', 'type'], }); // Business Metrics - Inventory Movements this.inventoryMovements = new Counter({ name: 'erp_inventory_movements_total', help: 'Total number of inventory movements', labelNames: ['tenant_id', 'type'], }); // Cache Hits this.cacheHits = new Counter({ name: 'erp_cache_hits_total', help: 'Total number of cache hits', labelNames: ['cache_key'], }); // Cache Misses this.cacheMisses = new Counter({ name: 'erp_cache_misses_total', help: 'Total number of cache misses', labelNames: ['cache_key'], }); // Login Attempts this.loginAttempts = new Counter({ name: 'erp_login_attempts_total', help: 'Total number of login attempts', labelNames: ['tenant_id', 'method'], }); // Login Failures this.loginFailures = new Counter({ name: 'erp_login_failures_total', help: 'Total number of failed login attempts', labelNames: ['tenant_id', 'reason'], }); // Active Users this.activeUsers = new Gauge({ name: 'erp_active_users', help: 'Number of currently active users', labelNames: ['tenant_id'], }); // Register all metrics this.registry.registerMetric(this.httpRequestDuration); this.registry.registerMetric(this.httpRequestTotal); this.registry.registerMetric(this.httpRequestErrors); this.registry.registerMetric(this.dbQueryDuration); this.registry.registerMetric(this.dbConnectionsActive); this.registry.registerMetric(this.dbQueryErrors); this.registry.registerMetric(this.salesOrdersCreated); 
this.registry.registerMetric(this.purchaseOrdersCreated); this.registry.registerMetric(this.invoicesGenerated); this.registry.registerMetric(this.inventoryMovements); this.registry.registerMetric(this.cacheHits); this.registry.registerMetric(this.cacheMisses); this.registry.registerMetric(this.loginAttempts); this.registry.registerMetric(this.loginFailures); this.registry.registerMetric(this.activeUsers); } // Public methods to record metrics recordHttpRequest(method: string, route: string, statusCode: number, duration: number) { this.httpRequestDuration.observe({ method, route, status_code: statusCode }, duration); this.httpRequestTotal.inc({ method, route, status_code: statusCode }); } recordHttpError(method: string, route: string, errorType: string) { this.httpRequestErrors.inc({ method, route, error_type: errorType }); } recordDbQuery(operation: string, table: string, duration: number) { this.dbQueryDuration.observe({ operation, table }, duration); } recordDbError(operation: string, errorType: string) { this.dbQueryErrors.inc({ operation, error_type: errorType }); } setDbConnectionsActive(count: number) { this.dbConnectionsActive.set(count); } recordSalesOrder(tenantId: string, status: string) { this.salesOrdersCreated.inc({ tenant_id: tenantId, status }); } recordPurchaseOrder(tenantId: string, status: string) { this.purchaseOrdersCreated.inc({ tenant_id: tenantId, status }); } recordInvoice(tenantId: string, type: string) { this.invoicesGenerated.inc({ tenant_id: tenantId, type }); } recordInventoryMovement(tenantId: string, type: string) { this.inventoryMovements.inc({ tenant_id: tenantId, type }); } recordCacheHit(key: string) { this.cacheHits.inc({ cache_key: key }); } recordCacheMiss(key: string) { this.cacheMisses.inc({ cache_key: key }); } recordLoginAttempt(tenantId: string, method: string) { this.loginAttempts.inc({ tenant_id: tenantId, method }); } recordLoginFailure(tenantId: string, reason: string) { this.loginFailures.inc({ tenant_id: tenantId, 
reason });
  }

  setActiveUsers(tenantId: string, count: number) {
    this.activeUsers.set({ tenant_id: tenantId }, count);
  }

  // prom-client's Registry.metrics() has returned a Promise since v13,
  // so this method must be async.
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
```

**File:** `backend/src/common/interceptors/metrics.interceptor.ts`

```typescript
import { Injectable, NestInterceptor, ExecutionContext, CallHandler } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from '../metrics/metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private metricsService: MetricsService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const startTime = Date.now();

    return next.handle().pipe(
      tap({
        next: () => {
          const response = context.switchToHttp().getResponse();
          const duration = (Date.now() - startTime) / 1000; // Convert to seconds
          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode,
            duration,
          );
        },
        error: (error) => {
          const duration = (Date.now() - startTime) / 1000;
          const response = context.switchToHttp().getResponse();
          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode || 500,
            duration,
          );
          this.metricsService.recordHttpError(
            request.method,
            request.route?.path || request.url,
            error.name || 'UnknownError',
          );
        },
      }),
    );
  }
}
```

---

## 4. 
GRAFANA DASHBOARDS

### 4.1 Dashboard Provisioning

**File:** `grafana/provisioning/datasources/prometheus.yml`

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      queryTimeout: "60s"
      httpMethod: "POST"
```

**File:** `grafana/provisioning/dashboards/dashboard-provider.yml`

```yaml
apiVersion: 1

providers:
  - name: 'ERP Generic Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```

### 4.2 Dashboard 1: Application Performance

**File:** `grafana/dashboards/application-performance.json` (Simplified structure)

```json
{
  "dashboard": {
    "title": "ERP Generic - Application Performance",
    "tags": ["erp", "application", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m])) * 1000",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 300, "color": "yellow" },
          { "value": 500, "color": "red" }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (route) (rate(erp_http_request_errors_total[5m])) / sum by (route) (rate(erp_http_requests_total[5m])) * 100",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      },
      {
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (route) (rate(erp_http_request_duration_seconds_sum[5m])) / sum by (route) (rate(erp_http_request_duration_seconds_count[5m])))",
            "format": "table"
          }
        ]
      },
      {
        "title": "Active Users by Tenant",
        "type": "graph",
        "targets": [
          {
            "expr": "erp_active_users",
            "legendFormat": "{{tenant_id}}"
          }
        ]
      },
      {
        "title": "Cache Hit Rate (%)",
        "type": "stat",
"targets": [ { "expr": "rate(erp_cache_hits_total[5m]) / (rate(erp_cache_hits_total[5m]) + rate(erp_cache_misses_total[5m])) * 100" } ] } ] } } ``` **Key Panels:** 1. **Request Rate:** Total requests per second (by method and route) 2. **P95 Latency:** 95th percentile response time (threshold: 300ms yellow, 500ms red) 3. **Error Rate:** Percentage of failed requests (threshold: 1% yellow, 5% red) 4. **Top 10 Slowest Endpoints:** Identify performance bottlenecks 5. **Active Users by Tenant:** Real-time active user count per tenant 6. **Cache Hit Rate:** Percentage of cache hits (target: >80%) ### 4.3 Dashboard 2: Database Performance **Key Panels:** 1. **Database Connections:** Active vs. max connections 2. **Query Duration P95:** 95th percentile query time by table 3. **Slow Queries:** Queries taking >1 second 4. **Transactions per Second:** TPS rate 5. **Database Size:** Disk usage by schema 6. **Index Usage:** Most and least used indexes 7. **Lock Waits:** Blocking queries 8. **Replication Lag:** Lag between primary and replicas (if applicable) **Example Queries:** ```promql # Active connections pg_stat_database_numbackends{datname="erp_generic"} # Slow queries (>1s) rate(pg_stat_statements_mean_exec_time{datname="erp_generic"}[5m]) > 1000 # Database size pg_database_size_bytes{datname="erp_generic"} # TPS rate(pg_stat_database_xact_commit{datname="erp_generic"}[5m]) + rate(pg_stat_database_xact_rollback{datname="erp_generic"}[5m]) ``` ### 4.4 Dashboard 3: Business Metrics **Key Panels:** 1. **Sales Orders Created (Today):** Total sales orders by status 2. **Purchase Orders Created (Today):** Total purchase orders by status 3. **Revenue Trend (Last 30 days):** Daily revenue by tenant 4. **Invoices Generated (Today):** Total invoices by type (customer/supplier) 5. **Inventory Movements (Today):** Stock in/out movements 6. **Top 10 Customers by Revenue:** Revenue breakdown 7. **Order Fulfillment Rate:** Percentage of orders fulfilled on time 8. 
**Average Order Value:** Mean order value by tenant **Example Queries:** ```promql # Sales orders created today increase(erp_sales_orders_created_total[1d]) # Revenue trend (requires custom metric) sum by (tenant_id) (rate(erp_sales_order_amount_sum[1d])) # Top 10 customers by revenue topk(10, sum by (customer_id) (erp_sales_order_amount_sum)) ``` --- ## 5. ALERT RULES ### 5.1 Alertmanager Configuration **File:** `alertmanager/alertmanager.yml` ```yaml global: resolve_timeout: 5m smtp_smarthost: '${SMTP_HOST}:${SMTP_PORT}' smtp_from: 'alertmanager@erp-generic.com' smtp_auth_username: '${SMTP_USER}' smtp_auth_password: '${SMTP_PASSWORD}' slack_api_url: '${SLACK_WEBHOOK_URL}' pagerduty_url: 'https://events.pagerduty.com/v2/enqueue' # Route alerts to different receivers route: receiver: 'default' group_by: ['alertname', 'cluster', 'service'] group_wait: 10s group_interval: 10s repeat_interval: 12h routes: # Critical alerts → PagerDuty (on-call) - receiver: 'pagerduty' match: severity: critical continue: true # All alerts → Slack - receiver: 'slack' match_re: severity: critical|warning # Database alerts → DBA team - receiver: 'dba-email' match: component: postgres # Security alerts → Security team - receiver: 'security-email' match_re: alertname: '.*Security.*' # Inhibition rules (suppress alerts) inhibit_rules: # Suppress warning if critical already firing - source_match: severity: 'critical' target_match: severity: 'warning' equal: ['alertname', 'instance'] receivers: - name: 'default' email_configs: - to: 'devops@erp-generic.com' headers: Subject: '[ERP Alert] {{ .GroupLabels.alertname }}' - name: 'pagerduty' pagerduty_configs: - service_key: '${PAGERDUTY_SERVICE_KEY}' description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}' - name: 'slack' slack_configs: - channel: '#erp-alerts' title: '{{ .GroupLabels.alertname }}' text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}' color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' - 
name: 'dba-email'
    email_configs:
      - to: 'dba@erp-generic.com'
        headers:
          Subject: '[Database Alert] {{ .GroupLabels.alertname }}'

  - name: 'security-email'
    email_configs:
      - to: 'security@erp-generic.com'
        headers:
          Subject: '[SECURITY ALERT] {{ .GroupLabels.alertname }}'
          Priority: 'urgent'
```

### 5.2 Application Alert Rules

**File:** `prometheus/alerts/application.yml`

```yaml
groups:
  - name: erp_application_alerts
    interval: 30s
    rules:
      # High Error Rate (aggregate with sum: errors and requests carry
      # different label sets, so a direct division would match nothing)
      - alert: HighErrorRate
        expr: |
          (sum by (route) (rate(erp_http_request_errors_total[5m]))
            / sum by (route) (rate(erp_http_requests_total[5m]))) > 0.05
        for: 5m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "High error rate detected on {{ $labels.route }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-error-rate"

      # High P95 Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
          component: backend
        annotations:
          summary: "High P95 latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s (threshold: 500ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-latency"

      # Service Down
      - alert: ServiceDown
        expr: up{job="erp-backend"} == 0
        for: 2m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "Backend service is down"
          description: "Backend {{ $labels.instance }} has been down for more than 2 minutes"
          runbook: "https://wiki.erp-generic.com/runbooks/service-down"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        labels: 
severity: warning component: infrastructure annotations: summary: "High memory usage on {{ $labels.instance }}" description: "Memory usage is {{ $value | humanizePercentage }} (threshold: 85%)" # Disk Space Low - alert: DiskSpaceLow expr: | (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15 for: 5m labels: severity: warning component: infrastructure annotations: summary: "Low disk space on {{ $labels.instance }}" description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} free" # Too Many Requests (DDoS protection) - alert: TooManyRequests expr: | rate(erp_http_requests_total[1m]) > 10000 for: 2m labels: severity: critical component: security annotations: summary: "Abnormally high request rate detected" description: "Request rate is {{ $value }} req/s (threshold: 10000 req/s). Possible DDoS attack." runbook: "https://wiki.erp-generic.com/runbooks/ddos-attack" # Low Cache Hit Rate - alert: LowCacheHitRate expr: | (rate(erp_cache_hits_total[5m]) / (rate(erp_cache_hits_total[5m]) + rate(erp_cache_misses_total[5m]))) < 0.6 for: 15m labels: severity: warning component: cache annotations: summary: "Low cache hit rate" description: "Cache hit rate is {{ $value | humanizePercentage }} (threshold: 60%)" ``` ### 5.3 Database Alert Rules **File:** `prometheus/alerts/database.yml` ```yaml groups: - name: erp_database_alerts interval: 30s rules: # Database Down - alert: DatabaseDown expr: pg_up == 0 for: 1m labels: severity: critical component: postgres annotations: summary: "PostgreSQL is down" description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute" runbook: "https://wiki.erp-generic.com/runbooks/database-down" # Connection Pool Exhausted - alert: ConnectionPoolExhausted expr: | (pg_stat_database_numbackends / pg_settings_max_connections) > 0.9 for: 2m labels: severity: critical component: postgres annotations: summary: "Database connection pool almost exhausted" description: "{{ $labels.datname }} is 
using {{ $value | humanizePercentage }} of max connections"
          runbook: "https://wiki.erp-generic.com/runbooks/connection-pool-exhausted"

      # Slow Queries (mean_exec_time is a gauge in ms, so compare it
      # directly — rate() only applies to counters)
      - alert: SlowQueries
        expr: |
          pg_stat_statements_mean_exec_time > 1000
        for: 10m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Slow database queries detected"
          description: "Mean query execution time is {{ $value }}ms (threshold: 1000ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/slow-queries"

      # High Number of Deadlocks
      - alert: HighDeadlocks
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 5
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "High number of database deadlocks"
          description: "Deadlock rate is {{ $value }}/s (threshold: 5/s)"

      # Replication Lag (if using replicas)
      - alert: ReplicationLag
        expr: |
          pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Database replication lag is high"
          description: "Replication lag is {{ $value }}s (threshold: 60s)"

      # Disk Space Low (Database)
      - alert: DatabaseDiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) < 0.15
        for: 5m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database disk space is low"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          runbook: "https://wiki.erp-generic.com/runbooks/database-disk-full"
```

### 5.4 Business Alert Rules

**File:** `prometheus/alerts/business.yml`

```yaml
groups:
  - name: erp_business_alerts
    interval: 1m
    rules:
      # No Sales Orders Created (Business Hours)
      - alert: NoSalesOrdersCreated
        expr: |
          increase(erp_sales_orders_created_total[1h]) == 0
            and ON() hour() >= 9 and ON() hour() < 18
        for: 1h
        labels:
          severity: warning
          component: business
        annotations:
          summary: "No sales orders created in the last hour during business hours"
          description: "This might indicate a problem with the order 
creation system" # High Order Cancellation Rate - alert: HighOrderCancellationRate expr: | (rate(erp_sales_orders_created_total{status="cancelled"}[1h]) / rate(erp_sales_orders_created_total[1h])) > 0.2 for: 30m labels: severity: warning component: business annotations: summary: "High order cancellation rate" description: "{{ $value | humanizePercentage }} of orders are being cancelled (threshold: 20%)" # Failed Login Spike - alert: FailedLoginSpike expr: | rate(erp_login_failures_total[5m]) > 10 for: 5m labels: severity: warning component: security annotations: summary: "Spike in failed login attempts" description: "{{ $value }} failed logins per second (threshold: 10/s). Possible brute-force attack." runbook: "https://wiki.erp-generic.com/runbooks/brute-force-attack" ``` --- ## 6. LOGGING STRATEGY ### 6.1 Winston Configuration **File:** `backend/src/common/logger/logger.service.ts` ```typescript import { Injectable, LoggerService as NestLoggerService } from '@nestjs/common'; import * as winston from 'winston'; import 'winston-daily-rotate-file'; @Injectable() export class LoggerService implements NestLoggerService { private logger: winston.Logger; constructor() { this.logger = winston.createLogger({ level: process.env.LOG_LEVEL || 'info', format: winston.format.combine( winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }), winston.format.errors({ stack: true }), winston.format.splat(), winston.format.json(), ), defaultMeta: { service: 'erp-generic-backend', environment: process.env.NODE_ENV, }, transports: [ // Console transport (for development) new winston.transports.Console({ format: winston.format.combine( winston.format.colorize(), winston.format.printf(({ timestamp, level, message, context, ...meta }) => { return `${timestamp} [${level}] [${context || 'Application'}] ${message} ${ Object.keys(meta).length ? 
JSON.stringify(meta, null, 2) : '' }`; }), ), }), // File transport - All logs new winston.transports.DailyRotateFile({ filename: 'logs/application-%DATE%.log', datePattern: 'YYYY-MM-DD', maxSize: '20m', maxFiles: '14d', zippedArchive: true, }), // File transport - Error logs only new winston.transports.DailyRotateFile({ level: 'error', filename: 'logs/error-%DATE%.log', datePattern: 'YYYY-MM-DD', maxSize: '20m', maxFiles: '30d', zippedArchive: true, }), // File transport - Audit logs (security events) new winston.transports.DailyRotateFile({ filename: 'logs/audit-%DATE%.log', datePattern: 'YYYY-MM-DD', maxSize: '50m', maxFiles: '90d', // Keep for 90 days (compliance) zippedArchive: true, }), ], }); // Add Elasticsearch/Loki transport for production if (process.env.NODE_ENV === 'production') { // Example: Winston-Elasticsearch // this.logger.add(new WinstonElasticsearch({ // level: 'info', // clientOpts: { // node: process.env.ELASTICSEARCH_URL, // auth: { // username: process.env.ELASTICSEARCH_USER, // password: process.env.ELASTICSEARCH_PASSWORD, // }, // }, // index: 'erp-generic-logs', // })); } } log(message: string, context?: string, meta?: any) { this.logger.info(message, { context, ...meta }); } error(message: string, trace?: string, context?: string, meta?: any) { this.logger.error(message, { trace, context, ...meta }); } warn(message: string, context?: string, meta?: any) { this.logger.warn(message, { context, ...meta }); } debug(message: string, context?: string, meta?: any) { this.logger.debug(message, { context, ...meta }); } verbose(message: string, context?: string, meta?: any) { this.logger.verbose(message, { context, ...meta }); } // Audit logging (security-sensitive events) audit(event: string, userId: string, tenantId: string, details: any) { this.logger.info('AUDIT_EVENT', { event, userId, tenantId, details, timestamp: new Date().toISOString(), ip: details.ip, userAgent: details.userAgent, }); } } ``` ### 6.2 Structured Logging Examples 
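Because every transport above shares `winston.format.json()`, each entry ultimately serializes to one flat JSON object per line. A minimal, framework-free sketch of the shape the `audit()` method produces — `buildAuditEntry` is a hypothetical helper written only for this illustration, not part of the service:

```typescript
// Sketch of the flat JSON object LoggerService.audit() emits.
// Assumption: the field layout mirrors the audit() implementation above;
// buildAuditEntry itself is hypothetical and exists only for this example.
interface AuditDetails {
  ip?: string;
  userAgent?: string;
  [key: string]: unknown;
}

function buildAuditEntry(
  event: string,
  userId: string,
  tenantId: string,
  details: AuditDetails,
) {
  return {
    message: 'AUDIT_EVENT',
    event,
    userId,
    tenantId,
    details,
    timestamp: new Date().toISOString(),
    ip: details.ip,
    userAgent: details.userAgent,
  };
}

// One line of logs/audit-YYYY-MM-DD.log would carry an entry like this:
console.log(
  JSON.stringify(
    buildAuditEntry('USER_LOGIN', 'user-1', 'tenant-abc', {
      method: 'email',
      ip: '203.0.113.7',
      userAgent: 'Mozilla/5.0',
    }),
  ),
);
```

The snippets below show the corresponding call sites in application code.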
```typescript
// Login attempt
logger.audit('USER_LOGIN', userId, tenantId, {
  method: 'email',
  ip: request.ip,
  userAgent: request.headers['user-agent'],
  success: true,
});

// Database query
logger.debug('DB_QUERY', 'DatabaseService', {
  operation: 'SELECT',
  table: 'auth.users',
  duration: 45, // ms
  rowCount: 1,
});

// API request
logger.log('HTTP_REQUEST', 'HttpMiddleware', {
  method: 'POST',
  path: '/api/sales/orders',
  statusCode: 201,
  duration: 234, // ms
  userId: '123e4567-e89b-12d3-a456-426614174000',
  tenantId: 'tenant-abc',
});

// Error with stack trace
logger.error('ORDER_CREATION_FAILED', error.stack, 'OrderService', {
  orderId: '123',
  tenantId: 'tenant-abc',
  error: error.message,
});
```

### 6.3 Log Aggregation (ELK Stack)

**Docker Compose for ELK Stack:**

```yaml
version: '3.9'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    container_name: erp-elasticsearch
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
      - xpack.security.enabled=false
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - monitoring
    restart: always

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    container_name: erp-logstash
    volumes:
      - ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
    ports:
      - "5044:5044"
    environment:
      LS_JAVA_OPTS: "-Xmx512m -Xms512m"
    networks:
      - monitoring
    depends_on:
      - elasticsearch
    restart: always

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    container_name: erp-kibana
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
    networks:
      - monitoring
    depends_on:
      - elasticsearch
    restart: always

volumes:
  elasticsearch_data:

networks:
  monitoring:
    external: true
    name: erp-monitoring
```

**Logstash Configuration:**

```conf
input {
  file {
    path => "/var/log/erp-generic/application-*.log"
    type => "application"
    codec => json
    start_position => "beginning"
  }
  file {
    path => "/var/log/erp-generic/error-*.log"
    type => "error"
    codec => json
    start_position => "beginning"
  }
  file {
    path => "/var/log/erp-generic/audit-*.log"
    type => "audit"
    codec => json
    start_position => "beginning"
  }
}

filter {
  # Parse timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  # Add geoip for IP addresses
  if [ip] {
    geoip {
      source => "ip"
      target => "geoip"
    }
  }

  # Extract tenant_id as a field
  if [tenantId] {
    mutate {
      add_field => { "tenant" => "%{tenantId}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "erp-generic-logs-%{+YYYY.MM.dd}"
  }

  # Debug output (optional)
  stdout { codec => rubydebug }
}
```

---

## 7. APPLICATION PERFORMANCE MONITORING (APM)

### 7.1 Custom Metrics Endpoints

**File:** `backend/src/metrics/metrics.controller.ts`

```typescript
import { Controller, Get } from '@nestjs/common';
import { MetricsService } from '../common/metrics/metrics.service';
import { PrismaService } from '../common/prisma/prisma.service';

@Controller('metrics')
export class MetricsController {
  constructor(
    private metricsService: MetricsService,
    private prisma: PrismaService,
  ) {}

  @Get()
  getMetrics() {
    return this.metricsService.getMetrics();
  }

  @Get('business')
  async getBusinessMetrics() {
    // Aggregate business metrics from database
    const [salesOrders, purchaseOrders, invoices, activeUsers] = await Promise.all([
      this.prisma.salesOrder.count(),
      this.prisma.purchaseOrder.count(),
      this.prisma.invoice.count(),
      this.prisma.user.count({ where: { status: 'active' } }),
    ]);

    return {
      sales_orders_total: salesOrders,
      purchase_orders_total: purchaseOrders,
      invoices_total: invoices,
      active_users_total: activeUsers,
    };
  }
}
```

### 7.2 Performance Profiling

**Prisma Query Logging:**

```typescript
// prisma/prisma.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { LoggerService } from '../logger/logger.service';

@Injectable()
export class PrismaService extends
  PrismaClient implements OnModuleInit {
  constructor(private logger: LoggerService) {
    super({
      log: [
        { emit: 'event', level: 'query' },
        { emit: 'event', level: 'error' },
        { emit: 'event', level: 'warn' },
      ],
    });

    // Log slow queries (>100ms)
    this.$on('query' as never, (e: any) => {
      if (e.duration > 100) {
        this.logger.warn('SLOW_QUERY', 'PrismaService', {
          query: e.query,
          duration: e.duration,
          params: e.params,
        });
      }
    });

    // Log query errors
    this.$on('error' as never, (e: any) => {
      this.logger.error('DB_ERROR', e.message, 'PrismaService', {
        target: e.target,
      });
    });
  }

  async onModuleInit() {
    await this.$connect();
  }
}
```

---

## 8. HEALTH CHECKS

### 8.1 Health Check Endpoints

```typescript
// health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  PrismaHealthIndicator,
  MemoryHealthIndicator,
  DiskHealthIndicator,
} from '@nestjs/terminus';
import { RedisHealthIndicator } from './redis.health';
import { PrismaService } from '../common/prisma/prisma.service';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: PrismaHealthIndicator,
    private prisma: PrismaService, // PrismaHealthIndicator needs the client instance
    private redis: RedisHealthIndicator,
    private memory: MemoryHealthIndicator,
    private disk: DiskHealthIndicator,
  ) {}

  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      () => this.db.pingCheck('database', this.prisma, { timeout: 3000 }),
      () => this.redis.isHealthy('redis'),
      () => this.memory.checkHeap('memory_heap', 200 * 1024 * 1024),
      () => this.disk.checkStorage('disk', { path: '/', thresholdPercent: 0.9 }),
    ]);
  }

  @Get('live')
  liveness() {
    return { status: 'ok', timestamp: new Date().toISOString() };
  }

  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', this.prisma),
      () => this.redis.isHealthy('redis'),
    ]);
  }
}
```

---

## 9. DISTRIBUTED TRACING

### 9.1 OpenTelemetry Setup

```typescript
// tracing.ts (bootstrap file, imported before the app starts)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

---

## 10. ON-CALL & INCIDENT RESPONSE

### 10.1 On-Call Rotation

- **Primary On-Call:** DevOps Engineer (24/7)
- **Secondary On-Call:** Backend Lead
- **Escalation Path:** CTO → CEO

### 10.2 Incident Severity

| Severity | Response Time | Examples |
|----------|---------------|----------|
| **P0 (Critical)** | 15 min | System down, data loss |
| **P1 (High)** | 1 hour | Major feature broken |
| **P2 (Medium)** | 4 hours | Minor feature broken |
| **P3 (Low)** | 24 hours | Cosmetic issue |

---

## 11. REFERENCES

- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)

---

**Document:** MONITORING-OBSERVABILITY.md
**Version:** 1.0
**Last Updated:** 2025-11-24