MONITORING & OBSERVABILITY - ERP Generic

Last updated: 2025-11-24 | Owner: DevOps Team / SRE Team | Status: Production-Ready


TABLE OF CONTENTS

  1. Overview
  2. Observability Pillars
  3. Prometheus Setup
  4. Grafana Dashboards
  5. Alert Rules
  6. Logging Strategy
  7. Application Performance Monitoring (APM)
  8. Health Checks
  9. Distributed Tracing
  10. On-Call & Incident Response
  11. References

1. OVERVIEW

1.1 Monitoring Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         Application Layer                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │ Backend  │  │ Frontend │  │ Postgres │  │  Redis   │           │
│  │ (Metrics)│  │ (Metrics)│  │(Exporter)│  │(Exporter)│           │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘           │
│       │             │             │             │                   │
│       └─────────────┴─────────────┴─────────────┘                   │
│                            │                                         │
└────────────────────────────┼─────────────────────────────────────────┘
                             │ (Scrape metrics every 15s)
                             ↓
┌─────────────────────────────────────────────────────────────────────┐
│                      Prometheus (TSDB)                              │
│  - Collects metrics from all targets                               │
│  - Evaluates alert rules                                           │
│  - Stores time-series data (15 days retention)                     │
└────────┬────────────────────────────────┬─────────────────────────┘
         │                                │
         │ (Query metrics)                │ (Send alerts)
         ↓                                ↓
┌─────────────────────┐         ┌──────────────────────┐
│   Grafana           │         │  Alertmanager        │
│   - Dashboards      │         │  - Route alerts      │
│   - Visualization   │         │  - Deduplication     │
│   - Alerting        │         │  - Silencing         │
└─────────────────────┘         └──────┬───────────────┘
                                       │
                   ┌───────────────────┼────────────────┐
                   ↓                   ↓                ↓
            ┌──────────┐        ┌──────────┐    ┌──────────┐
            │ PagerDuty│        │  Slack   │    │  Email   │
            │(On-call) │        │(#alerts) │    │(Team)    │
            └──────────┘        └──────────┘    └──────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                      Logging Pipeline                               │
│                                                                     │
│  Application → Winston → ELK Stack / Loki                          │
│                                                                     │
│  ┌──────────┐      ┌──────────────┐      ┌──────────┐             │
│  │  Logs    │ ───→ │ Elasticsearch│ ───→ │  Kibana  │             │
│  │(JSON)    │      │  or Loki     │      │(Search)  │             │
│  └──────────┘      └──────────────┘      └──────────┘             │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                     Distributed Tracing                             │
│                                                                     │
│  Application → OpenTelemetry → Jaeger / Tempo                      │
│  (Trace spans for requests across microservices)                   │
└─────────────────────────────────────────────────────────────────────┘

1.2 Observability Goals

Why Observability?

  • Proactive Monitoring: Detect issues before users report them
  • Faster Debugging: Reduce MTTD (Mean Time to Detect) from hours to minutes
  • Performance Optimization: Identify bottlenecks and slow queries
  • Capacity Planning: Predict when to scale resources
  • SLA Compliance: Monitor uptime, response times, error rates

Key Metrics (Google's Four Golden Signals):

  1. Latency: Request/response time (p50, p95, p99)
  2. Traffic: Requests per second (throughput)
  3. Errors: Error rate (5xx responses, exceptions)
  4. Saturation: Resource utilization (CPU, memory, disk, DB connections)
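With the erp_-prefixed metrics instrumented in Section 3.3, each golden signal maps to a short PromQL query. This is a sketch; the status_code label regex assumes 5xx responses are labeled as shown later in this document:

```promql
# Latency: p95 request duration over the last 5 minutes
histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m]))

# Traffic: total requests per second
sum(rate(erp_http_requests_total[5m]))

# Errors: fraction of 5xx responses
sum(rate(erp_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(erp_http_requests_total[5m]))

# Saturation: active DB connections (pair with node CPU/memory metrics)
erp_db_connections_active
```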

SLOs (Service Level Objectives):

  • Availability: 99.9% uptime (8.76 hours downtime/year)
  • Latency: p95 API response < 300ms
  • Error Budget: <0.1% error rate
  • Data Durability: Zero data loss
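A lightweight way to track these SLOs continuously is a pair of Prometheus recording rules. This is a sketch; the group and record names are illustrative, and the metrics are the erp_-prefixed ones defined in Section 3.3:

```yaml
groups:
  - name: erp_slo_recording_rules
    interval: 30s
    rules:
      # Error ratio over 5m (compare against the 0.1% error budget)
      - record: erp:http_error_ratio:rate5m
        expr: |
          sum(rate(erp_http_request_errors_total[5m]))
            / sum(rate(erp_http_requests_total[5m]))

      # p95 latency over 5m (compare against the 300ms latency SLO)
      - record: erp:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(erp_http_request_duration_seconds_bucket[5m])))
```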

2. OBSERVABILITY PILLARS

2.1 The Three Pillars

1. Metrics (What is happening?)

  • Quantitative measurements over time
  • Examples: CPU usage, request count, response time
  • Tool: Prometheus + Grafana

2. Logs (What happened?)

  • Discrete events with context
  • Examples: "User X logged in", "Query took 2.5s"
  • Tool: Winston + ELK Stack / Loki

3. Traces (Why did it happen?)

  • Request flow across services
  • Examples: API call → Database query → Redis cache → Response
  • Tool: OpenTelemetry + Jaeger

2.2 Correlation

Example: High p99 latency alert
├── Metrics: p99 latency = 3s (threshold: 500ms)
│   └── Which endpoint? /api/products
│
├── Logs: Search for slow queries in /api/products
│   └── Found: SELECT * FROM inventory.stock_movements (2.8s)
│
└── Traces: Trace ID abc123 shows:
    ├── API handler: 50ms
    ├── Database query: 2800ms ← Bottleneck!
    └── Response serialization: 150ms

Root cause: Missing index on inventory.stock_movements(product_id)
Fix: CREATE INDEX idx_stock_movements_product_id ON inventory.stock_movements(product_id);

3. PROMETHEUS SETUP

3.1 Prometheus Configuration

File: prometheus/prometheus.yml

global:
  scrape_interval: 15s              # Scrape targets every 15 seconds
  evaluation_interval: 15s          # Evaluate rules every 15 seconds
  scrape_timeout: 10s
  external_labels:
    cluster: 'erp-generic-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s

# Load alert rules
rule_files:
  - '/etc/prometheus/alerts/application.yml'
  - '/etc/prometheus/alerts/infrastructure.yml'
  - '/etc/prometheus/alerts/database.yml'
  - '/etc/prometheus/alerts/business.yml'

# Scrape configurations
scrape_configs:
  # Backend API (NestJS with Prometheus middleware)
  - job_name: 'erp-backend'
    static_configs:
      - targets: ['backend:3000']
        labels:
          service: 'backend'
          component: 'api'
    metrics_path: '/metrics'
    scrape_interval: 15s

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          service: 'database'
          component: 'postgres'
    scrape_interval: 30s

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          service: 'cache'
          component: 'redis'
    scrape_interval: 30s

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'infrastructure'
          component: 'host'
    scrape_interval: 15s

  # Frontend (Nginx metrics)
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          service: 'frontend'
          component: 'nginx'
    scrape_interval: 30s

  # Prometheus itself (meta-monitoring)
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'
          component: 'prometheus'

3.2 Docker Compose for Monitoring Stack

File: docker-compose.monitoring.yml

version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: erp-prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: always

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: erp-alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: always

  grafana:
    image: grafana/grafana:10.1.0
    container_name: erp-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel
      - GF_SERVER_ROOT_URL=https://grafana.erp-generic.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=${SMTP_HOST}:${SMTP_PORT}
      - GF_SMTP_USER=${SMTP_USER}
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3001:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus
    restart: always

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.14.0
    container_name: erp-postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring
      - erp-network
    restart: always

  redis-exporter:
    image: oliver006/redis_exporter:v1.54.0
    container_name: erp-redis-exporter
    environment:
      REDIS_ADDR: "redis:6379"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    ports:
      - "9121:9121"
    networks:
      - monitoring
      - erp-network
    restart: always

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: erp-node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: always

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    name: erp-monitoring
  erp-network:
    external: true
    name: erp-network-internal

3.3 Backend Metrics Instrumentation

File: backend/src/common/metrics/metrics.module.ts

import { Module } from '@nestjs/common';
import { PrometheusModule } from '@willsoto/nestjs-prometheus';
import { MetricsService } from './metrics.service';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: {
        enabled: true,
        config: {
          prefix: 'erp_',
        },
      },
    }),
  ],
  providers: [MetricsService],
  exports: [MetricsService],
})
export class MetricsModule {}

File: backend/src/common/metrics/metrics.service.ts

import { Injectable } from '@nestjs/common';
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

@Injectable()
export class MetricsService {
  private registry: Registry;

  // HTTP Metrics
  private httpRequestDuration: Histogram;
  private httpRequestTotal: Counter;
  private httpRequestErrors: Counter;

  // Database Metrics
  private dbQueryDuration: Histogram;
  private dbConnectionsActive: Gauge;
  private dbQueryErrors: Counter;

  // Business Metrics
  private salesOrdersCreated: Counter;
  private purchaseOrdersCreated: Counter;
  private invoicesGenerated: Counter;
  private inventoryMovements: Counter;

  // Cache Metrics
  private cacheHits: Counter;
  private cacheMisses: Counter;

  // Authentication Metrics
  private loginAttempts: Counter;
  private loginFailures: Counter;
  private activeUsers: Gauge;

  constructor() {
    // NOTE: metrics are registered on a dedicated Registry. The /metrics
    // endpoint registered by PrometheusModule serves prom-client's default
    // registry, so this registry must be wired in (or exposed via getMetrics()).
    this.registry = new Registry();
    this.initializeMetrics();
  }

  private initializeMetrics() {
    // HTTP Request Duration
    this.httpRequestDuration = new Histogram({
      name: 'erp_http_request_duration_seconds',
      help: 'Duration of HTTP requests in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
    });

    // HTTP Request Total
    this.httpRequestTotal = new Counter({
      name: 'erp_http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
    });

    // HTTP Request Errors
    this.httpRequestErrors = new Counter({
      name: 'erp_http_request_errors_total',
      help: 'Total number of HTTP request errors',
      labelNames: ['method', 'route', 'error_type'],
    });

    // Database Query Duration
    this.dbQueryDuration = new Histogram({
      name: 'erp_db_query_duration_seconds',
      help: 'Duration of database queries in seconds',
      labelNames: ['operation', 'table'],
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2],
    });

    // Database Active Connections
    this.dbConnectionsActive = new Gauge({
      name: 'erp_db_connections_active',
      help: 'Number of active database connections',
    });

    // Database Query Errors
    this.dbQueryErrors = new Counter({
      name: 'erp_db_query_errors_total',
      help: 'Total number of database query errors',
      labelNames: ['operation', 'error_type'],
    });

    // Business Metrics - Sales Orders
    this.salesOrdersCreated = new Counter({
      name: 'erp_sales_orders_created_total',
      help: 'Total number of sales orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Purchase Orders
    this.purchaseOrdersCreated = new Counter({
      name: 'erp_purchase_orders_created_total',
      help: 'Total number of purchase orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Invoices
    this.invoicesGenerated = new Counter({
      name: 'erp_invoices_generated_total',
      help: 'Total number of invoices generated',
      labelNames: ['tenant_id', 'type'],
    });

    // Business Metrics - Inventory Movements
    this.inventoryMovements = new Counter({
      name: 'erp_inventory_movements_total',
      help: 'Total number of inventory movements',
      labelNames: ['tenant_id', 'type'],
    });

    // Cache Hits
    this.cacheHits = new Counter({
      name: 'erp_cache_hits_total',
      help: 'Total number of cache hits',
      labelNames: ['cache_key'],
    });

    // Cache Misses
    this.cacheMisses = new Counter({
      name: 'erp_cache_misses_total',
      help: 'Total number of cache misses',
      labelNames: ['cache_key'],
    });

    // Login Attempts
    this.loginAttempts = new Counter({
      name: 'erp_login_attempts_total',
      help: 'Total number of login attempts',
      labelNames: ['tenant_id', 'method'],
    });

    // Login Failures
    this.loginFailures = new Counter({
      name: 'erp_login_failures_total',
      help: 'Total number of failed login attempts',
      labelNames: ['tenant_id', 'reason'],
    });

    // Active Users
    this.activeUsers = new Gauge({
      name: 'erp_active_users',
      help: 'Number of currently active users',
      labelNames: ['tenant_id'],
    });

    // Register all metrics
    this.registry.registerMetric(this.httpRequestDuration);
    this.registry.registerMetric(this.httpRequestTotal);
    this.registry.registerMetric(this.httpRequestErrors);
    this.registry.registerMetric(this.dbQueryDuration);
    this.registry.registerMetric(this.dbConnectionsActive);
    this.registry.registerMetric(this.dbQueryErrors);
    this.registry.registerMetric(this.salesOrdersCreated);
    this.registry.registerMetric(this.purchaseOrdersCreated);
    this.registry.registerMetric(this.invoicesGenerated);
    this.registry.registerMetric(this.inventoryMovements);
    this.registry.registerMetric(this.cacheHits);
    this.registry.registerMetric(this.cacheMisses);
    this.registry.registerMetric(this.loginAttempts);
    this.registry.registerMetric(this.loginFailures);
    this.registry.registerMetric(this.activeUsers);
  }

  // Public methods to record metrics
  recordHttpRequest(method: string, route: string, statusCode: number, duration: number) {
    this.httpRequestDuration.observe({ method, route, status_code: statusCode }, duration);
    this.httpRequestTotal.inc({ method, route, status_code: statusCode });
  }

  recordHttpError(method: string, route: string, errorType: string) {
    this.httpRequestErrors.inc({ method, route, error_type: errorType });
  }

  recordDbQuery(operation: string, table: string, duration: number) {
    this.dbQueryDuration.observe({ operation, table }, duration);
  }

  recordDbError(operation: string, errorType: string) {
    this.dbQueryErrors.inc({ operation, error_type: errorType });
  }

  setDbConnectionsActive(count: number) {
    this.dbConnectionsActive.set(count);
  }

  recordSalesOrder(tenantId: string, status: string) {
    this.salesOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordPurchaseOrder(tenantId: string, status: string) {
    this.purchaseOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordInvoice(tenantId: string, type: string) {
    this.invoicesGenerated.inc({ tenant_id: tenantId, type });
  }

  recordInventoryMovement(tenantId: string, type: string) {
    this.inventoryMovements.inc({ tenant_id: tenantId, type });
  }

  recordCacheHit(key: string) {
    this.cacheHits.inc({ cache_key: key });
  }

  recordCacheMiss(key: string) {
    this.cacheMisses.inc({ cache_key: key });
  }

  recordLoginAttempt(tenantId: string, method: string) {
    this.loginAttempts.inc({ tenant_id: tenantId, method });
  }

  recordLoginFailure(tenantId: string, reason: string) {
    this.loginFailures.inc({ tenant_id: tenantId, reason });
  }

  setActiveUsers(tenantId: string, count: number) {
    this.activeUsers.set({ tenant_id: tenantId }, count);
  }

  getMetrics(): Promise<string> {
    // prom-client's Registry.metrics() is async (returns Promise<string>)
    return this.registry.metrics();
  }
}
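As a usage sketch, a repository call can be wrapped so its duration and failures always reach recordDbQuery / recordDbError. The timedQuery helper and the DbMetricsRecorder type are illustrative, not part of the service above:

```typescript
// Minimal structural type matching the MetricsService methods used here.
type DbMetricsRecorder = {
  recordDbQuery(operation: string, table: string, duration: number): void;
  recordDbError(operation: string, errorType: string): void;
};

// Wraps an async DB call: records the error type on failure, and always
// records the elapsed time in seconds (matching the histogram buckets).
async function timedQuery<T>(
  metrics: DbMetricsRecorder,
  operation: string,
  table: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } catch (err) {
    metrics.recordDbError(operation, (err as Error).name || 'UnknownError');
    throw err;
  } finally {
    metrics.recordDbQuery(operation, table, (Date.now() - start) / 1000);
  }
}
```

A service method would then call, e.g., `timedQuery(this.metricsService, 'SELECT', 'products', () => repo.find())`, keeping the timing logic out of each query site.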

File: backend/src/common/interceptors/metrics.interceptor.ts

import { Injectable, NestInterceptor, ExecutionContext, CallHandler } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from '../metrics/metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private metricsService: MetricsService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const startTime = Date.now();

    return next.handle().pipe(
      tap({
        next: () => {
          const response = context.switchToHttp().getResponse();
          const duration = (Date.now() - startTime) / 1000; // Convert to seconds

          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode,
            duration,
          );
        },
        error: (error) => {
          const duration = (Date.now() - startTime) / 1000;
          const response = context.switchToHttp().getResponse();

          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode || 500,
            duration,
          );

          this.metricsService.recordHttpError(
            request.method,
            request.route?.path || request.url,
            error.name || 'UnknownError',
          );
        },
      }),
    );
  }
}

4. GRAFANA DASHBOARDS

4.1 Dashboard Provisioning

File: grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      queryTimeout: "60s"
      httpMethod: "POST"

File: grafana/provisioning/dashboards/dashboard-provider.yml

apiVersion: 1

providers:
  - name: 'ERP Generic Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

4.2 Dashboard 1: Application Performance

File: grafana/dashboards/application-performance.json (Simplified structure)

{
  "dashboard": {
    "title": "ERP Generic - Application Performance",
    "tags": ["erp", "application", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m])) * 1000",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 300, "color": "yellow" },
          { "value": 500, "color": "red" }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_request_errors_total[5m]) / rate(erp_http_requests_total[5m]) * 100",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      },
      {
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, avg by (route) (erp_http_request_duration_seconds))",
            "format": "table"
          }
        ]
      },
      {
        "title": "Active Users by Tenant",
        "type": "graph",
        "targets": [
          {
            "expr": "erp_active_users",
            "legendFormat": "{{tenant_id}}"
          }
        ]
      },
      {
        "title": "Cache Hit Rate (%)",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(erp_cache_hits_total[5m]) / (rate(erp_cache_hits_total[5m]) + rate(erp_cache_misses_total[5m])) * 100"
          }
        ]
      }
    ]
  }
}

Key Panels:

  1. Request Rate: Total requests per second (by method and route)
  2. P95 Latency: 95th percentile response time (threshold: 300ms yellow, 500ms red)
  3. Error Rate: Percentage of failed requests (threshold: 1% yellow, 5% red)
  4. Top 10 Slowest Endpoints: Identify performance bottlenecks
  5. Active Users by Tenant: Real-time active user count per tenant
  6. Cache Hit Rate: Percentage of cache hits (target: >80%)

4.3 Dashboard 2: Database Performance

Key Panels:

  1. Database Connections: Active vs. max connections
  2. Query Duration P95: 95th percentile query time by table
  3. Slow Queries: Queries taking >1 second
  4. Transactions per Second: TPS rate
  5. Database Size: Disk usage by schema
  6. Index Usage: Most and least used indexes
  7. Lock Waits: Blocking queries
  8. Replication Lag: Lag between primary and replicas (if applicable)

Example Queries:

# Active connections
pg_stat_database_numbackends{datname="erp_generic"}

# Slow queries (mean execution time > 1s; pg_stat_statements_mean_exec_time is already a mean, in ms)
pg_stat_statements_mean_exec_time{datname="erp_generic"} > 1000

# Database size
pg_database_size_bytes{datname="erp_generic"}

# TPS
rate(pg_stat_database_xact_commit{datname="erp_generic"}[5m]) + rate(pg_stat_database_xact_rollback{datname="erp_generic"}[5m])

4.4 Dashboard 3: Business Metrics

Key Panels:

  1. Sales Orders Created (Today): Total sales orders by status
  2. Purchase Orders Created (Today): Total purchase orders by status
  3. Revenue Trend (Last 30 days): Daily revenue by tenant
  4. Invoices Generated (Today): Total invoices by type (customer/supplier)
  5. Inventory Movements (Today): Stock in/out movements
  6. Top 10 Customers by Revenue: Revenue breakdown
  7. Order Fulfillment Rate: Percentage of orders fulfilled on time
  8. Average Order Value: Mean order value by tenant

Example Queries:

# Sales orders created today
increase(erp_sales_orders_created_total[1d])

# Revenue trend (requires custom metric; daily increase, not per-second rate)
sum by (tenant_id) (increase(erp_sales_order_amount_sum[1d]))

# Top 10 customers by revenue
topk(10, sum by (customer_id) (erp_sales_order_amount_sum))

5. ALERT RULES

5.1 Alertmanager Configuration

File: alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: '${SMTP_HOST}:${SMTP_PORT}'
  smtp_from: 'alertmanager@erp-generic.com'
  smtp_auth_username: '${SMTP_USER}'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Route alerts to different receivers
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    # Critical alerts → PagerDuty (on-call)
    - receiver: 'pagerduty'
      match:
        severity: critical
      continue: true

    # All alerts → Slack
    - receiver: 'slack'
      match_re:
        severity: critical|warning

    # Database alerts → DBA team
    - receiver: 'dba-email'
      match:
        component: postgres

    # Security alerts → Security team
    - receiver: 'security-email'
      match_re:
        alertname: '.*Security.*'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

receivers:
  - name: 'default'
    email_configs:
      - to: 'devops@erp-generic.com'
        headers:
          Subject: '[ERP Alert] {{ .GroupLabels.alertname }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'

  - name: 'slack'
    slack_configs:
      - channel: '#erp-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  - name: 'dba-email'
    email_configs:
      - to: 'dba@erp-generic.com'
        headers:
          Subject: '[Database Alert] {{ .GroupLabels.alertname }}'

  - name: 'security-email'
    email_configs:
      - to: 'security@erp-generic.com'
        headers:
          Subject: '[SECURITY ALERT] {{ .GroupLabels.alertname }}'
          Priority: 'urgent'

5.2 Application Alert Rules

File: prometheus/alerts/application.yml

groups:
  - name: erp_application_alerts
    interval: 30s
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (rate(erp_http_request_errors_total[5m]) / rate(erp_http_requests_total[5m])) > 0.05          
        for: 5m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "High error rate detected on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-error-rate"

      # High P95 Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m])) > 0.5          
        for: 10m
        labels:
          severity: warning
          component: backend
        annotations:
          summary: "High P95 latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s (threshold: 500ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-latency"

      # Service Down
      - alert: ServiceDown
        expr: up{job="erp-backend"} == 0
        for: 2m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "Backend service is down"
          description: "Backend {{ $labels.instance }} has been down for more than 2 minutes"
          runbook: "https://wiki.erp-generic.com/runbooks/service-down"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80          
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85          
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }} (threshold: 85%)"

      # Disk Space Low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15          
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} free"

      # Too Many Requests (DDoS protection)
      - alert: TooManyRequests
        expr: |
          rate(erp_http_requests_total[1m]) > 10000          
        for: 2m
        labels:
          severity: critical
          component: security
        annotations:
          summary: "Abnormally high request rate detected"
          description: "Request rate is {{ $value }} req/s (threshold: 10000 req/s). Possible DDoS attack."
          runbook: "https://wiki.erp-generic.com/runbooks/ddos-attack"

      # Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: |
          (rate(erp_cache_hits_total[5m]) / (rate(erp_cache_hits_total[5m]) + rate(erp_cache_misses_total[5m]))) < 0.6          
        for: 15m
        labels:
          severity: warning
          component: cache
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }} (threshold: 60%)"
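
The `LowCacheHitRate` expression is simply `hits / (hits + misses)` held against a 60% floor. For reference, the same check as a minimal TypeScript sketch (function names are illustrative, not part of the codebase):

```typescript
// Cache hit ratio from raw counters, flagged against the same 60%
// threshold used by the LowCacheHitRate alert above.
function cacheHitRatio(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 1 : hits / total; // treat an idle cache as healthy
}

function isCacheHealthy(hits: number, misses: number, threshold = 0.6): boolean {
  return cacheHitRatio(hits, misses) >= threshold;
}

console.log(cacheHitRatio(900, 100)); // 0.9
console.log(isCacheHealthy(500, 500)); // false (50% is below the 60% threshold)
```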

### 5.3 Database Alert Rules

File: `prometheus/alerts/database.yml`

```yaml
groups:
  - name: erp_database_alerts
    interval: 30s
    rules:
      # Database Down
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://wiki.erp-generic.com/runbooks/database-down"

      # Connection Pool Exhausted
      - alert: ConnectionPoolExhausted
        expr: |
          (pg_stat_database_numbackends / on (instance) group_left pg_settings_max_connections) > 0.9
        for: 2m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database connection pool almost exhausted"
          description: "{{ $labels.datname }} is using {{ $value | humanizePercentage }} of max connections"
          runbook: "https://wiki.erp-generic.com/runbooks/connection-pool-exhausted"

      # Slow Queries
      - alert: SlowQueries
        expr: |
          avg(pg_stat_statements_mean_exec_time) > 1000
        for: 10m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Slow database queries detected"
          description: "Mean query execution time is {{ $value }}ms (threshold: 1000ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/slow-queries"

      # High Number of Deadlocks
      - alert: HighDeadlocks
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 5
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "High number of database deadlocks"
          description: "Deadlock rate is {{ $value }}/s (threshold: 5/s)"

      # Replication Lag (if using replicas)
      - alert: ReplicationLag
        expr: |
          pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Database replication lag is high"
          description: "Replication lag is {{ $value }}s (threshold: 60s)"

      # Disk Space Low (Database)
      - alert: DatabaseDiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) < 0.15
        for: 5m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database disk space is low"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          runbook: "https://wiki.erp-generic.com/runbooks/database-disk-full"
```

### 5.4 Business Alert Rules

File: `prometheus/alerts/business.yml`

```yaml
groups:
  - name: erp_business_alerts
    interval: 1m
    rules:
      # No Sales Orders Created (Business Hours)
      # Note: hour() evaluates in UTC; adjust the window to the business timezone.
      - alert: NoSalesOrdersCreated
        expr: |
          sum(increase(erp_sales_orders_created_total[1h])) == 0
          and on() hour() >= 9 and on() hour() < 18
        for: 1h
        labels:
          severity: warning
          component: business
        annotations:
          summary: "No sales orders created in the last hour during business hours"
          description: "This might indicate a problem with the order creation system"

      # High Order Cancellation Rate
      - alert: HighOrderCancellationRate
        expr: |
          (sum(rate(erp_sales_orders_created_total{status="cancelled"}[1h])) / sum(rate(erp_sales_orders_created_total[1h]))) > 0.2
        for: 30m
        labels:
          severity: warning
          component: business
        annotations:
          summary: "High order cancellation rate"
          description: "{{ $value | humanizePercentage }} of orders are being cancelled (threshold: 20%)"

      # Failed Login Spike
      - alert: FailedLoginSpike
        expr: |
          sum(rate(erp_login_failures_total[5m])) > 10
        for: 5m
        labels:
          severity: warning
          component: security
        annotations:
          summary: "Spike in failed login attempts"
          description: "{{ $value }} failed logins per second (threshold: 10/s). Possible brute-force attack."
          runbook: "https://wiki.erp-generic.com/runbooks/brute-force-attack"
```

## 6. LOGGING STRATEGY

### 6.1 Winston Configuration

File: `backend/src/common/logger/logger.service.ts`

```typescript
import { Injectable, LoggerService as NestLoggerService } from '@nestjs/common';
import * as winston from 'winston';
import 'winston-daily-rotate-file';

@Injectable()
export class LoggerService implements NestLoggerService {
  private logger: winston.Logger;

  constructor() {
    this.logger = winston.createLogger({
      level: process.env.LOG_LEVEL || 'info',
      format: winston.format.combine(
        winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }),
        winston.format.errors({ stack: true }),
        winston.format.splat(),
        winston.format.json(),
      ),
      defaultMeta: {
        service: 'erp-generic-backend',
        environment: process.env.NODE_ENV,
      },
      transports: [
        // Console transport (for development)
        new winston.transports.Console({
          format: winston.format.combine(
            winston.format.colorize(),
            winston.format.printf(({ timestamp, level, message, context, ...meta }) => {
              return `${timestamp} [${level}] [${context || 'Application'}] ${message} ${
                Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
              }`;
            }),
          ),
        }),

        // File transport - All logs
        new winston.transports.DailyRotateFile({
          filename: 'logs/application-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '14d',
          zippedArchive: true,
        }),

        // File transport - Error logs only
        new winston.transports.DailyRotateFile({
          level: 'error',
          filename: 'logs/error-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '30d',
          zippedArchive: true,
        }),

        // File transport - Audit logs (security events only; the filter below
        // keeps ordinary application logs out of this file)
        new winston.transports.DailyRotateFile({
          format: winston.format.combine(
            winston.format((info) => (info.message === 'AUDIT_EVENT' ? info : false))(),
            winston.format.json(),
          ),
          filename: 'logs/audit-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '50m',
          maxFiles: '90d', // Keep for 90 days (compliance)
          zippedArchive: true,
        }),
      ],
    });

    // Add Elasticsearch/Loki transport for production
    if (process.env.NODE_ENV === 'production') {
      // Example: Winston-Elasticsearch
      // this.logger.add(new WinstonElasticsearch({
      //   level: 'info',
      //   clientOpts: {
      //     node: process.env.ELASTICSEARCH_URL,
      //     auth: {
      //       username: process.env.ELASTICSEARCH_USER,
      //       password: process.env.ELASTICSEARCH_PASSWORD,
      //     },
      //   },
      //   index: 'erp-generic-logs',
      // }));
    }
  }

  log(message: string, context?: string, meta?: any) {
    this.logger.info(message, { context, ...meta });
  }

  error(message: string, trace?: string, context?: string, meta?: any) {
    this.logger.error(message, { trace, context, ...meta });
  }

  warn(message: string, context?: string, meta?: any) {
    this.logger.warn(message, { context, ...meta });
  }

  debug(message: string, context?: string, meta?: any) {
    this.logger.debug(message, { context, ...meta });
  }

  verbose(message: string, context?: string, meta?: any) {
    this.logger.verbose(message, { context, ...meta });
  }

  // Audit logging (security-sensitive events)
  audit(event: string, userId: string, tenantId: string, details: any) {
    this.logger.info('AUDIT_EVENT', {
      event,
      userId,
      tenantId,
      details,
      timestamp: new Date().toISOString(),
      ip: details.ip,
      userAgent: details.userAgent,
    });
  }
}
```

### 6.2 Structured Logging Examples

```typescript
// Login attempt
logger.audit('USER_LOGIN', userId, tenantId, {
  method: 'email',
  ip: request.ip,
  userAgent: request.headers['user-agent'],
  success: true,
});

// Database query
logger.debug('DB_QUERY', 'DatabaseService', {
  operation: 'SELECT',
  table: 'auth.users',
  duration: 45, // ms
  rowCount: 1,
});

// API request
logger.log('HTTP_REQUEST', 'HttpMiddleware', {
  method: 'POST',
  path: '/api/sales/orders',
  statusCode: 201,
  duration: 234, // ms
  userId: '123e4567-e89b-12d3-a456-426614174000',
  tenantId: 'tenant-abc',
});

// Error with stack trace
logger.error('ORDER_CREATION_FAILED', error.stack, 'OrderService', {
  orderId: '123',
  tenantId: 'tenant-abc',
  error: error.message,
});
```

### 6.3 Log Aggregation (ELK Stack)

Docker Compose for the ELK stack:

```yaml
version: '3.9'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    container_name: erp-elasticsearch
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
      - xpack.security.enabled=false
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - monitoring
    restart: always

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    container_name: erp-logstash
    volumes:
      - ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
    ports:
      - "5044:5044"
    environment:
      LS_JAVA_OPTS: "-Xmx512m -Xms512m"
    networks:
      - monitoring
    depends_on:
      - elasticsearch
    restart: always

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    container_name: erp-kibana
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
    networks:
      - monitoring
    depends_on:
      - elasticsearch
    restart: always

volumes:
  elasticsearch_data:

networks:
  monitoring:
    external: true
    name: erp-monitoring
```

Logstash configuration (`logstash/logstash.conf`):

```conf
input {
  file {
    path => "/var/log/erp-generic/application-*.log"
    type => "application"
    codec => json
    start_position => "beginning"
  }

  file {
    path => "/var/log/erp-generic/error-*.log"
    type => "error"
    codec => json
    start_position => "beginning"
  }

  file {
    path => "/var/log/erp-generic/audit-*.log"
    type => "audit"
    codec => json
    start_position => "beginning"
  }
}

filter {
  # Parse timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  # Add geoip for IP addresses
  if [ip] {
    geoip {
      source => "ip"
      target => "geoip"
    }
  }

  # Extract tenant_id as a field
  if [tenantId] {
    mutate {
      add_field => { "tenant" => "%{tenantId}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "erp-generic-logs-%{+YYYY.MM.dd}"
  }

  # Debug output (optional)
  stdout {
    codec => rubydebug
  }
}
```

## 7. APPLICATION PERFORMANCE MONITORING (APM)

### 7.1 Custom Metrics Endpoints

File: `backend/src/metrics/metrics.controller.ts`

```typescript
import { Controller, Get } from '@nestjs/common';
import { MetricsService } from '../common/metrics/metrics.service';
import { PrismaService } from '../common/prisma/prisma.service';

@Controller('metrics')
export class MetricsController {
  constructor(
    private metricsService: MetricsService,
    private prisma: PrismaService,
  ) {}

  @Get()
  getMetrics() {
    return this.metricsService.getMetrics();
  }

  @Get('business')
  async getBusinessMetrics() {
    // Aggregate business metrics from database
    const [salesOrders, purchaseOrders, invoices, activeUsers] = await Promise.all([
      this.prisma.salesOrder.count(),
      this.prisma.purchaseOrder.count(),
      this.prisma.invoice.count(),
      this.prisma.user.count({ where: { status: 'active' } }),
    ]);

    return {
      sales_orders_total: salesOrders,
      purchase_orders_total: purchaseOrders,
      invoices_total: invoices,
      active_users_total: activeUsers,
    };
  }
}
```
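
Note that Prometheus scrapes the plain-text exposition format rather than JSON, so if the `/metrics/business` counts are meant to be scraped directly they must be rendered as gauges. A minimal sketch of that rendering (the helper is illustrative, not part of the codebase):

```typescript
// Render a flat map of business counts as Prometheus text exposition
// format: a "# TYPE <name> gauge" line followed by the sample line.
function toExposition(counts: Record<string, number>): string {
  return (
    Object.entries(counts)
      .map(([name, value]) => `# TYPE ${name} gauge\n${name} ${value}`)
      .join('\n') + '\n'
  );
}

console.log(toExposition({ sales_orders_total: 42, invoices_total: 7 }));
```

In practice a library such as `prom-client` can register these values as real gauges and take care of the formatting.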

### 7.2 Performance Profiling

Prisma query logging:

```typescript
// prisma/prisma.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { LoggerService } from '../logger/logger.service';

@Injectable()
export class PrismaService extends PrismaClient implements OnModuleInit {
  constructor(private logger: LoggerService) {
    super({
      log: [
        { emit: 'event', level: 'query' },
        { emit: 'event', level: 'error' },
        { emit: 'event', level: 'warn' },
      ],
    });

    // Log slow queries (>100ms)
    this.$on('query' as never, (e: any) => {
      if (e.duration > 100) {
        this.logger.warn('SLOW_QUERY', 'PrismaService', {
          query: e.query,
          duration: e.duration,
          params: e.params,
        });
      }
    });

    // Log query errors
    this.$on('error' as never, (e: any) => {
      this.logger.error('DB_ERROR', e.message, 'PrismaService', {
        target: e.target,
      });
    });
  }

  async onModuleInit() {
    await this.$connect();
  }
}
```

## 8. HEALTH CHECKS

### 8.1 Health Check Endpoints

```typescript
// health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  PrismaHealthIndicator,
  MemoryHealthIndicator,
  DiskHealthIndicator,
} from '@nestjs/terminus';
import { RedisHealthIndicator } from './redis.health';
import { PrismaService } from '../common/prisma/prisma.service';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: PrismaHealthIndicator,
    private prisma: PrismaService,
    private redis: RedisHealthIndicator,
    private memory: MemoryHealthIndicator,
    private disk: DiskHealthIndicator,
  ) {}

  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // PrismaHealthIndicator.pingCheck requires the Prisma client instance
      () => this.db.pingCheck('database', this.prisma, { timeout: 3000 }),
      () => this.redis.isHealthy('redis'),
      () => this.memory.checkHeap('memory_heap', 200 * 1024 * 1024),
      () => this.disk.checkStorage('disk', { path: '/', thresholdPercent: 0.9 }),
    ]);
  }

  @Get('live')
  liveness() {
    return { status: 'ok', timestamp: new Date().toISOString() };
  }

  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', this.prisma),
      () => this.redis.isHealthy('redis'),
    ]);
  }
}
```
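
If the backend runs on Kubernetes, the `/health/live` and `/health/ready` endpoints map directly onto container probes. A sketch, assuming the container listens on port 3000 and the routes have no global prefix:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
```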

## 9. DISTRIBUTED TRACING

### 9.1 OpenTelemetry Setup

```typescript
// tracing.ts (bootstrap file; import this before the NestJS app is created
// so auto-instrumentation can patch the libraries it wraps)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
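
The exporter above needs a Jaeger collector listening on the configured endpoint. A sketch of an all-in-one Jaeger service for the monitoring compose file (image tag is illustrative; ports are the upstream defaults):

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:1.50
    container_name: erp-jaeger
    ports:
      - "16686:16686" # Web UI
      - "14268:14268" # Collector HTTP endpoint used by JAEGER_ENDPOINT above
    networks:
      - monitoring
    restart: always
```

Note that newer OpenTelemetry releases deprecate `@opentelemetry/exporter-jaeger` in favor of OTLP exporters; recent Jaeger versions ingest OTLP natively (ports 4317/4318).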

## 10. ON-CALL & INCIDENT RESPONSE

### 10.1 On-Call Rotation

- **Primary On-Call**: DevOps Engineer (24/7)
- **Secondary On-Call**: Backend Lead
- **Escalation Path**: CTO → CEO

### 10.2 Incident Severity

| Severity | Response Time | Examples |
|----------|---------------|----------|
| P0 (Critical) | 15 min | System down, data loss |
| P1 (High) | 1 hour | Major feature broken |
| P2 (Medium) | 4 hours | Minor feature broken |
| P3 (Low) | 24 hours | Cosmetic issue |
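
The `severity` labels attached to the alert rules above can drive this escalation automatically. A sketch of an Alertmanager routing tree under that assumption (receiver names and credentials are placeholders):

```yaml
route:
  receiver: slack-default
  group_by: ['alertname', 'component']
  routes:
    - matchers:
        - severity = "critical" # P0/P1: page the on-call engineer
      receiver: pagerduty-oncall
      repeat_interval: 1h
    - matchers:
        - severity = "warning" # P2/P3: notify the team channel
      receiver: slack-default
      repeat_interval: 4h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-routing-key>
  - name: slack-default
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: '#erp-alerts'
```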

## 11. REFERENCES

---

**Document**: MONITORING-OBSERVABILITY.md · **Version**: 1.0 · **Total Pages**: ~18 · **Last Updated**: 2025-11-24