erp-core/docs/07-devops/MONITORING-OBSERVABILITY.md


# MONITORING & OBSERVABILITY - ERP Generic
**Last updated:** 2025-11-24
**Owner:** DevOps Team / SRE Team
**Status:** ✅ Production-Ready
---
## TABLE OF CONTENTS
1. [Overview](#1-overview)
2. [Observability Pillars](#2-observability-pillars)
3. [Prometheus Setup](#3-prometheus-setup)
4. [Grafana Dashboards](#4-grafana-dashboards)
5. [Alert Rules](#5-alert-rules)
6. [Logging Strategy](#6-logging-strategy)
7. [Application Performance Monitoring (APM)](#7-application-performance-monitoring-apm)
8. [Health Checks](#8-health-checks)
9. [Distributed Tracing](#9-distributed-tracing)
10. [On-Call & Incident Response](#10-on-call--incident-response)
11. [References](#11-references)
---
## 1. OVERVIEW
### 1.1 Monitoring Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Backend │ │ Frontend │ │ Postgres │ │ Redis │ │
│ │ (Metrics)│ │ (Metrics)│ │(Exporter)│ │(Exporter)│ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
└────────────────────────────┼─────────────────────────────────────────┘
│ (Scrape metrics every 15s)
┌─────────────────────────────────────────────────────────────────────┐
│ Prometheus (TSDB) │
│ - Collects metrics from all targets │
│ - Evaluates alert rules │
│ - Stores time-series data (15 days retention) │
└────────┬────────────────────────────────┬─────────────────────────┘
│ │
│ (Query metrics) │ (Send alerts)
↓ ↓
┌─────────────────────┐ ┌──────────────────────┐
│ Grafana │ │ Alertmanager │
│ - Dashboards │ │ - Route alerts │
│ - Visualization │ │ - Deduplication │
│ - Alerting │ │ - Silencing │
└─────────────────────┘ └──────┬───────────────┘
┌───────────────────┼────────────────┐
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ PagerDuty│ │ Slack │ │ Email │
│(On-call) │ │(#alerts) │ │(Team) │
└──────────┘ └──────────┘ └──────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Logging Pipeline │
│ │
│ Application → Winston → ELK Stack / Loki │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Logs │ ───→ │ Elasticsearch│ ───→ │ Kibana │ │
│ │(JSON) │ │ or Loki │ │(Search) │ │
│ └──────────┘ └──────────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Distributed Tracing │
│ │
│ Application → OpenTelemetry → Jaeger / Tempo │
│ (Trace spans for requests across microservices) │
└─────────────────────────────────────────────────────────────────────┘
```
### 1.2 Observability Goals
**Why Observability?**
- **Proactive Monitoring:** Detect issues before users report them
- **Faster Debugging:** Reduce MTTD (Mean Time to Detect) from hours to minutes
- **Performance Optimization:** Identify bottlenecks and slow queries
- **Capacity Planning:** Predict when to scale resources
- **SLA Compliance:** Monitor uptime, response times, error rates
**Key Metrics (Google's Four Golden Signals):**
1. **Latency:** Request/response time (p50, p95, p99)
2. **Traffic:** Requests per second (throughput)
3. **Errors:** Error rate (5xx responses, exceptions)
4. **Saturation:** Resource utilization (CPU, memory, disk, DB connections)
**SLOs (Service Level Objectives):**
- **Availability:** 99.9% uptime (at most 8.76 hours of downtime per year)
- **Latency:** p95 API response time < 300ms
- **Error Rate:** < 0.1% (this is the error budget)
- **Data Durability:** Zero data loss
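The availability figure maps directly to a downtime budget. A minimal sketch of the arithmetic (the 365-day year and function name are assumptions for illustration):

```typescript
// Downtime budget implied by an availability SLO, assuming a 365-day year.
// 99.9% leaves 0.1% of 8760 hours ≈ 8.76 hours/year, matching the SLO above.
function downtimeBudgetHours(sloPercent: number, hoursPerYear = 365 * 24): number {
  return ((100 - sloPercent) / 100) * hoursPerYear;
}

console.log(downtimeBudgetHours(99.9).toFixed(2));  // "8.76"
console.log(downtimeBudgetHours(99.99).toFixed(2)); // "0.88"
```

Tightening the SLO by one nine (99.99%) shrinks the yearly budget to under an hour, which is why the target should match what the team can actually defend.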
---
## 2. OBSERVABILITY PILLARS
### 2.1 The Three Pillars
**1. Metrics (What is happening?)**
- Quantitative measurements over time
- Examples: CPU usage, request count, response time
- Tool: Prometheus + Grafana
**2. Logs (What happened?)**
- Discrete events with context
- Examples: "User X logged in", "Query took 2.5s"
- Tool: Winston + ELK Stack / Loki
**3. Traces (Why did it happen?)**
- Request flow across services
- Examples: API call → Database query → Redis cache → Response
- Tool: OpenTelemetry + Jaeger
### 2.2 Correlation
```
Example: High p99 latency alert
├── Metrics: p99 latency = 3s (threshold: 500ms)
│ └── Which endpoint? /api/products
├── Logs: Search for slow queries in /api/products
│ └── Found: SELECT * FROM inventory.stock_movements (2.8s)
└── Traces: Trace ID abc123 shows:
├── API handler: 50ms
├── Database query: 2800ms ← Bottleneck!
└── Response serialization: 150ms
Root cause: Missing index on inventory.stock_movements(product_id)
Fix: CREATE INDEX idx_stock_movements_product_id ON inventory.stock_movements(product_id);
```
---
## 3. PROMETHEUS SETUP
### 3.1 Prometheus Configuration
**File:** `prometheus/prometheus.yml`
```yaml
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds
  scrape_timeout: 10s
  external_labels:
    cluster: 'erp-generic-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s

# Load alert rules
rule_files:
  - '/etc/prometheus/alerts/application.yml'
  - '/etc/prometheus/alerts/infrastructure.yml'
  - '/etc/prometheus/alerts/database.yml'
  - '/etc/prometheus/alerts/business.yml'

# Scrape configurations
scrape_configs:
  # Backend API (NestJS with Prometheus middleware)
  - job_name: 'erp-backend'
    static_configs:
      - targets: ['backend:3000']
        labels:
          service: 'backend'
          component: 'api'
    metrics_path: '/metrics'
    scrape_interval: 15s

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          service: 'database'
          component: 'postgres'
    scrape_interval: 30s

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          service: 'cache'
          component: 'redis'
    scrape_interval: 30s

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'infrastructure'
          component: 'host'
    scrape_interval: 15s

  # Frontend (Nginx metrics)
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          service: 'frontend'
          component: 'nginx'
    scrape_interval: 30s

  # Prometheus itself (meta-monitoring)
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'
          component: 'prometheus'
```
### 3.2 Docker Compose for Monitoring Stack
**File:** `docker-compose.monitoring.yml`
```yaml
version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: erp-prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: always

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: erp-alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: always

  grafana:
    image: grafana/grafana:10.1.0
    container_name: erp-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel
      - GF_SERVER_ROOT_URL=https://grafana.erp-generic.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=${SMTP_HOST}:${SMTP_PORT}
      - GF_SMTP_USER=${SMTP_USER}
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3001:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus
    restart: always

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.14.0
    container_name: erp-postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring
      - erp-network
    restart: always

  redis-exporter:
    image: oliver006/redis_exporter:v1.54.0
    container_name: erp-redis-exporter
    environment:
      REDIS_ADDR: "redis:6379"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    ports:
      - "9121:9121"
    networks:
      - monitoring
      - erp-network
    restart: always

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: erp-node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: always

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    name: erp-monitoring
  erp-network:
    external: true
    name: erp-network-internal
```
### 3.3 Backend Metrics Instrumentation
**File:** `backend/src/common/metrics/metrics.module.ts`
```typescript
import { Module } from '@nestjs/common';
import { PrometheusModule } from '@willsoto/nestjs-prometheus';
import { MetricsService } from './metrics.service';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: {
        enabled: true,
        config: {
          prefix: 'erp_',
        },
      },
    }),
  ],
  providers: [MetricsService],
  exports: [MetricsService],
})
export class MetricsModule {}
```
**File:** `backend/src/common/metrics/metrics.service.ts`
```typescript
import { Injectable } from '@nestjs/common';
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

@Injectable()
export class MetricsService {
  // Custom metrics live in a dedicated Registry, exposed via getMetrics().
  // Note: this registry is separate from the default one that
  // PrometheusModule serves; merge registries if the module's /metrics
  // endpoint should also include these metrics.
  private registry: Registry;

  // HTTP Metrics
  private httpRequestDuration!: Histogram;
  private httpRequestTotal!: Counter;
  private httpRequestErrors!: Counter;

  // Database Metrics
  private dbQueryDuration!: Histogram;
  private dbConnectionsActive!: Gauge;
  private dbQueryErrors!: Counter;

  // Business Metrics
  private salesOrdersCreated!: Counter;
  private purchaseOrdersCreated!: Counter;
  private invoicesGenerated!: Counter;
  private inventoryMovements!: Counter;

  // Cache Metrics
  private cacheHits!: Counter;
  private cacheMisses!: Counter;

  // Authentication Metrics
  private loginAttempts!: Counter;
  private loginFailures!: Counter;
  private activeUsers!: Gauge;

  constructor() {
    this.registry = new Registry();
    this.initializeMetrics();
  }

  private initializeMetrics() {
    // HTTP Request Duration
    this.httpRequestDuration = new Histogram({
      name: 'erp_http_request_duration_seconds',
      help: 'Duration of HTTP requests in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
    });

    // HTTP Request Total
    this.httpRequestTotal = new Counter({
      name: 'erp_http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
    });

    // HTTP Request Errors
    this.httpRequestErrors = new Counter({
      name: 'erp_http_request_errors_total',
      help: 'Total number of HTTP request errors',
      labelNames: ['method', 'route', 'error_type'],
    });

    // Database Query Duration
    this.dbQueryDuration = new Histogram({
      name: 'erp_db_query_duration_seconds',
      help: 'Duration of database queries in seconds',
      labelNames: ['operation', 'table'],
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2],
    });

    // Database Active Connections
    this.dbConnectionsActive = new Gauge({
      name: 'erp_db_connections_active',
      help: 'Number of active database connections',
    });

    // Database Query Errors
    this.dbQueryErrors = new Counter({
      name: 'erp_db_query_errors_total',
      help: 'Total number of database query errors',
      labelNames: ['operation', 'error_type'],
    });

    // Business Metrics - Sales Orders
    this.salesOrdersCreated = new Counter({
      name: 'erp_sales_orders_created_total',
      help: 'Total number of sales orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Purchase Orders
    this.purchaseOrdersCreated = new Counter({
      name: 'erp_purchase_orders_created_total',
      help: 'Total number of purchase orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Invoices
    this.invoicesGenerated = new Counter({
      name: 'erp_invoices_generated_total',
      help: 'Total number of invoices generated',
      labelNames: ['tenant_id', 'type'],
    });

    // Business Metrics - Inventory Movements
    this.inventoryMovements = new Counter({
      name: 'erp_inventory_movements_total',
      help: 'Total number of inventory movements',
      labelNames: ['tenant_id', 'type'],
    });

    // Cache Hits
    this.cacheHits = new Counter({
      name: 'erp_cache_hits_total',
      help: 'Total number of cache hits',
      labelNames: ['cache_key'],
    });

    // Cache Misses
    this.cacheMisses = new Counter({
      name: 'erp_cache_misses_total',
      help: 'Total number of cache misses',
      labelNames: ['cache_key'],
    });

    // Login Attempts
    this.loginAttempts = new Counter({
      name: 'erp_login_attempts_total',
      help: 'Total number of login attempts',
      labelNames: ['tenant_id', 'method'],
    });

    // Login Failures
    this.loginFailures = new Counter({
      name: 'erp_login_failures_total',
      help: 'Total number of failed login attempts',
      labelNames: ['tenant_id', 'reason'],
    });

    // Active Users
    this.activeUsers = new Gauge({
      name: 'erp_active_users',
      help: 'Number of currently active users',
      labelNames: ['tenant_id'],
    });

    // Register all metrics
    this.registry.registerMetric(this.httpRequestDuration);
    this.registry.registerMetric(this.httpRequestTotal);
    this.registry.registerMetric(this.httpRequestErrors);
    this.registry.registerMetric(this.dbQueryDuration);
    this.registry.registerMetric(this.dbConnectionsActive);
    this.registry.registerMetric(this.dbQueryErrors);
    this.registry.registerMetric(this.salesOrdersCreated);
    this.registry.registerMetric(this.purchaseOrdersCreated);
    this.registry.registerMetric(this.invoicesGenerated);
    this.registry.registerMetric(this.inventoryMovements);
    this.registry.registerMetric(this.cacheHits);
    this.registry.registerMetric(this.cacheMisses);
    this.registry.registerMetric(this.loginAttempts);
    this.registry.registerMetric(this.loginFailures);
    this.registry.registerMetric(this.activeUsers);
  }

  // Public methods to record metrics
  recordHttpRequest(method: string, route: string, statusCode: number, duration: number) {
    this.httpRequestDuration.observe({ method, route, status_code: statusCode }, duration);
    this.httpRequestTotal.inc({ method, route, status_code: statusCode });
  }

  recordHttpError(method: string, route: string, errorType: string) {
    this.httpRequestErrors.inc({ method, route, error_type: errorType });
  }

  recordDbQuery(operation: string, table: string, duration: number) {
    this.dbQueryDuration.observe({ operation, table }, duration);
  }

  recordDbError(operation: string, errorType: string) {
    this.dbQueryErrors.inc({ operation, error_type: errorType });
  }

  setDbConnectionsActive(count: number) {
    this.dbConnectionsActive.set(count);
  }

  recordSalesOrder(tenantId: string, status: string) {
    this.salesOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordPurchaseOrder(tenantId: string, status: string) {
    this.purchaseOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordInvoice(tenantId: string, type: string) {
    this.invoicesGenerated.inc({ tenant_id: tenantId, type });
  }

  recordInventoryMovement(tenantId: string, type: string) {
    this.inventoryMovements.inc({ tenant_id: tenantId, type });
  }

  recordCacheHit(key: string) {
    this.cacheHits.inc({ cache_key: key });
  }

  recordCacheMiss(key: string) {
    this.cacheMisses.inc({ cache_key: key });
  }

  recordLoginAttempt(tenantId: string, method: string) {
    this.loginAttempts.inc({ tenant_id: tenantId, method });
  }

  recordLoginFailure(tenantId: string, reason: string) {
    this.loginFailures.inc({ tenant_id: tenantId, reason });
  }

  setActiveUsers(tenantId: string, count: number) {
    this.activeUsers.set({ tenant_id: tenantId }, count);
  }

  // prom-client v13+ returns a Promise from Registry#metrics()
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
```
**File:** `backend/src/common/interceptors/metrics.interceptor.ts`
```typescript
import { Injectable, NestInterceptor, ExecutionContext, CallHandler } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from '../metrics/metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private metricsService: MetricsService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const startTime = Date.now();

    return next.handle().pipe(
      tap({
        next: () => {
          const response = context.switchToHttp().getResponse();
          const duration = (Date.now() - startTime) / 1000; // Convert to seconds
          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode,
            duration,
          );
        },
        error: (error) => {
          const duration = (Date.now() - startTime) / 1000;
          const response = context.switchToHttp().getResponse();
          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode || 500,
            duration,
          );
          this.metricsService.recordHttpError(
            request.method,
            request.route?.path || request.url,
            error.name || 'UnknownError',
          );
        },
      }),
    );
  }
}
```
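Outside NestJS the same pattern reduces to a timing wrapper around a handler. A minimal sketch under that assumption (the `Recorder` callback stands in for `MetricsService.recordHttpRequest`; all names here are illustrative):

```typescript
// Time a handler and hand (method, route, status, seconds) to a recorder,
// mirroring what the interceptor above does inside NestJS.
type Recorder = (method: string, route: string, status: number, seconds: number) => void;

async function withHttpMetrics<T>(
  method: string,
  route: string,
  handler: () => Promise<{ status: number; body: T }>,
  record: Recorder,
): Promise<{ status: number; body: T }> {
  const start = Date.now();
  try {
    const res = await handler();
    record(method, route, res.status, (Date.now() - start) / 1000);
    return res;
  } catch (err) {
    // Failed requests are still observed, so error latency shows up in p95/p99.
    record(method, route, 500, (Date.now() - start) / 1000);
    throw err;
  }
}
```

Recording in the `catch` path matters: if errors were dropped, the latency histogram would silently exclude the slowest (failing) requests.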
---
## 4. GRAFANA DASHBOARDS
### 4.1 Dashboard Provisioning
**File:** `grafana/provisioning/datasources/prometheus.yml`
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      queryTimeout: "60s"
      httpMethod: "POST"
```
**File:** `grafana/provisioning/dashboards/dashboard-provider.yml`
```yaml
apiVersion: 1

providers:
  - name: 'ERP Generic Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```
### 4.2 Dashboard 1: Application Performance
**File:** `grafana/dashboards/application-performance.json` (Simplified structure)
```json
{
  "dashboard": {
    "title": "ERP Generic - Application Performance",
    "tags": ["erp", "application", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le, route) (rate(erp_http_request_duration_seconds_bucket[5m]))) * 1000",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 300, "color": "yellow" },
          { "value": 500, "color": "red" }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (route) (rate(erp_http_request_errors_total[5m])) / sum by (route) (rate(erp_http_requests_total[5m])) * 100",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      },
      {
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (route) (rate(erp_http_request_duration_seconds_sum[5m])) / sum by (route) (rate(erp_http_request_duration_seconds_count[5m])))",
            "format": "table"
          }
        ]
      },
      {
        "title": "Active Users by Tenant",
        "type": "graph",
        "targets": [
          {
            "expr": "erp_active_users",
            "legendFormat": "{{tenant_id}}"
          }
        ]
      },
      {
        "title": "Cache Hit Rate (%)",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(erp_cache_hits_total[5m])) / (sum(rate(erp_cache_hits_total[5m])) + sum(rate(erp_cache_misses_total[5m]))) * 100"
          }
        ]
      }
    ]
  }
}
```
**Key Panels:**
1. **Request Rate:** Total requests per second (by method and route)
2. **P95 Latency:** 95th percentile response time (threshold: 300ms yellow, 500ms red)
3. **Error Rate:** Percentage of failed requests (threshold: 1% yellow, 5% red)
4. **Top 10 Slowest Endpoints:** Identify performance bottlenecks
5. **Active Users by Tenant:** Real-time active user count per tenant
6. **Cache Hit Rate:** Percentage of cache hits (target: >80%)
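`histogram_quantile` never sees raw durations; it estimates the quantile by interpolating within the cumulative buckets declared in `MetricsService`. A simplified sketch of that estimation (it ignores the `+Inf` bucket and `rate()` windowing; names are illustrative):

```typescript
// Each bucket is cumulative: `cumulative` counts all observations <= `le`.
interface Bucket { le: number; cumulative: number; }

// Find the first bucket covering the target rank, then interpolate
// linearly inside it — the core of Prometheus's histogram_quantile.
function quantileFromBuckets(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulative;
  const rank = q * total;
  let prevLe = 0;
  let prevCum = 0;
  for (const b of buckets) {
    if (b.cumulative >= rank) {
      const inBucket = b.cumulative - prevCum;
      if (inBucket === 0) return b.le;
      return prevLe + ((rank - prevCum) / inBucket) * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCum = b.cumulative;
  }
  return buckets[buckets.length - 1].le;
}

// 90 of 100 requests finished under 0.1s, all 100 under 0.3s:
// rank 95 lands in the (0.1, 0.3] bucket → 0.1 + (5/10) * 0.2 = 0.2s
const p95 = quantileFromBuckets(0.95, [
  { le: 0.1, cumulative: 90 },
  { le: 0.3, cumulative: 100 },
]);
```

This is also why the bucket boundaries in the `Histogram` definition matter: the p95 estimate can never be more precise than the bucket it falls into.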
### 4.3 Dashboard 2: Database Performance
**Key Panels:**
1. **Database Connections:** Active vs. max connections
2. **Query Duration P95:** 95th percentile query time by table
3. **Slow Queries:** Queries taking >1 second
4. **Transactions per Second:** TPS rate
5. **Database Size:** Disk usage by schema
6. **Index Usage:** Most and least used indexes
7. **Lock Waits:** Blocking queries
8. **Replication Lag:** Lag between primary and replicas (if applicable)
**Example Queries:**
```promql
# Active connections
pg_stat_database_numbackends{datname="erp_generic"}
# Slow queries (mean execution time > 1s; requires pg_stat_statements and a custom exporter query)
pg_stat_statements_mean_exec_time{datname="erp_generic"} > 1000
# Database size
pg_database_size_bytes{datname="erp_generic"}
# TPS
rate(pg_stat_database_xact_commit{datname="erp_generic"}[5m]) + rate(pg_stat_database_xact_rollback{datname="erp_generic"}[5m])
```
### 4.4 Dashboard 3: Business Metrics
**Key Panels:**
1. **Sales Orders Created (Today):** Total sales orders by status
2. **Purchase Orders Created (Today):** Total purchase orders by status
3. **Revenue Trend (Last 30 days):** Daily revenue by tenant
4. **Invoices Generated (Today):** Total invoices by type (customer/supplier)
5. **Inventory Movements (Today):** Stock in/out movements
6. **Top 10 Customers by Revenue:** Revenue breakdown
7. **Order Fulfillment Rate:** Percentage of orders fulfilled on time
8. **Average Order Value:** Mean order value by tenant
**Example Queries:**
```promql
# Sales orders created today
increase(erp_sales_orders_created_total[1d])
# Revenue trend (requires custom metric)
sum by (tenant_id) (rate(erp_sales_order_amount_sum[1d]))
# Top 10 customers by revenue
topk(10, sum by (customer_id) (erp_sales_order_amount_sum))
```
---
## 5. ALERT RULES
### 5.1 Alertmanager Configuration
**File:** `alertmanager/alertmanager.yml`
```yaml
# Note: Alertmanager does not expand ${VAR} placeholders itself; substitute
# them at deploy time (e.g. with envsubst or a templated config).
global:
  resolve_timeout: 5m
  smtp_smarthost: '${SMTP_HOST}:${SMTP_PORT}'
  smtp_from: 'alertmanager@erp-generic.com'
  smtp_auth_username: '${SMTP_USER}'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Route alerts to different receivers
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Critical alerts → PagerDuty (on-call)
    - receiver: 'pagerduty'
      match:
        severity: critical
      continue: true

    # All alerts → Slack
    - receiver: 'slack'
      match_re:
        severity: critical|warning

    # Database alerts → DBA team
    - receiver: 'dba-email'
      match:
        component: postgres

    # Security alerts → Security team
    - receiver: 'security-email'
      match_re:
        alertname: '.*Security.*'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

receivers:
  - name: 'default'
    email_configs:
      - to: 'devops@erp-generic.com'
        headers:
          Subject: '[ERP Alert] {{ .GroupLabels.alertname }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'

  - name: 'slack'
    slack_configs:
      - channel: '#erp-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  - name: 'dba-email'
    email_configs:
      - to: 'dba@erp-generic.com'
        headers:
          Subject: '[Database Alert] {{ .GroupLabels.alertname }}'

  - name: 'security-email'
    email_configs:
      - to: 'security@erp-generic.com'
        headers:
          Subject: '[SECURITY ALERT] {{ .GroupLabels.alertname }}'
          Priority: 'urgent'
```
### 5.2 Application Alert Rules
**File:** `prometheus/alerts/application.yml`
```yaml
groups:
  - name: erp_application_alerts
    interval: 30s
    rules:
      # High Error Rate (sum() both sides so differing labels — error_type
      # vs. status_code — don't prevent the division from matching)
      - alert: HighErrorRate
        expr: |
          sum(rate(erp_http_request_errors_total[5m])) / sum(rate(erp_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-error-rate"

      # High P95 Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le, route) (rate(erp_http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
          component: backend
        annotations:
          summary: "High P95 latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s (threshold: 500ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-latency"

      # Service Down
      - alert: ServiceDown
        expr: up{job="erp-backend"} == 0
        for: 2m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "Backend service is down"
          description: "Backend {{ $labels.instance }} has been down for more than 2 minutes"
          runbook: "https://wiki.erp-generic.com/runbooks/service-down"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }} (threshold: 85%)"

      # Disk Space Low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} free"

      # Too Many Requests (DDoS protection)
      - alert: TooManyRequests
        expr: |
          sum(rate(erp_http_requests_total[1m])) > 10000
        for: 2m
        labels:
          severity: critical
          component: security
        annotations:
          summary: "Abnormally high request rate detected"
          description: "Request rate is {{ $value }} req/s (threshold: 10000 req/s). Possible DDoS attack."
          runbook: "https://wiki.erp-generic.com/runbooks/ddos-attack"

      # Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: |
          sum(rate(erp_cache_hits_total[5m])) / (sum(rate(erp_cache_hits_total[5m])) + sum(rate(erp_cache_misses_total[5m]))) < 0.6
        for: 15m
        labels:
          severity: warning
          component: cache
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }} (threshold: 60%)"
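The threshold logic behind `HighErrorRate` (and the dashboard's 1%/5% coloring) is just a ratio compared against two limits. A sketch in plain code, with illustrative names and the deltas standing in for `rate()` over the window:

```typescript
// Classify an error ratio against the document's thresholds:
// >5% → critical (HighErrorRate fires), >1% → warning (dashboard yellow).
function errorRateSeverity(errorsDelta: number, requestsDelta: number): "ok" | "warning" | "critical" {
  if (requestsDelta === 0) return "ok"; // no traffic → nothing to alert on
  const ratio = errorsDelta / requestsDelta;
  if (ratio > 0.05) return "critical";
  if (ratio > 0.01) return "warning";
  return "ok";
}
```

The zero-traffic guard mirrors a PromQL subtlety: dividing by a zero (or absent) request rate yields `NaN` or no series at all, so the rule simply does not fire rather than paging anyone.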
### 5.3 Database Alert Rules
**File:** `prometheus/alerts/database.yml`
```yaml
groups:
  - name: erp_database_alerts
    interval: 30s
    rules:
      # Database Down
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://wiki.erp-generic.com/runbooks/database-down"

      # Connection Pool Exhausted (aggregate both sides: numbackends is
      # per-database, max_connections is server-wide)
      - alert: ConnectionPoolExhausted
        expr: |
          (sum(pg_stat_database_numbackends) / max(pg_settings_max_connections)) > 0.9
        for: 2m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database connection pool almost exhausted"
          description: "Using {{ $value | humanizePercentage }} of max connections"
          runbook: "https://wiki.erp-generic.com/runbooks/connection-pool-exhausted"

      # Slow Queries (mean exec time is a gauge, so compare it directly)
      - alert: SlowQueries
        expr: |
          pg_stat_statements_mean_exec_time > 1000
        for: 10m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Slow database queries detected"
          description: "Mean query execution time is {{ $value }}ms (threshold: 1000ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/slow-queries"

      # High Number of Deadlocks
      - alert: HighDeadlocks
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 5
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "High number of database deadlocks"
          description: "Deadlock rate is {{ $value }}/s (threshold: 5/s)"

      # Replication Lag (if using replicas)
      - alert: ReplicationLag
        expr: |
          pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Database replication lag is high"
          description: "Replication lag is {{ $value }}s (threshold: 60s)"

      # Disk Space Low (Database)
      - alert: DatabaseDiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) < 0.15
        for: 5m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database disk space is low"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          runbook: "https://wiki.erp-generic.com/runbooks/database-disk-full"
### 5.4 Business Alert Rules
**File:** `prometheus/alerts/business.yml`
```yaml
groups:
  - name: erp_business_alerts
    interval: 1m
    rules:
      # No Sales Orders Created (Business Hours)
      # Note: hour() evaluates in UTC; offset for local business hours.
      - alert: NoSalesOrdersCreated
        expr: |
          sum(increase(erp_sales_orders_created_total[1h])) == 0
          and on() hour() >= 9 and on() hour() < 18
        for: 1h
        labels:
          severity: warning
          component: business
        annotations:
          summary: "No sales orders created in the last hour during business hours"
          description: "This might indicate a problem with the order creation system"

      # High Order Cancellation Rate (sum() so the status label on the
      # numerator doesn't block vector matching)
      - alert: HighOrderCancellationRate
        expr: |
          sum(rate(erp_sales_orders_created_total{status="cancelled"}[1h])) / sum(rate(erp_sales_orders_created_total[1h])) > 0.2
        for: 30m
        labels:
          severity: warning
          component: business
        annotations:
          summary: "High order cancellation rate"
          description: "{{ $value | humanizePercentage }} of orders are being cancelled (threshold: 20%)"

      # Failed Login Spike
      - alert: FailedLoginSpike
        expr: |
          sum(rate(erp_login_failures_total[5m])) > 10
        for: 5m
        labels:
          severity: warning
          component: security
        annotations:
          summary: "Spike in failed login attempts"
          description: "{{ $value }} failed logins per second (threshold: 10/s). Possible brute-force attack."
          runbook: "https://wiki.erp-generic.com/runbooks/brute-force-attack"
---
## 6. LOGGING STRATEGY
### 6.1 Winston Configuration
**File:** `backend/src/common/logger/logger.service.ts`
```typescript
import { Injectable, LoggerService as NestLoggerService } from '@nestjs/common';
import * as winston from 'winston';
import 'winston-daily-rotate-file';

@Injectable()
export class LoggerService implements NestLoggerService {
  private logger: winston.Logger;

  constructor() {
    this.logger = winston.createLogger({
      level: process.env.LOG_LEVEL || 'info',
      format: winston.format.combine(
        winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }),
        winston.format.errors({ stack: true }),
        winston.format.splat(),
        winston.format.json(),
      ),
      defaultMeta: {
        service: 'erp-generic-backend',
        environment: process.env.NODE_ENV,
      },
      transports: [
        // Console transport (for development)
        new winston.transports.Console({
          format: winston.format.combine(
            winston.format.colorize(),
            winston.format.printf(({ timestamp, level, message, context, ...meta }) => {
              return `${timestamp} [${level}] [${context || 'Application'}] ${message} ${
                Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
              }`;
            }),
          ),
        }),
        // File transport - All logs
        new winston.transports.DailyRotateFile({
          filename: 'logs/application-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '14d',
          zippedArchive: true,
        }),
        // File transport - Error logs only
        new winston.transports.DailyRotateFile({
          level: 'error',
          filename: 'logs/error-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '30d',
          zippedArchive: true,
        }),
        // File transport - Audit logs (security events)
        new winston.transports.DailyRotateFile({
          filename: 'logs/audit-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '50m',
          maxFiles: '90d', // Keep for 90 days (compliance)
          zippedArchive: true,
        }),
      ],
    });

    // Add Elasticsearch/Loki transport for production
    if (process.env.NODE_ENV === 'production') {
      // Example: Winston-Elasticsearch
      // this.logger.add(new WinstonElasticsearch({
      //   level: 'info',
      //   clientOpts: {
      //     node: process.env.ELASTICSEARCH_URL,
      //     auth: {
      //       username: process.env.ELASTICSEARCH_USER,
      //       password: process.env.ELASTICSEARCH_PASSWORD,
      //     },
      //   },
      //   index: 'erp-generic-logs',
      // }));
    }
  }

  log(message: string, context?: string, meta?: any) {
    this.logger.info(message, { context, ...meta });
  }

  error(message: string, trace?: string, context?: string, meta?: any) {
    this.logger.error(message, { trace, context, ...meta });
  }

  warn(message: string, context?: string, meta?: any) {
    this.logger.warn(message, { context, ...meta });
  }

  debug(message: string, context?: string, meta?: any) {
    this.logger.debug(message, { context, ...meta });
  }

  verbose(message: string, context?: string, meta?: any) {
    this.logger.verbose(message, { context, ...meta });
  }

  // Audit logging (security-sensitive events).
  // Note: as configured, the audit transport receives all info-level logs;
  // add a filter format if audit files should contain AUDIT_EVENT only.
  audit(event: string, userId: string, tenantId: string, details: any) {
    this.logger.info('AUDIT_EVENT', {
      event,
      userId,
      tenantId,
      details,
      timestamp: new Date().toISOString(),
      ip: details.ip,
      userAgent: details.userAgent,
    });
  }
}
```
### 6.2 Structured Logging Examples
```typescript
// Login attempt
logger.audit('USER_LOGIN', userId, tenantId, {
method: 'email',
ip: request.ip,
userAgent: request.headers['user-agent'],
success: true,
});
// Database query
logger.debug('DB_QUERY', 'DatabaseService', {
operation: 'SELECT',
table: 'auth.users',
duration: 45, // ms
rowCount: 1,
});
// API request
logger.info('HTTP_REQUEST', 'HttpMiddleware', {
method: 'POST',
path: '/api/sales/orders',
statusCode: 201,
duration: 234, // ms
userId: '123e4567-e89b-12d3-a456-426614174000',
tenantId: 'tenant-abc',
});
// Error with stack trace
logger.error('ORDER_CREATION_FAILED', error.stack, 'OrderService', {
orderId: '123',
tenantId: 'tenant-abc',
error: error.message,
});
```
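Structured metadata makes it easy to accidentally write secrets (passwords, tokens, auth headers) into log files that are retained for weeks. A minimal redaction helper — a hypothetical addition, not part of the `LoggerService` above — could scrub metadata before it reaches the logger:

```typescript
// Hypothetical helper: mask sensitive keys before passing metadata to the logger.
const SENSITIVE_KEYS = ['password', 'token', 'authorization', 'secret'];

export function redactSensitive(meta: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(meta)) {
    if (SENSITIVE_KEYS.includes(key.toLowerCase())) {
      out[key] = '***REDACTED***';
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      // Recurse into nested objects so deeply nested secrets are masked too
      out[key] = redactSensitive(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}

// Usage: logger.log('HTTP_REQUEST', 'HttpMiddleware', redactSensitive(meta));
```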
### 6.3 Log Aggregation (ELK Stack)
**Docker Compose for ELK Stack:**
```yaml
version: '3.9'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
container_name: erp-elasticsearch
environment:
- discovery.type=single-node
- ES_JAVA_OPTS=-Xms2g -Xmx2g
- xpack.security.enabled=false
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
networks:
- monitoring
restart: always
logstash:
image: docker.elastic.co/logstash/logstash:8.10.0
container_name: erp-logstash
volumes:
- ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
ports:
- "5044:5044"
environment:
LS_JAVA_OPTS: "-Xmx512m -Xms512m"
networks:
- monitoring
depends_on:
- elasticsearch
restart: always
kibana:
image: docker.elastic.co/kibana/kibana:8.10.0
container_name: erp-kibana
ports:
- "5601:5601"
environment:
ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
networks:
- monitoring
depends_on:
- elasticsearch
restart: always
volumes:
elasticsearch_data:
networks:
monitoring:
external: true
name: erp-monitoring
```
**Logstash Configuration:**
```conf
input {
file {
path => "/var/log/erp-generic/application-*.log"
type => "application"
codec => json
start_position => "beginning"
}
file {
path => "/var/log/erp-generic/error-*.log"
type => "error"
codec => json
start_position => "beginning"
}
file {
path => "/var/log/erp-generic/audit-*.log"
type => "audit"
codec => json
start_position => "beginning"
}
}
filter {
# Parse timestamp
date {
match => [ "timestamp", "ISO8601" ]
target => "@timestamp"
}
# Add geoip for IP addresses
if [ip] {
geoip {
source => "ip"
target => "geoip"
}
}
# Extract tenant_id as a field
if [tenantId] {
mutate {
add_field => { "tenant" => "%{tenantId}" }
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "erp-generic-logs-%{+YYYY.MM.dd}"
}
# Debug output (optional)
stdout {
codec => rubydebug
}
}
```
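The `%{+YYYY.MM.dd}` sprintf pattern in the output block creates one index per day, named from the event's `@timestamp` in UTC. Retention or query tooling that needs to address a specific day's index can derive the same name — a small illustrative helper, not part of the stack above:

```typescript
// Mirror Logstash's daily index naming: <prefix>-YYYY.MM.dd (UTC, zero-padded).
export function dailyIndexName(date: Date, prefix = 'erp-generic-logs'): string {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  const d = String(date.getUTCDate()).padStart(2, '0');
  return `${prefix}-${y}.${m}.${d}`;
}
```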
---
## 7. APPLICATION PERFORMANCE MONITORING (APM)
### 7.1 Custom Metrics Endpoints
**File:** `backend/src/metrics/metrics.controller.ts`
```typescript
import { Controller, Get } from '@nestjs/common';
import { MetricsService } from '../common/metrics/metrics.service';
import { PrismaService } from '../common/prisma/prisma.service';
@Controller('metrics')
export class MetricsController {
constructor(
private metricsService: MetricsService,
private prisma: PrismaService,
) {}
@Get()
getMetrics() {
return this.metricsService.getMetrics();
}
@Get('business')
async getBusinessMetrics() {
// Aggregate business metrics from database
const [salesOrders, purchaseOrders, invoices, activeUsers] = await Promise.all([
this.prisma.salesOrder.count(),
this.prisma.purchaseOrder.count(),
this.prisma.invoice.count(),
this.prisma.user.count({ where: { status: 'active' } }),
]);
return {
sales_orders_total: salesOrders,
purchase_orders_total: purchaseOrders,
invoices_total: invoices,
active_users_total: activeUsers,
};
}
}
```
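Note that `/metrics/business` returns JSON, while Prometheus scrapes the plain-text exposition format. If these gauges should be scrapable directly, they can be rendered as exposition text — a hedged sketch of the format only (the `prom-client` registry behind `MetricsService` is not shown in this document):

```typescript
// Render a flat map of gauge values in the Prometheus text exposition format:
// a "# TYPE" line followed by "<name> <value>" for each metric.
export function toPrometheusText(metrics: Record<string, number>): string {
  const lines: string[] = [];
  for (const [name, value] of Object.entries(metrics)) {
    lines.push(`# TYPE ${name} gauge`);
    lines.push(`${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}
```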
### 7.2 Performance Profiling
**Prisma Query Logging:**
```typescript
// prisma/prisma.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { LoggerService } from '../logger/logger.service';
@Injectable()
export class PrismaService extends PrismaClient implements OnModuleInit {
constructor(private logger: LoggerService) {
super({
log: [
{ emit: 'event', level: 'query' },
{ emit: 'event', level: 'error' },
{ emit: 'event', level: 'warn' },
],
});
// Log slow queries (>100ms)
this.$on('query' as never, (e: any) => {
if (e.duration > 100) {
this.logger.warn('SLOW_QUERY', 'PrismaService', {
query: e.query,
duration: e.duration,
params: e.params,
});
}
});
// Log query errors
this.$on('error' as never, (e: any) => {
this.logger.error('DB_ERROR', e.message, 'PrismaService', {
target: e.target,
});
});
}
async onModuleInit() {
await this.$connect();
}
}
```
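Beyond the fixed 100 ms threshold above, query durations are commonly recorded as a histogram so p95/p99 can be derived in Prometheus. A minimal sketch of the cumulative bucket counting that a client library such as `prom-client` performs (illustrative only, with hypothetical bucket bounds):

```typescript
// Cumulative histogram buckets, as Prometheus histograms record them:
// each bucket counts observations with duration <= its upper bound ("le").
const BUCKETS_MS = [10, 50, 100, 250, 500, 1000];

export function observeDurations(durationsMs: number[]): Map<number | '+Inf', number> {
  const counts = new Map<number | '+Inf', number>();
  for (const le of BUCKETS_MS) counts.set(le, 0);
  counts.set('+Inf', 0);
  for (const d of durationsMs) {
    for (const le of BUCKETS_MS) {
      if (d <= le) counts.set(le, (counts.get(le) ?? 0) + 1);
    }
    // The +Inf bucket counts every observation (it is the histogram's total count)
    counts.set('+Inf', (counts.get('+Inf') ?? 0) + 1);
  }
  return counts;
}
```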
---
## 8. HEALTH CHECKS
### 8.1 Health Check Endpoints
```typescript
// health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import { HealthCheck, HealthCheckService, PrismaHealthIndicator, MemoryHealthIndicator, DiskHealthIndicator } from '@nestjs/terminus';
import { RedisHealthIndicator } from './redis.health';
import { PrismaService } from '../common/prisma/prisma.service';
@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: PrismaHealthIndicator,
    private prisma: PrismaService,
    private redis: RedisHealthIndicator,
    private memory: MemoryHealthIndicator,
    private disk: DiskHealthIndicator,
  ) {}
  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // PrismaHealthIndicator.pingCheck requires the Prisma client instance as its second argument
      () => this.db.pingCheck('database', this.prisma, { timeout: 3000 }),
      () => this.redis.isHealthy('redis'),
      () => this.memory.checkHeap('memory_heap', 200 * 1024 * 1024),
      () => this.disk.checkStorage('disk', { path: '/', thresholdPercent: 0.9 }),
    ]);
  }
  @Get('live')
  liveness() {
    return { status: 'ok', timestamp: new Date().toISOString() };
  }
  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', this.prisma),
      () => this.redis.isHealthy('redis'),
    ]);
  }
}
```
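Terminus reports overall `status: 'ok'` only when every indicator passes, and orchestrators key readiness off that top-level field. The aggregation can be sketched as follows — an illustrative reimplementation of the response shape, not the Terminus source:

```typescript
// Each check yields a keyed result, e.g. { database: { status: 'up' } }.
type IndicatorResult = Record<string, { status: 'up' | 'down'; [k: string]: unknown }>;

// Combine individual checks into the { status, info, error, details } shape
// that a Terminus health endpoint returns.
export function aggregate(checks: IndicatorResult[]) {
  const info: IndicatorResult = {};
  const error: IndicatorResult = {};
  for (const check of checks) {
    for (const [name, detail] of Object.entries(check)) {
      (detail.status === 'up' ? info : error)[name] = detail;
    }
  }
  return {
    status: Object.keys(error).length === 0 ? 'ok' : 'error',
    info,
    error,
    details: { ...info, ...error },
  };
}
```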
---
## 9. DISTRIBUTED TRACING
### 9.1 OpenTelemetry Setup
```typescript
// tracing.ts (Bootstrap file)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
const sdk = new NodeSDK({
traceExporter: new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://jaeger:14268/api/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
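With auto-instrumentation enabled, HTTP calls between services carry the W3C `traceparent` header, which is what ties spans from different services into one trace. Its layout (`version-traceid-parentid-flags`) can be parsed as follows — an illustrative sketch for debugging propagation, not part of the OpenTelemetry API:

```typescript
// Parse a W3C traceparent header:
// version (2 hex) "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex).
export function parseTraceparent(header: string) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```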
---
## 10. ON-CALL & INCIDENT RESPONSE
### 10.1 On-Call Rotation
- **Primary On-Call:** DevOps Engineer (24/7)
- **Secondary On-Call:** Backend Lead
- **Escalation Path:** CTO → CEO
### 10.2 Incident Severity
| Severity | Response Time | Examples |
|----------|---------------|----------|
| **P0 (Critical)** | 15 min | System down, data loss |
| **P1 (High)** | 1 hour | Major feature broken |
| **P2 (Medium)** | 4 hours | Minor feature broken |
| **P3 (Low)** | 24 hours | Cosmetic issue |
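The response-time targets above can also be enforced in paging automation. A hypothetical helper (the table encoded as data, names are illustrative) that computes the acknowledgement deadline for an incident:

```typescript
// Response-time targets from the severity table, in minutes.
const RESPONSE_SLA_MINUTES: Record<string, number> = {
  P0: 15,   // Critical: system down, data loss
  P1: 60,   // High: major feature broken
  P2: 240,  // Medium: minor feature broken
  P3: 1440, // Low: cosmetic issue
};

// Deadline by which the on-call engineer must acknowledge the incident.
export function ackDeadline(severity: 'P0' | 'P1' | 'P2' | 'P3', openedAt: Date): Date {
  return new Date(openedAt.getTime() + RESPONSE_SLA_MINUTES[severity] * 60_000);
}
```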
---
## 11. REFERENCES
- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
---
**Document:** MONITORING-OBSERVABILITY.md
**Version:** 1.0
**Total Pages:** ~18
**Last Updated:** 2025-11-24