erp-core/docs/07-devops/MONITORING-OBSERVABILITY.md


# MONITORING & OBSERVABILITY - ERP Generic
**Last updated:** 2025-11-24
**Owner:** DevOps Team / SRE Team
**Status:** ✅ Production-Ready
---
## TABLE OF CONTENTS
1. [Overview](#1-overview)
2. [Observability Pillars](#2-observability-pillars)
3. [Prometheus Setup](#3-prometheus-setup)
4. [Grafana Dashboards](#4-grafana-dashboards)
5. [Alert Rules](#5-alert-rules)
6. [Logging Strategy](#6-logging-strategy)
7. [Application Performance Monitoring (APM)](#7-application-performance-monitoring-apm)
8. [Health Checks](#8-health-checks)
9. [Distributed Tracing](#9-distributed-tracing)
10. [On-Call & Incident Response](#10-on-call--incident-response)
11. [References](#11-references)
---
## 1. OVERVIEW
### 1.1 Monitoring Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Backend │ │ Frontend │ │ Postgres │ │ Redis │ │
│ │ (Metrics)│ │ (Metrics)│ │(Exporter)│ │(Exporter)│ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
└────────────────────────────┼─────────────────────────────────────────┘
│ (Scrape metrics every 15s)
┌─────────────────────────────────────────────────────────────────────┐
│ Prometheus (TSDB) │
│ - Collects metrics from all targets │
│ - Evaluates alert rules │
│ - Stores time-series data (15 days retention) │
└────────┬────────────────────────────────┬─────────────────────────┘
│ │
│ (Query metrics) │ (Send alerts)
↓ ↓
┌─────────────────────┐ ┌──────────────────────┐
│ Grafana │ │ Alertmanager │
│ - Dashboards │ │ - Route alerts │
│ - Visualization │ │ - Deduplication │
│ - Alerting │ │ - Silencing │
└─────────────────────┘ └──────┬───────────────┘
┌───────────────────┼────────────────┐
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ PagerDuty│ │ Slack │ │ Email │
│(On-call) │ │(#alerts) │ │(Team) │
└──────────┘ └──────────┘ └──────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Logging Pipeline │
│ │
│ Application → Winston → ELK Stack / Loki │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │
│ │ Logs │ ───→ │ Elasticsearch│ ───→ │ Kibana │ │
│ │(JSON) │ │ or Loki │ │(Search) │ │
│ └──────────┘ └──────────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Distributed Tracing │
│ │
│ Application → OpenTelemetry → Jaeger / Tempo │
│ (Trace spans for requests across microservices) │
└─────────────────────────────────────────────────────────────────────┘
```
### 1.2 Observability Goals
**Why Observability?**
- **Proactive Monitoring:** Detect issues before users report them
- **Faster Debugging:** Reduce MTTD (Mean Time to Detect) from hours to minutes
- **Performance Optimization:** Identify bottlenecks and slow queries
- **Capacity Planning:** Predict when to scale resources
- **SLA Compliance:** Monitor uptime, response times, error rates
**Key Metrics (Google's Four Golden Signals):**
1. **Latency:** Request/response time (p50, p95, p99)
2. **Traffic:** Requests per second (throughput)
3. **Errors:** Error rate (5xx responses, exceptions)
4. **Saturation:** Resource utilization (CPU, memory, disk, DB connections)
**SLOs (Service Level Objectives):**
- **Availability:** 99.9% uptime (at most 8.76 hours of downtime per year)
- **Latency:** p95 API response time < 300ms
- **Error Rate:** < 0.1% (this is the error budget)
- **Data Durability:** Zero data loss
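The availability figure maps directly to a downtime budget. A minimal sketch of the arithmetic (the 365-day year and function name are assumptions for illustration):

```typescript
// Downtime budget implied by an availability SLO, assuming a 365-day year.
// 99.9% leaves 0.1% of 8760 hours ≈ 8.76 hours/year, matching the SLO above.
function downtimeBudgetHours(sloPercent: number, hoursPerYear = 365 * 24): number {
  return ((100 - sloPercent) / 100) * hoursPerYear;
}

console.log(downtimeBudgetHours(99.9).toFixed(2));  // "8.76"
console.log(downtimeBudgetHours(99.99).toFixed(2)); // "0.88"
```

Tightening the SLO by one nine (99.99%) shrinks the yearly budget to under an hour, which is why the target should match what the team can actually defend.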
---
## 2. OBSERVABILITY PILLARS
### 2.1 The Three Pillars
**1. Metrics (What is happening?)**
- Quantitative measurements over time
- Examples: CPU usage, request count, response time
- Tool: Prometheus + Grafana
**2. Logs (What happened?)**
- Discrete events with context
- Examples: "User X logged in", "Query took 2.5s"
- Tool: Winston + ELK Stack / Loki
**3. Traces (Why did it happen?)**
- Request flow across services
- Examples: API call → Database query → Redis cache → Response
- Tool: OpenTelemetry + Jaeger
### 2.2 Correlation
```
Example: High p99 latency alert
├── Metrics: p99 latency = 3s (threshold: 500ms)
│ └── Which endpoint? /api/products
├── Logs: Search for slow queries in /api/products
│ └── Found: SELECT * FROM inventory.stock_movements (2.8s)
└── Traces: Trace ID abc123 shows:
├── API handler: 50ms
├── Database query: 2800ms ← Bottleneck!
└── Response serialization: 150ms
Root cause: Missing index on inventory.stock_movements(product_id)
Fix: CREATE INDEX idx_stock_movements_product_id ON inventory.stock_movements(product_id);
```
---
## 3. PROMETHEUS SETUP
### 3.1 Prometheus Configuration
**File:** `prometheus/prometheus.yml`
```yaml
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds
  scrape_timeout: 10s
  external_labels:
    cluster: 'erp-generic-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s

# Load alert rules
rule_files:
  - '/etc/prometheus/alerts/application.yml'
  - '/etc/prometheus/alerts/infrastructure.yml'
  - '/etc/prometheus/alerts/database.yml'
  - '/etc/prometheus/alerts/business.yml'

# Scrape configurations
scrape_configs:
  # Backend API (NestJS with Prometheus middleware)
  - job_name: 'erp-backend'
    static_configs:
      - targets: ['backend:3000']
        labels:
          service: 'backend'
          component: 'api'
    metrics_path: '/metrics'
    scrape_interval: 15s

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          service: 'database'
          component: 'postgres'
    scrape_interval: 30s

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          service: 'cache'
          component: 'redis'
    scrape_interval: 30s

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'infrastructure'
          component: 'host'
    scrape_interval: 15s

  # Frontend (Nginx metrics)
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          service: 'frontend'
          component: 'nginx'
    scrape_interval: 30s

  # Prometheus itself (meta-monitoring)
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'
          component: 'prometheus'
```
### 3.2 Docker Compose for Monitoring Stack
**File:** `docker-compose.monitoring.yml`
```yaml
version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: erp-prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: always

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: erp-alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: always

  grafana:
    image: grafana/grafana:10.1.0
    container_name: erp-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel
      - GF_SERVER_ROOT_URL=https://grafana.erp-generic.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=${SMTP_HOST}:${SMTP_PORT}
      - GF_SMTP_USER=${SMTP_USER}
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3001:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus
    restart: always

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.14.0
    container_name: erp-postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring
      - erp-network
    restart: always

  redis-exporter:
    image: oliver006/redis_exporter:v1.54.0
    container_name: erp-redis-exporter
    environment:
      REDIS_ADDR: "redis:6379"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    ports:
      - "9121:9121"
    networks:
      - monitoring
      - erp-network
    restart: always

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: erp-node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: always

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    name: erp-monitoring
  erp-network:
    external: true
    name: erp-network-internal
```
### 3.3 Backend Metrics Instrumentation
**File:** `backend/src/common/metrics/metrics.module.ts`
```typescript
import { Module } from '@nestjs/common';
import { PrometheusModule } from '@willsoto/nestjs-prometheus';
import { MetricsService } from './metrics.service';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: {
        enabled: true,
        config: {
          prefix: 'erp_',
        },
      },
    }),
  ],
  providers: [MetricsService],
  exports: [MetricsService],
})
export class MetricsModule {}
```
**File:** `backend/src/common/metrics/metrics.service.ts`
```typescript
import { Injectable } from '@nestjs/common';
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

@Injectable()
export class MetricsService {
  // Custom metrics live in a dedicated Registry, exposed via getMetrics().
  // Note: this registry is separate from the default one that
  // PrometheusModule serves; merge registries if the module's /metrics
  // endpoint should also include these metrics.
  private registry: Registry;

  // HTTP Metrics
  private httpRequestDuration!: Histogram;
  private httpRequestTotal!: Counter;
  private httpRequestErrors!: Counter;

  // Database Metrics
  private dbQueryDuration!: Histogram;
  private dbConnectionsActive!: Gauge;
  private dbQueryErrors!: Counter;

  // Business Metrics
  private salesOrdersCreated!: Counter;
  private purchaseOrdersCreated!: Counter;
  private invoicesGenerated!: Counter;
  private inventoryMovements!: Counter;

  // Cache Metrics
  private cacheHits!: Counter;
  private cacheMisses!: Counter;

  // Authentication Metrics
  private loginAttempts!: Counter;
  private loginFailures!: Counter;
  private activeUsers!: Gauge;

  constructor() {
    this.registry = new Registry();
    this.initializeMetrics();
  }

  private initializeMetrics() {
    // HTTP Request Duration
    this.httpRequestDuration = new Histogram({
      name: 'erp_http_request_duration_seconds',
      help: 'Duration of HTTP requests in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
    });

    // HTTP Request Total
    this.httpRequestTotal = new Counter({
      name: 'erp_http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
    });

    // HTTP Request Errors
    this.httpRequestErrors = new Counter({
      name: 'erp_http_request_errors_total',
      help: 'Total number of HTTP request errors',
      labelNames: ['method', 'route', 'error_type'],
    });

    // Database Query Duration
    this.dbQueryDuration = new Histogram({
      name: 'erp_db_query_duration_seconds',
      help: 'Duration of database queries in seconds',
      labelNames: ['operation', 'table'],
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2],
    });

    // Database Active Connections
    this.dbConnectionsActive = new Gauge({
      name: 'erp_db_connections_active',
      help: 'Number of active database connections',
    });

    // Database Query Errors
    this.dbQueryErrors = new Counter({
      name: 'erp_db_query_errors_total',
      help: 'Total number of database query errors',
      labelNames: ['operation', 'error_type'],
    });

    // Business Metrics - Sales Orders
    this.salesOrdersCreated = new Counter({
      name: 'erp_sales_orders_created_total',
      help: 'Total number of sales orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Purchase Orders
    this.purchaseOrdersCreated = new Counter({
      name: 'erp_purchase_orders_created_total',
      help: 'Total number of purchase orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Invoices
    this.invoicesGenerated = new Counter({
      name: 'erp_invoices_generated_total',
      help: 'Total number of invoices generated',
      labelNames: ['tenant_id', 'type'],
    });

    // Business Metrics - Inventory Movements
    this.inventoryMovements = new Counter({
      name: 'erp_inventory_movements_total',
      help: 'Total number of inventory movements',
      labelNames: ['tenant_id', 'type'],
    });

    // Cache Hits
    this.cacheHits = new Counter({
      name: 'erp_cache_hits_total',
      help: 'Total number of cache hits',
      labelNames: ['cache_key'],
    });

    // Cache Misses
    this.cacheMisses = new Counter({
      name: 'erp_cache_misses_total',
      help: 'Total number of cache misses',
      labelNames: ['cache_key'],
    });

    // Login Attempts
    this.loginAttempts = new Counter({
      name: 'erp_login_attempts_total',
      help: 'Total number of login attempts',
      labelNames: ['tenant_id', 'method'],
    });

    // Login Failures
    this.loginFailures = new Counter({
      name: 'erp_login_failures_total',
      help: 'Total number of failed login attempts',
      labelNames: ['tenant_id', 'reason'],
    });

    // Active Users
    this.activeUsers = new Gauge({
      name: 'erp_active_users',
      help: 'Number of currently active users',
      labelNames: ['tenant_id'],
    });

    // Register all metrics
    this.registry.registerMetric(this.httpRequestDuration);
    this.registry.registerMetric(this.httpRequestTotal);
    this.registry.registerMetric(this.httpRequestErrors);
    this.registry.registerMetric(this.dbQueryDuration);
    this.registry.registerMetric(this.dbConnectionsActive);
    this.registry.registerMetric(this.dbQueryErrors);
    this.registry.registerMetric(this.salesOrdersCreated);
    this.registry.registerMetric(this.purchaseOrdersCreated);
    this.registry.registerMetric(this.invoicesGenerated);
    this.registry.registerMetric(this.inventoryMovements);
    this.registry.registerMetric(this.cacheHits);
    this.registry.registerMetric(this.cacheMisses);
    this.registry.registerMetric(this.loginAttempts);
    this.registry.registerMetric(this.loginFailures);
    this.registry.registerMetric(this.activeUsers);
  }

  // Public methods to record metrics
  recordHttpRequest(method: string, route: string, statusCode: number, duration: number) {
    this.httpRequestDuration.observe({ method, route, status_code: statusCode }, duration);
    this.httpRequestTotal.inc({ method, route, status_code: statusCode });
  }

  recordHttpError(method: string, route: string, errorType: string) {
    this.httpRequestErrors.inc({ method, route, error_type: errorType });
  }

  recordDbQuery(operation: string, table: string, duration: number) {
    this.dbQueryDuration.observe({ operation, table }, duration);
  }

  recordDbError(operation: string, errorType: string) {
    this.dbQueryErrors.inc({ operation, error_type: errorType });
  }

  setDbConnectionsActive(count: number) {
    this.dbConnectionsActive.set(count);
  }

  recordSalesOrder(tenantId: string, status: string) {
    this.salesOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordPurchaseOrder(tenantId: string, status: string) {
    this.purchaseOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordInvoice(tenantId: string, type: string) {
    this.invoicesGenerated.inc({ tenant_id: tenantId, type });
  }

  recordInventoryMovement(tenantId: string, type: string) {
    this.inventoryMovements.inc({ tenant_id: tenantId, type });
  }

  recordCacheHit(key: string) {
    this.cacheHits.inc({ cache_key: key });
  }

  recordCacheMiss(key: string) {
    this.cacheMisses.inc({ cache_key: key });
  }

  recordLoginAttempt(tenantId: string, method: string) {
    this.loginAttempts.inc({ tenant_id: tenantId, method });
  }

  recordLoginFailure(tenantId: string, reason: string) {
    this.loginFailures.inc({ tenant_id: tenantId, reason });
  }

  setActiveUsers(tenantId: string, count: number) {
    this.activeUsers.set({ tenant_id: tenantId }, count);
  }

  // prom-client v13+ returns a Promise from Registry#metrics()
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
```
**File:** `backend/src/common/interceptors/metrics.interceptor.ts`
```typescript
import { Injectable, NestInterceptor, ExecutionContext, CallHandler } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from '../metrics/metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private metricsService: MetricsService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const startTime = Date.now();

    return next.handle().pipe(
      tap({
        next: () => {
          const response = context.switchToHttp().getResponse();
          const duration = (Date.now() - startTime) / 1000; // Convert to seconds
          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode,
            duration,
          );
        },
        error: (error) => {
          const duration = (Date.now() - startTime) / 1000;
          const response = context.switchToHttp().getResponse();
          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode || 500,
            duration,
          );
          this.metricsService.recordHttpError(
            request.method,
            request.route?.path || request.url,
            error.name || 'UnknownError',
          );
        },
      }),
    );
  }
}
```
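Outside NestJS the same pattern reduces to a timing wrapper around a handler. A minimal sketch under that assumption (the `Recorder` callback stands in for `MetricsService.recordHttpRequest`; all names here are illustrative):

```typescript
// Time a handler and hand (method, route, status, seconds) to a recorder,
// mirroring what the interceptor above does inside NestJS.
type Recorder = (method: string, route: string, status: number, seconds: number) => void;

async function withHttpMetrics<T>(
  method: string,
  route: string,
  handler: () => Promise<{ status: number; body: T }>,
  record: Recorder,
): Promise<{ status: number; body: T }> {
  const start = Date.now();
  try {
    const res = await handler();
    record(method, route, res.status, (Date.now() - start) / 1000);
    return res;
  } catch (err) {
    // Failed requests are still observed, so error latency shows up in p95/p99.
    record(method, route, 500, (Date.now() - start) / 1000);
    throw err;
  }
}
```

Recording in the `catch` path matters: if errors were dropped, the latency histogram would silently exclude the slowest (failing) requests.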
---
## 4. GRAFANA DASHBOARDS
### 4.1 Dashboard Provisioning
**File:** `grafana/provisioning/datasources/prometheus.yml`
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      queryTimeout: "60s"
      httpMethod: "POST"
```
**File:** `grafana/provisioning/dashboards/dashboard-provider.yml`
```yaml
apiVersion: 1

providers:
  - name: 'ERP Generic Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```
### 4.2 Dashboard 1: Application Performance
**File:** `grafana/dashboards/application-performance.json` (Simplified structure)
```json
{
  "dashboard": {
    "title": "ERP Generic - Application Performance",
    "tags": ["erp", "application", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le, route) (rate(erp_http_request_duration_seconds_bucket[5m]))) * 1000",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 300, "color": "yellow" },
          { "value": 500, "color": "red" }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (route) (rate(erp_http_request_errors_total[5m])) / sum by (route) (rate(erp_http_requests_total[5m])) * 100",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      },
      {
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (route) (rate(erp_http_request_duration_seconds_sum[5m])) / sum by (route) (rate(erp_http_request_duration_seconds_count[5m])))",
            "format": "table"
          }
        ]
      },
      {
        "title": "Active Users by Tenant",
        "type": "graph",
        "targets": [
          {
            "expr": "erp_active_users",
            "legendFormat": "{{tenant_id}}"
          }
        ]
      },
      {
        "title": "Cache Hit Rate (%)",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(erp_cache_hits_total[5m])) / (sum(rate(erp_cache_hits_total[5m])) + sum(rate(erp_cache_misses_total[5m]))) * 100"
          }
        ]
      }
    ]
  }
}
```
**Key Panels:**
1. **Request Rate:** Total requests per second (by method and route)
2. **P95 Latency:** 95th percentile response time (threshold: 300ms yellow, 500ms red)
3. **Error Rate:** Percentage of failed requests (threshold: 1% yellow, 5% red)
4. **Top 10 Slowest Endpoints:** Identify performance bottlenecks
5. **Active Users by Tenant:** Real-time active user count per tenant
6. **Cache Hit Rate:** Percentage of cache hits (target: >80%)
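`histogram_quantile` never sees raw durations; it estimates the quantile by interpolating within the cumulative buckets declared in `MetricsService`. A simplified sketch of that estimation (it ignores the `+Inf` bucket and `rate()` windowing; names are illustrative):

```typescript
// Each bucket is cumulative: `cumulative` counts all observations <= `le`.
interface Bucket { le: number; cumulative: number; }

// Find the first bucket covering the target rank, then interpolate
// linearly inside it — the core of Prometheus's histogram_quantile.
function quantileFromBuckets(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulative;
  const rank = q * total;
  let prevLe = 0;
  let prevCum = 0;
  for (const b of buckets) {
    if (b.cumulative >= rank) {
      const inBucket = b.cumulative - prevCum;
      if (inBucket === 0) return b.le;
      return prevLe + ((rank - prevCum) / inBucket) * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCum = b.cumulative;
  }
  return buckets[buckets.length - 1].le;
}

// 90 of 100 requests finished under 0.1s, all 100 under 0.3s:
// rank 95 lands in the (0.1, 0.3] bucket → 0.1 + (5/10) * 0.2 = 0.2s
const p95 = quantileFromBuckets(0.95, [
  { le: 0.1, cumulative: 90 },
  { le: 0.3, cumulative: 100 },
]);
```

This is also why the bucket boundaries in the `Histogram` definition matter: the p95 estimate can never be more precise than the bucket it falls into.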
### 4.3 Dashboard 2: Database Performance
**Key Panels:**
1. **Database Connections:** Active vs. max connections
2. **Query Duration P95:** 95th percentile query time by table
3. **Slow Queries:** Queries taking >1 second
4. **Transactions per Second:** TPS rate
5. **Database Size:** Disk usage by schema
6. **Index Usage:** Most and least used indexes
7. **Lock Waits:** Blocking queries
8. **Replication Lag:** Lag between primary and replicas (if applicable)
**Example Queries:**
```promql
# Active connections
pg_stat_database_numbackends{datname="erp_generic"}
# Slow queries (mean execution time > 1s; requires pg_stat_statements and a custom exporter query)
pg_stat_statements_mean_exec_time{datname="erp_generic"} > 1000
# Database size
pg_database_size_bytes{datname="erp_generic"}
# TPS
rate(pg_stat_database_xact_commit{datname="erp_generic"}[5m]) + rate(pg_stat_database_xact_rollback{datname="erp_generic"}[5m])
```
### 4.4 Dashboard 3: Business Metrics
**Key Panels:**
1. **Sales Orders Created (Today):** Total sales orders by status
2. **Purchase Orders Created (Today):** Total purchase orders by status
3. **Revenue Trend (Last 30 days):** Daily revenue by tenant
4. **Invoices Generated (Today):** Total invoices by type (customer/supplier)
5. **Inventory Movements (Today):** Stock in/out movements
6. **Top 10 Customers by Revenue:** Revenue breakdown
7. **Order Fulfillment Rate:** Percentage of orders fulfilled on time
8. **Average Order Value:** Mean order value by tenant
**Example Queries:**
```promql
# Sales orders created today
increase(erp_sales_orders_created_total[1d])
# Revenue trend (requires custom metric)
sum by (tenant_id) (rate(erp_sales_order_amount_sum[1d]))
# Top 10 customers by revenue
topk(10, sum by (customer_id) (erp_sales_order_amount_sum))
```
---
## 5. ALERT RULES
### 5.1 Alertmanager Configuration
**File:** `alertmanager/alertmanager.yml`
```yaml
# Note: Alertmanager does not expand ${VAR} placeholders itself; substitute
# them at deploy time (e.g. with envsubst or a templated config).
global:
  resolve_timeout: 5m
  smtp_smarthost: '${SMTP_HOST}:${SMTP_PORT}'
  smtp_from: 'alertmanager@erp-generic.com'
  smtp_auth_username: '${SMTP_USER}'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Route alerts to different receivers
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Critical alerts → PagerDuty (on-call)
    - receiver: 'pagerduty'
      match:
        severity: critical
      continue: true

    # All alerts → Slack
    - receiver: 'slack'
      match_re:
        severity: critical|warning

    # Database alerts → DBA team
    - receiver: 'dba-email'
      match:
        component: postgres

    # Security alerts → Security team
    - receiver: 'security-email'
      match_re:
        alertname: '.*Security.*'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

receivers:
  - name: 'default'
    email_configs:
      - to: 'devops@erp-generic.com'
        headers:
          Subject: '[ERP Alert] {{ .GroupLabels.alertname }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'

  - name: 'slack'
    slack_configs:
      - channel: '#erp-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  - name: 'dba-email'
    email_configs:
      - to: 'dba@erp-generic.com'
        headers:
          Subject: '[Database Alert] {{ .GroupLabels.alertname }}'

  - name: 'security-email'
    email_configs:
      - to: 'security@erp-generic.com'
        headers:
          Subject: '[SECURITY ALERT] {{ .GroupLabels.alertname }}'
          Priority: 'urgent'
```
### 5.2 Application Alert Rules
**File:** `prometheus/alerts/application.yml`
```yaml
groups:
  - name: erp_application_alerts
    interval: 30s
    rules:
      # High Error Rate (sum() both sides so differing labels — error_type
      # vs. status_code — don't prevent the division from matching)
      - alert: HighErrorRate
        expr: |
          sum(rate(erp_http_request_errors_total[5m])) / sum(rate(erp_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-error-rate"

      # High P95 Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le, route) (rate(erp_http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
          component: backend
        annotations:
          summary: "High P95 latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s (threshold: 500ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-latency"

      # Service Down
      - alert: ServiceDown
        expr: up{job="erp-backend"} == 0
        for: 2m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "Backend service is down"
          description: "Backend {{ $labels.instance }} has been down for more than 2 minutes"
          runbook: "https://wiki.erp-generic.com/runbooks/service-down"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }} (threshold: 85%)"

      # Disk Space Low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} free"

      # Too Many Requests (DDoS protection)
      - alert: TooManyRequests
        expr: |
          sum(rate(erp_http_requests_total[1m])) > 10000
        for: 2m
        labels:
          severity: critical
          component: security
        annotations:
          summary: "Abnormally high request rate detected"
          description: "Request rate is {{ $value }} req/s (threshold: 10000 req/s). Possible DDoS attack."
          runbook: "https://wiki.erp-generic.com/runbooks/ddos-attack"

      # Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: |
          sum(rate(erp_cache_hits_total[5m])) / (sum(rate(erp_cache_hits_total[5m])) + sum(rate(erp_cache_misses_total[5m]))) < 0.6
        for: 15m
        labels:
          severity: warning
          component: cache
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }} (threshold: 60%)"
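The threshold logic behind `HighErrorRate` (and the dashboard's 1%/5% coloring) is just a ratio compared against two limits. A sketch in plain code, with illustrative names and the deltas standing in for `rate()` over the window:

```typescript
// Classify an error ratio against the document's thresholds:
// >5% → critical (HighErrorRate fires), >1% → warning (dashboard yellow).
function errorRateSeverity(errorsDelta: number, requestsDelta: number): "ok" | "warning" | "critical" {
  if (requestsDelta === 0) return "ok"; // no traffic → nothing to alert on
  const ratio = errorsDelta / requestsDelta;
  if (ratio > 0.05) return "critical";
  if (ratio > 0.01) return "warning";
  return "ok";
}
```

The zero-traffic guard mirrors a PromQL subtlety: dividing by a zero (or absent) request rate yields `NaN` or no series at all, so the rule simply does not fire rather than paging anyone.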
### 5.3 Database Alert Rules
**File:** `prometheus/alerts/database.yml`
```yaml
groups:
  - name: erp_database_alerts
    interval: 30s
    rules:
      # Database Down
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://wiki.erp-generic.com/runbooks/database-down"

      # Connection Pool Exhausted (aggregate both sides: numbackends is
      # per-database, max_connections is server-wide)
      - alert: ConnectionPoolExhausted
        expr: |
          (sum(pg_stat_database_numbackends) / max(pg_settings_max_connections)) > 0.9
        for: 2m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database connection pool almost exhausted"
          description: "Using {{ $value | humanizePercentage }} of max connections"
          runbook: "https://wiki.erp-generic.com/runbooks/connection-pool-exhausted"

      # Slow Queries (mean exec time is a gauge, so compare it directly)
      - alert: SlowQueries
        expr: |
          pg_stat_statements_mean_exec_time > 1000
        for: 10m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Slow database queries detected"
          description: "Mean query execution time is {{ $value }}ms (threshold: 1000ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/slow-queries"

      # High Number of Deadlocks
      - alert: HighDeadlocks
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 5
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "High number of database deadlocks"
          description: "Deadlock rate is {{ $value }}/s (threshold: 5/s)"

      # Replication Lag (if using replicas)
      - alert: ReplicationLag
        expr: |
          pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Database replication lag is high"
          description: "Replication lag is {{ $value }}s (threshold: 60s)"

      # Disk Space Low (Database)
      - alert: DatabaseDiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) < 0.15
        for: 5m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database disk space is low"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          runbook: "https://wiki.erp-generic.com/runbooks/database-disk-full"
### 5.4 Business Alert Rules
**File:** `prometheus/alerts/business.yml`
```yaml
groups:
  - name: erp_business_alerts
    interval: 1m
    rules:
      # No Sales Orders Created (Business Hours)
      # Note: hour() evaluates in UTC; offset for local business hours.
      - alert: NoSalesOrdersCreated
        expr: |
          sum(increase(erp_sales_orders_created_total[1h])) == 0
          and on() hour() >= 9 and on() hour() < 18
        for: 1h
        labels:
          severity: warning
          component: business
        annotations:
          summary: "No sales orders created in the last hour during business hours"
          description: "This might indicate a problem with the order creation system"

      # High Order Cancellation Rate (sum() so the status label on the
      # numerator doesn't block vector matching)
      - alert: HighOrderCancellationRate
        expr: |
          sum(rate(erp_sales_orders_created_total{status="cancelled"}[1h])) / sum(rate(erp_sales_orders_created_total[1h])) > 0.2
        for: 30m
        labels:
          severity: warning
          component: business
        annotations:
          summary: "High order cancellation rate"
          description: "{{ $value | humanizePercentage }} of orders are being cancelled (threshold: 20%)"

      # Failed Login Spike
      - alert: FailedLoginSpike
        expr: |
          sum(rate(erp_login_failures_total[5m])) > 10
        for: 5m
        labels:
          severity: warning
          component: security
        annotations:
          summary: "Spike in failed login attempts"
          description: "{{ $value }} failed logins per second (threshold: 10/s). Possible brute-force attack."
          runbook: "https://wiki.erp-generic.com/runbooks/brute-force-attack"
---
## 6. LOGGING STRATEGY
### 6.1 Winston Configuration
**File:** `backend/src/common/logger/logger.service.ts`
```typescript
import { Injectable, LoggerService as NestLoggerService } from '@nestjs/common';
import * as winston from 'winston';
import 'winston-daily-rotate-file';

@Injectable()
export class LoggerService implements NestLoggerService {
  private logger: winston.Logger;

  constructor() {
    this.logger = winston.createLogger({
      level: process.env.LOG_LEVEL || 'info',
      format: winston.format.combine(
        winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }),
        winston.format.errors({ stack: true }),
        winston.format.splat(),
        winston.format.json(),
      ),
      defaultMeta: {
        service: 'erp-generic-backend',
        environment: process.env.NODE_ENV,
      },
      transports: [
        // Console transport (for development)
        new winston.transports.Console({
          format: winston.format.combine(
            winston.format.colorize(),
            winston.format.printf(({ timestamp, level, message, context, ...meta }) => {
              return `${timestamp} [${level}] [${context || 'Application'}] ${message} ${
                Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
              }`;
            }),
          ),
        }),
        // File transport - All logs
        new winston.transports.DailyRotateFile({
          filename: 'logs/application-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '14d',
          zippedArchive: true,
        }),
        // File transport - Error logs only
        new winston.transports.DailyRotateFile({
          level: 'error',
          filename: 'logs/error-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '30d',
          zippedArchive: true,
        }),
        // File transport - Audit logs (security events)
        new winston.transports.DailyRotateFile({
          filename: 'logs/audit-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '50m',
          maxFiles: '90d', // Keep for 90 days (compliance)
          zippedArchive: true,
        }),
      ],
    });

    // Add Elasticsearch/Loki transport for production
    if (process.env.NODE_ENV === 'production') {
      // Example: Winston-Elasticsearch
      // this.logger.add(new WinstonElasticsearch({
      //   level: 'info',
      //   clientOpts: {
      //     node: process.env.ELASTICSEARCH_URL,
      //     auth: {
      //       username: process.env.ELASTICSEARCH_USER,
      //       password: process.env.ELASTICSEARCH_PASSWORD,
      //     },
      //   },
      //   index: 'erp-generic-logs',
      // }));
    }
  }

  log(message: string, context?: string, meta?: any) {
    this.logger.info(message, { context, ...meta });
  }

  error(message: string, trace?: string, context?: string, meta?: any) {
    this.logger.error(message, { trace, context, ...meta });
  }

  warn(message: string, context?: string, meta?: any) {
    this.logger.warn(message, { context, ...meta });
  }

  debug(message: string, context?: string, meta?: any) {
    this.logger.debug(message, { context, ...meta });
  }

  verbose(message: string, context?: string, meta?: any) {
    this.logger.verbose(message, { context, ...meta });
  }

  // Audit logging (security-sensitive events).
  // Note: as configured, the audit transport receives all info-level logs;
  // add a filter format if audit files should contain AUDIT_EVENT only.
  audit(event: string, userId: string, tenantId: string, details: any) {
    this.logger.info('AUDIT_EVENT', {
      event,
      userId,
      tenantId,
      details,
      timestamp: new Date().toISOString(),
      ip: details.ip,
      userAgent: details.userAgent,
    });
  }
}
```
### 6.2 Structured Logging Examples
```typescript
// Login attempt
logger.audit('USER_LOGIN', userId, tenantId, {
method: 'email',
ip: request.ip,
userAgent: request.headers['user-agent'],
success: true,
});
// Database query
logger.debug('DB_QUERY', 'DatabaseService', {
operation: 'SELECT',
table: 'auth.users',
duration: 45, // ms
rowCount: 1,
});
// API request
logger.info('HTTP_REQUEST', 'HttpMiddleware', {
method: 'POST',
path: '/api/sales/orders',
statusCode: 201,
duration: 234, // ms
userId: '123e4567-e89b-12d3-a456-426614174000',
tenantId: 'tenant-abc',
});
// Error with stack trace
logger.error('ORDER_CREATION_FAILED', error.stack, 'OrderService', {
orderId: '123',
tenantId: 'tenant-abc',
error: error.message,
});
```
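Structured metadata makes it easy to accidentally write secrets (passwords, tokens, auth headers) into log files that are retained for weeks. A minimal redaction helper — a hypothetical addition, not part of the `LoggerService` above — could scrub metadata before it reaches the logger:

```typescript
// Hypothetical helper: mask sensitive keys before passing metadata to the logger.
const SENSITIVE_KEYS = ['password', 'token', 'authorization', 'secret'];

export function redactSensitive(meta: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(meta)) {
    if (SENSITIVE_KEYS.includes(key.toLowerCase())) {
      out[key] = '***REDACTED***';
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      // Recurse into nested objects so deeply nested secrets are masked too
      out[key] = redactSensitive(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}

// Usage: logger.log('HTTP_REQUEST', 'HttpMiddleware', redactSensitive(meta));
```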
### 6.3 Log Aggregation (ELK Stack)
**Docker Compose for ELK Stack:**
```yaml
version: '3.9'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
container_name: erp-elasticsearch
environment:
- discovery.type=single-node
- ES_JAVA_OPTS=-Xms2g -Xmx2g
- xpack.security.enabled=false
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
networks:
- monitoring
restart: always
logstash:
image: docker.elastic.co/logstash/logstash:8.10.0
container_name: erp-logstash
volumes:
- ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
ports:
- "5044:5044"
environment:
LS_JAVA_OPTS: "-Xmx512m -Xms512m"
networks:
- monitoring
depends_on:
- elasticsearch
restart: always
kibana:
image: docker.elastic.co/kibana/kibana:8.10.0
container_name: erp-kibana
ports:
- "5601:5601"
environment:
ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
networks:
- monitoring
depends_on:
- elasticsearch
restart: always
volumes:
elasticsearch_data:
networks:
monitoring:
external: true
name: erp-monitoring
```
**Logstash Configuration:**
```conf
input {
file {
path => "/var/log/erp-generic/application-*.log"
type => "application"
codec => json
start_position => "beginning"
}
file {
path => "/var/log/erp-generic/error-*.log"
type => "error"
codec => json
start_position => "beginning"
}
file {
path => "/var/log/erp-generic/audit-*.log"
type => "audit"
codec => json
start_position => "beginning"
}
}
filter {
# Parse timestamp
date {
match => [ "timestamp", "ISO8601" ]
target => "@timestamp"
}
# Add geoip for IP addresses
if [ip] {
geoip {
source => "ip"
target => "geoip"
}
}
# Extract tenant_id as a field
if [tenantId] {
mutate {
add_field => { "tenant" => "%{tenantId}" }
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "erp-generic-logs-%{+YYYY.MM.dd}"
}
# Debug output (optional)
stdout {
codec => rubydebug
}
}
```
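The `%{+YYYY.MM.dd}` sprintf pattern in the output block creates one index per day, named from the event's `@timestamp` in UTC. Retention or query tooling that needs to address a specific day's index can derive the same name — a small illustrative helper, not part of the stack above:

```typescript
// Mirror Logstash's daily index naming: <prefix>-YYYY.MM.dd (UTC, zero-padded).
export function dailyIndexName(date: Date, prefix = 'erp-generic-logs'): string {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  const d = String(date.getUTCDate()).padStart(2, '0');
  return `${prefix}-${y}.${m}.${d}`;
}
```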
---
## 7. APPLICATION PERFORMANCE MONITORING (APM)
### 7.1 Custom Metrics Endpoints
**File:** `backend/src/metrics/metrics.controller.ts`
```typescript
import { Controller, Get } from '@nestjs/common';
import { MetricsService } from '../common/metrics/metrics.service';
import { PrismaService } from '../common/prisma/prisma.service';
@Controller('metrics')
export class MetricsController {
constructor(
private metricsService: MetricsService,
private prisma: PrismaService,
) {}
@Get()
getMetrics() {
return this.metricsService.getMetrics();
}
@Get('business')
async getBusinessMetrics() {
// Aggregate business metrics from database
const [salesOrders, purchaseOrders, invoices, activeUsers] = await Promise.all([
this.prisma.salesOrder.count(),
this.prisma.purchaseOrder.count(),
this.prisma.invoice.count(),
this.prisma.user.count({ where: { status: 'active' } }),
]);
return {
sales_orders_total: salesOrders,
purchase_orders_total: purchaseOrders,
invoices_total: invoices,
active_users_total: activeUsers,
};
}
}
```
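Note that `/metrics/business` returns JSON, while Prometheus scrapes the plain-text exposition format. If these gauges should be scrapable directly, they can be rendered as exposition text — a hedged sketch of the format only (the `prom-client` registry behind `MetricsService` is not shown in this document):

```typescript
// Render a flat map of gauge values in the Prometheus text exposition format:
// a "# TYPE" line followed by "<name> <value>" for each metric.
export function toPrometheusText(metrics: Record<string, number>): string {
  const lines: string[] = [];
  for (const [name, value] of Object.entries(metrics)) {
    lines.push(`# TYPE ${name} gauge`);
    lines.push(`${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}
```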
### 7.2 Performance Profiling
**Prisma Query Logging:**
```typescript
// prisma/prisma.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { LoggerService } from '../logger/logger.service';
@Injectable()
export class PrismaService extends PrismaClient implements OnModuleInit {
constructor(private logger: LoggerService) {
super({
log: [
{ emit: 'event', level: 'query' },
{ emit: 'event', level: 'error' },
{ emit: 'event', level: 'warn' },
],
});
// Log slow queries (>100ms)
this.$on('query' as never, (e: any) => {
if (e.duration > 100) {
this.logger.warn('SLOW_QUERY', 'PrismaService', {
query: e.query,
duration: e.duration,
params: e.params,
});
}
});
// Log query errors
this.$on('error' as never, (e: any) => {
this.logger.error('DB_ERROR', e.message, 'PrismaService', {
target: e.target,
});
});
}
async onModuleInit() {
await this.$connect();
}
}
```
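Beyond the fixed 100 ms threshold above, query durations are commonly recorded as a histogram so p95/p99 can be derived in Prometheus. A minimal sketch of the cumulative bucket counting that a client library such as `prom-client` performs (illustrative only, with hypothetical bucket bounds):

```typescript
// Cumulative histogram buckets, as Prometheus histograms record them:
// each bucket counts observations with duration <= its upper bound ("le").
const BUCKETS_MS = [10, 50, 100, 250, 500, 1000];

export function observeDurations(durationsMs: number[]): Map<number | '+Inf', number> {
  const counts = new Map<number | '+Inf', number>();
  for (const le of BUCKETS_MS) counts.set(le, 0);
  counts.set('+Inf', 0);
  for (const d of durationsMs) {
    for (const le of BUCKETS_MS) {
      if (d <= le) counts.set(le, (counts.get(le) ?? 0) + 1);
    }
    // The +Inf bucket counts every observation (it is the histogram's total count)
    counts.set('+Inf', (counts.get('+Inf') ?? 0) + 1);
  }
  return counts;
}
```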
---
## 8. HEALTH CHECKS
### 8.1 Health Check Endpoints
```typescript
// health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import { HealthCheck, HealthCheckService, PrismaHealthIndicator, MemoryHealthIndicator, DiskHealthIndicator } from '@nestjs/terminus';
import { RedisHealthIndicator } from './redis.health';
import { PrismaService } from '../common/prisma/prisma.service';
@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: PrismaHealthIndicator,
    private prisma: PrismaService,
    private redis: RedisHealthIndicator,
    private memory: MemoryHealthIndicator,
    private disk: DiskHealthIndicator,
  ) {}
  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // PrismaHealthIndicator.pingCheck requires the Prisma client instance as its second argument
      () => this.db.pingCheck('database', this.prisma, { timeout: 3000 }),
      () => this.redis.isHealthy('redis'),
      () => this.memory.checkHeap('memory_heap', 200 * 1024 * 1024),
      () => this.disk.checkStorage('disk', { path: '/', thresholdPercent: 0.9 }),
    ]);
  }
  @Get('live')
  liveness() {
    return { status: 'ok', timestamp: new Date().toISOString() };
  }
  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', this.prisma),
      () => this.redis.isHealthy('redis'),
    ]);
  }
}
```
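Terminus reports overall `status: 'ok'` only when every indicator passes, and orchestrators key readiness off that top-level field. The aggregation can be sketched as follows — an illustrative reimplementation of the response shape, not the Terminus source:

```typescript
// Each check yields a keyed result, e.g. { database: { status: 'up' } }.
type IndicatorResult = Record<string, { status: 'up' | 'down'; [k: string]: unknown }>;

// Combine individual checks into the { status, info, error, details } shape
// that a Terminus health endpoint returns.
export function aggregate(checks: IndicatorResult[]) {
  const info: IndicatorResult = {};
  const error: IndicatorResult = {};
  for (const check of checks) {
    for (const [name, detail] of Object.entries(check)) {
      (detail.status === 'up' ? info : error)[name] = detail;
    }
  }
  return {
    status: Object.keys(error).length === 0 ? 'ok' : 'error',
    info,
    error,
    details: { ...info, ...error },
  };
}
```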
---
## 9. DISTRIBUTED TRACING
### 9.1 OpenTelemetry Setup
```typescript
// tracing.ts (Bootstrap file)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
const sdk = new NodeSDK({
traceExporter: new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://jaeger:14268/api/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
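With auto-instrumentation enabled, HTTP calls between services carry the W3C `traceparent` header, which is what ties spans from different services into one trace. Its layout (`version-traceid-parentid-flags`) can be parsed as follows — an illustrative sketch for debugging propagation, not part of the OpenTelemetry API:

```typescript
// Parse a W3C traceparent header:
// version (2 hex) "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex).
export function parseTraceparent(header: string) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```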
---
## 10. ON-CALL & INCIDENT RESPONSE
### 10.1 On-Call Rotation
- **Primary On-Call:** DevOps Engineer (24/7)
- **Secondary On-Call:** Backend Lead
- **Escalation Path:** CTO → CEO
### 10.2 Incident Severity
| Severity | Response Time | Examples |
|----------|---------------|----------|
| **P0 (Critical)** | 15 min | System down, data loss |
| **P1 (High)** | 1 hour | Major feature broken |
| **P2 (Medium)** | 4 hours | Minor feature broken |
| **P3 (Low)** | 24 hours | Cosmetic issue |
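The response-time targets above can also be enforced in paging automation. A hypothetical helper (the table encoded as data, names are illustrative) that computes the acknowledgement deadline for an incident:

```typescript
// Response-time targets from the severity table, in minutes.
const RESPONSE_SLA_MINUTES: Record<string, number> = {
  P0: 15,   // Critical: system down, data loss
  P1: 60,   // High: major feature broken
  P2: 240,  // Medium: minor feature broken
  P3: 1440, // Low: cosmetic issue
};

// Deadline by which the on-call engineer must acknowledge the incident.
export function ackDeadline(severity: 'P0' | 'P1' | 'P2' | 'P3', openedAt: Date): Date {
  return new Date(openedAt.getTime() + RESPONSE_SLA_MINUTES[severity] * 60_000);
}
```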
---
## 11. REFERENCES
- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
---
**Document:** MONITORING-OBSERVABILITY.md
**Version:** 1.0
**Total Pages:** ~18
**Last Updated:** 2025-11-24