# MONITORING & OBSERVABILITY - ERP Generic

**Last updated:** 2025-11-24
**Owner:** DevOps Team / SRE Team
**Status:** ✅ Production-Ready

---

## TABLE OF CONTENTS

1. [Overview](#1-overview)
2. [Observability Pillars](#2-observability-pillars)
3. [Prometheus Setup](#3-prometheus-setup)
4. [Grafana Dashboards](#4-grafana-dashboards)
5. [Alert Rules](#5-alert-rules)
6. [Logging Strategy](#6-logging-strategy)
7. [Application Performance Monitoring (APM)](#7-application-performance-monitoring-apm)
8. [Health Checks](#8-health-checks)
9. [Distributed Tracing](#9-distributed-tracing)
10. [On-Call & Incident Response](#10-on-call--incident-response)
11. [References](#11-references)

---

## 1. OVERVIEW

### 1.1 Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         Application Layer                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐          │
│  │ Backend  │   │ Frontend │   │ Postgres │   │  Redis   │          │
│  │ (Metrics)│   │ (Metrics)│   │(Exporter)│   │(Exporter)│          │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘          │
│       │              │              │              │                │
│       └──────────────┴──────────────┴──────────────┘                │
│                             │                                       │
└─────────────────────────────┼───────────────────────────────────────┘
                              │ (Scrape metrics every 15s)
                              ↓
┌─────────────────────────────────────────────────────────────────────┐
│                        Prometheus (TSDB)                            │
│  - Collects metrics from all targets                                │
│  - Evaluates alert rules                                            │
│  - Stores time-series data (15 days retention)                      │
└────────┬───────────────────────────────┬────────────────────────────┘
         │                               │
         │ (Query metrics)               │ (Send alerts)
         ↓                               ↓
┌─────────────────────┐          ┌──────────────────────┐
│       Grafana       │          │     Alertmanager     │
│  - Dashboards       │          │  - Route alerts      │
│  - Visualization    │          │  - Deduplication     │
│  - Alerting         │          │  - Silencing         │
└─────────────────────┘          └──────┬───────────────┘
                                        │
                    ┌───────────────────┼────────────────┐
                    ↓                   ↓                ↓
              ┌──────────┐        ┌──────────┐    ┌──────────┐
              │ PagerDuty│        │  Slack   │    │  Email   │
              │(On-call) │        │(#alerts) │    │ (Team)   │
              └──────────┘        └──────────┘    └──────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                          Logging Pipeline                           │
│                                                                     │
│  Application → Winston → ELK Stack / Loki                           │
│                                                                     │
│  ┌──────────┐      ┌──────────────┐      ┌──────────┐               │
│  │  Logs    │ ───→ │ Elasticsearch│ ───→ │  Kibana  │               │
│  │ (JSON)   │      │   or Loki    │      │ (Search) │               │
│  └──────────┘      └──────────────┘      └──────────┘               │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                        Distributed Tracing                          │
│                                                                     │
│  Application → OpenTelemetry → Jaeger / Tempo                       │
│  (Trace spans for requests across microservices)                    │
└─────────────────────────────────────────────────────────────────────┘
```

### 1.2 Observability Goals

**Why Observability?**

- **Proactive Monitoring:** Detect issues before users report them
- **Faster Debugging:** Reduce MTTD (Mean Time to Detect) from hours to minutes
- **Performance Optimization:** Identify bottlenecks and slow queries
- **Capacity Planning:** Predict when to scale resources
- **SLA Compliance:** Monitor uptime, response times, error rates

**Key Metrics (Google's Four Golden Signals):**

1. **Latency:** Request/response time (p50, p95, p99)
2. **Traffic:** Requests per second (throughput)
3. **Errors:** Error rate (5xx responses, exceptions)
4. **Saturation:** Resource utilization (CPU, memory, disk, DB connections)
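
In terms of the metric names this document defines in §3.3 (plus standard node_exporter metrics), each golden signal maps roughly to a PromQL query like the following — a sketch, assuming the `status_code` label values follow HTTP conventions:

```promql
# Latency: p95 request duration
histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m]))

# Traffic: total request throughput
sum(rate(erp_http_requests_total[5m]))

# Errors: ratio of 5xx responses to all responses
sum(rate(erp_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(erp_http_requests_total[5m]))

# Saturation: CPU busy percentage per host
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```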

**SLOs (Service Level Objectives):**

- **Availability:** 99.9% uptime (8.76 hours downtime/year)
- **Latency:** p95 API response < 300ms
- **Error Budget:** <0.1% error rate
- **Data Durability:** Zero data loss
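
As a sanity check on the availability figure, the allowed downtime for a given SLO can be computed directly (a standalone sketch, not project code):

```typescript
// Allowed downtime per year implied by an availability SLO.
function allowedDowntimeHoursPerYear(availability: number): number {
  const hoursPerYear = 365 * 24; // 8760 h
  return hoursPerYear * (1 - availability);
}

// 99.9% availability leaves a budget of 8.76 hours of downtime per year.
console.log(allowedDowntimeHoursPerYear(0.999).toFixed(2)); // "8.76"
```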

---

## 2. OBSERVABILITY PILLARS

### 2.1 The Three Pillars

**1. Metrics (What is happening?)**

- Quantitative measurements over time
- Examples: CPU usage, request count, response time
- Tool: Prometheus + Grafana

**2. Logs (What happened?)**

- Discrete events with context
- Examples: "User X logged in", "Query took 2.5s"
- Tool: Winston + ELK Stack / Loki

**3. Traces (Why did it happen?)**

- Request flow across services
- Examples: API call → Database query → Redis cache → Response
- Tool: OpenTelemetry + Jaeger

### 2.2 Correlation

```
Example: High p99 latency alert
├── Metrics: p99 latency = 3s (threshold: 500ms)
│   └── Which endpoint? /api/products
│
├── Logs: Search for slow queries in /api/products
│   └── Found: SELECT * FROM inventory.stock_movements (2.8s)
│
└── Traces: Trace ID abc123 shows:
    ├── API handler: 50ms
    ├── Database query: 2800ms ← Bottleneck!
    └── Response serialization: 150ms

Root cause: Missing index on inventory.stock_movements(product_id)
Fix: CREATE INDEX idx_stock_movements_product_id ON inventory.stock_movements(product_id);
```

---

## 3. PROMETHEUS SETUP

### 3.1 Prometheus Configuration

**File:** `prometheus/prometheus.yml`

```yaml
global:
  scrape_interval: 15s       # Scrape targets every 15 seconds
  evaluation_interval: 15s   # Evaluate rules every 15 seconds
  scrape_timeout: 10s
  external_labels:
    cluster: 'erp-generic-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s

# Load alert rules
rule_files:
  - '/etc/prometheus/alerts/application.yml'
  - '/etc/prometheus/alerts/infrastructure.yml'
  - '/etc/prometheus/alerts/database.yml'
  - '/etc/prometheus/alerts/business.yml'

# Scrape configurations
scrape_configs:
  # Backend API (NestJS with Prometheus middleware)
  - job_name: 'erp-backend'
    static_configs:
      - targets: ['backend:3000']
        labels:
          service: 'backend'
          component: 'api'
    metrics_path: '/metrics'
    scrape_interval: 15s

  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          service: 'database'
          component: 'postgres'
    scrape_interval: 30s

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          service: 'cache'
          component: 'redis'
    scrape_interval: 30s

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'infrastructure'
          component: 'host'
    scrape_interval: 15s

  # Frontend (Nginx metrics)
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          service: 'frontend'
          component: 'nginx'
    scrape_interval: 30s

  # Prometheus itself (meta-monitoring)
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'
          component: 'prometheus'
```
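
Static target lists like the above require a config change whenever a host is added or removed. Prometheus's file-based service discovery is one way to avoid that — a sketch (the target file path is illustrative):

```yaml
  # Alternative: discover backend targets from JSON files that can be
  # rewritten by deployment tooling without touching prometheus.yml.
  - job_name: 'erp-backend'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/backend-*.json
        refresh_interval: 30s
```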

### 3.2 Docker Compose for Monitoring Stack

**File:** `docker-compose.monitoring.yml`

```yaml
version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: erp-prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: always

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: erp-alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: always

  grafana:
    image: grafana/grafana:10.1.0
    container_name: erp-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel
      - GF_SERVER_ROOT_URL=https://grafana.erp-generic.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=${SMTP_HOST}:${SMTP_PORT}
      - GF_SMTP_USER=${SMTP_USER}
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3001:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus
    restart: always

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.14.0
    container_name: erp-postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring
      - erp-network
    restart: always

  redis-exporter:
    image: oliver006/redis_exporter:v1.54.0
    container_name: erp-redis-exporter
    environment:
      REDIS_ADDR: "redis:6379"
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    ports:
      - "9121:9121"
    networks:
      - monitoring
      - erp-network
    restart: always

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: erp-node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: always

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    name: erp-monitoring
  erp-network:
    external: true
    name: erp-network-internal
```
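
The stack above defines no container health checks. One common addition is a check against Prometheus's built-in `/-/healthy` endpoint — a sketch, assuming `wget` is available inside the image:

```yaml
  prometheus:
    # ... as above, plus:
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
```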

### 3.3 Backend Metrics Instrumentation

**File:** `backend/src/common/metrics/metrics.module.ts`

```typescript
import { Module } from '@nestjs/common';
import { PrometheusModule } from '@willsoto/nestjs-prometheus';
import { MetricsService } from './metrics.service';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: {
        enabled: true,
        config: {
          prefix: 'erp_',
        },
      },
    }),
  ],
  providers: [MetricsService],
  exports: [MetricsService],
})
export class MetricsModule {}
```

**File:** `backend/src/common/metrics/metrics.service.ts`

```typescript
import { Injectable } from '@nestjs/common';
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

@Injectable()
export class MetricsService {
  private registry: Registry;

  // HTTP Metrics
  private httpRequestDuration: Histogram;
  private httpRequestTotal: Counter;
  private httpRequestErrors: Counter;

  // Database Metrics
  private dbQueryDuration: Histogram;
  private dbConnectionsActive: Gauge;
  private dbQueryErrors: Counter;

  // Business Metrics
  private salesOrdersCreated: Counter;
  private purchaseOrdersCreated: Counter;
  private invoicesGenerated: Counter;
  private inventoryMovements: Counter;

  // Cache Metrics
  private cacheHits: Counter;
  private cacheMisses: Counter;

  // Authentication Metrics
  private loginAttempts: Counter;
  private loginFailures: Counter;
  private activeUsers: Gauge;

  constructor() {
    this.registry = new Registry();
    this.initializeMetrics();
  }

  private initializeMetrics() {
    // HTTP Request Duration
    this.httpRequestDuration = new Histogram({
      name: 'erp_http_request_duration_seconds',
      help: 'Duration of HTTP requests in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
    });

    // HTTP Request Total
    this.httpRequestTotal = new Counter({
      name: 'erp_http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
    });

    // HTTP Request Errors
    this.httpRequestErrors = new Counter({
      name: 'erp_http_request_errors_total',
      help: 'Total number of HTTP request errors',
      labelNames: ['method', 'route', 'error_type'],
    });

    // Database Query Duration
    this.dbQueryDuration = new Histogram({
      name: 'erp_db_query_duration_seconds',
      help: 'Duration of database queries in seconds',
      labelNames: ['operation', 'table'],
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2],
    });

    // Database Active Connections
    this.dbConnectionsActive = new Gauge({
      name: 'erp_db_connections_active',
      help: 'Number of active database connections',
    });

    // Database Query Errors
    this.dbQueryErrors = new Counter({
      name: 'erp_db_query_errors_total',
      help: 'Total number of database query errors',
      labelNames: ['operation', 'error_type'],
    });

    // Business Metrics - Sales Orders
    this.salesOrdersCreated = new Counter({
      name: 'erp_sales_orders_created_total',
      help: 'Total number of sales orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Purchase Orders
    this.purchaseOrdersCreated = new Counter({
      name: 'erp_purchase_orders_created_total',
      help: 'Total number of purchase orders created',
      labelNames: ['tenant_id', 'status'],
    });

    // Business Metrics - Invoices
    this.invoicesGenerated = new Counter({
      name: 'erp_invoices_generated_total',
      help: 'Total number of invoices generated',
      labelNames: ['tenant_id', 'type'],
    });

    // Business Metrics - Inventory Movements
    this.inventoryMovements = new Counter({
      name: 'erp_inventory_movements_total',
      help: 'Total number of inventory movements',
      labelNames: ['tenant_id', 'type'],
    });

    // Cache Hits
    this.cacheHits = new Counter({
      name: 'erp_cache_hits_total',
      help: 'Total number of cache hits',
      labelNames: ['cache_key'],
    });

    // Cache Misses
    this.cacheMisses = new Counter({
      name: 'erp_cache_misses_total',
      help: 'Total number of cache misses',
      labelNames: ['cache_key'],
    });

    // Login Attempts
    this.loginAttempts = new Counter({
      name: 'erp_login_attempts_total',
      help: 'Total number of login attempts',
      labelNames: ['tenant_id', 'method'],
    });

    // Login Failures
    this.loginFailures = new Counter({
      name: 'erp_login_failures_total',
      help: 'Total number of failed login attempts',
      labelNames: ['tenant_id', 'reason'],
    });

    // Active Users
    this.activeUsers = new Gauge({
      name: 'erp_active_users',
      help: 'Number of currently active users',
      labelNames: ['tenant_id'],
    });

    // Register all metrics
    this.registry.registerMetric(this.httpRequestDuration);
    this.registry.registerMetric(this.httpRequestTotal);
    this.registry.registerMetric(this.httpRequestErrors);
    this.registry.registerMetric(this.dbQueryDuration);
    this.registry.registerMetric(this.dbConnectionsActive);
    this.registry.registerMetric(this.dbQueryErrors);
    this.registry.registerMetric(this.salesOrdersCreated);
    this.registry.registerMetric(this.purchaseOrdersCreated);
    this.registry.registerMetric(this.invoicesGenerated);
    this.registry.registerMetric(this.inventoryMovements);
    this.registry.registerMetric(this.cacheHits);
    this.registry.registerMetric(this.cacheMisses);
    this.registry.registerMetric(this.loginAttempts);
    this.registry.registerMetric(this.loginFailures);
    this.registry.registerMetric(this.activeUsers);
  }

  // Public methods to record metrics
  recordHttpRequest(method: string, route: string, statusCode: number, duration: number) {
    this.httpRequestDuration.observe({ method, route, status_code: statusCode }, duration);
    this.httpRequestTotal.inc({ method, route, status_code: statusCode });
  }

  recordHttpError(method: string, route: string, errorType: string) {
    this.httpRequestErrors.inc({ method, route, error_type: errorType });
  }

  recordDbQuery(operation: string, table: string, duration: number) {
    this.dbQueryDuration.observe({ operation, table }, duration);
  }

  recordDbError(operation: string, errorType: string) {
    this.dbQueryErrors.inc({ operation, error_type: errorType });
  }

  setDbConnectionsActive(count: number) {
    this.dbConnectionsActive.set(count);
  }

  recordSalesOrder(tenantId: string, status: string) {
    this.salesOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordPurchaseOrder(tenantId: string, status: string) {
    this.purchaseOrdersCreated.inc({ tenant_id: tenantId, status });
  }

  recordInvoice(tenantId: string, type: string) {
    this.invoicesGenerated.inc({ tenant_id: tenantId, type });
  }

  recordInventoryMovement(tenantId: string, type: string) {
    this.inventoryMovements.inc({ tenant_id: tenantId, type });
  }

  recordCacheHit(key: string) {
    this.cacheHits.inc({ cache_key: key });
  }

  recordCacheMiss(key: string) {
    this.cacheMisses.inc({ cache_key: key });
  }

  recordLoginAttempt(tenantId: string, method: string) {
    this.loginAttempts.inc({ tenant_id: tenantId, method });
  }

  recordLoginFailure(tenantId: string, reason: string) {
    this.loginFailures.inc({ tenant_id: tenantId, reason });
  }

  setActiveUsers(tenantId: string, count: number) {
    this.activeUsers.set({ tenant_id: tenantId }, count);
  }

  // Note: registry.metrics() returns a Promise in prom-client v13+
  async getMetrics(): Promise<string> {
    return this.registry.metrics();
  }
}
```
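
The `buckets` arrays above define *cumulative* histogram buckets: each observation increments every bucket whose upper bound is greater than or equal to the observed value, plus the implicit `+Inf` bucket. A standalone sketch of that behavior (independent of prom-client):

```typescript
// Cumulative bucket counting, as Prometheus histograms do it.
// bounds: bucket upper bounds; result has one extra slot for +Inf.
function observeAll(values: number[], bounds: number[]): number[] {
  const counts = new Array(bounds.length + 1).fill(0);
  for (const v of values) {
    bounds.forEach((b, i) => {
      if (v <= b) counts[i]++; // every bucket with bound >= v counts it
    });
    counts[bounds.length]++; // +Inf catches everything
  }
  return counts;
}

// Latencies in seconds against the HTTP duration buckets used above.
const counts = observeAll([0.02, 0.07, 0.4, 3], [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5]);
console.log(counts); // [0, 1, 2, 2, 3, 3, 3, 4, 4]
```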

**File:** `backend/src/common/interceptors/metrics.interceptor.ts`

```typescript
import { Injectable, NestInterceptor, ExecutionContext, CallHandler } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from '../metrics/metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private metricsService: MetricsService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const startTime = Date.now();

    return next.handle().pipe(
      tap({
        next: () => {
          const response = context.switchToHttp().getResponse();
          const duration = (Date.now() - startTime) / 1000; // Convert to seconds

          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            response.statusCode,
            duration,
          );
        },
        error: (error) => {
          const duration = (Date.now() - startTime) / 1000;
          // The response status is not yet set when an exception reaches the
          // interceptor, so prefer the status carried by Nest HttpExceptions.
          const statusCode =
            typeof error.getStatus === 'function' ? error.getStatus() : 500;

          this.metricsService.recordHttpRequest(
            request.method,
            request.route?.path || request.url,
            statusCode,
            duration,
          );

          this.metricsService.recordHttpError(
            request.method,
            request.route?.path || request.url,
            error.name || 'UnknownError',
          );
        },
      }),
    );
  }
}
```

---

## 4. GRAFANA DASHBOARDS

### 4.1 Dashboard Provisioning

**File:** `grafana/provisioning/datasources/prometheus.yml`

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      queryTimeout: "60s"
      httpMethod: "POST"
```

**File:** `grafana/provisioning/dashboards/dashboard-provider.yml`

```yaml
apiVersion: 1

providers:
  - name: 'ERP Generic Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```

### 4.2 Dashboard 1: Application Performance

**File:** `grafana/dashboards/application-performance.json` (simplified structure)

```json
{
  "dashboard": {
    "title": "ERP Generic - Application Performance",
    "tags": ["erp", "application", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "P95 Latency (ms)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m])) * 1000",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 300, "color": "yellow" },
          { "value": 500, "color": "red" }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(erp_http_request_errors_total[5m]) / rate(erp_http_requests_total[5m]) * 100",
            "legendFormat": "{{route}}"
          }
        ],
        "thresholds": [
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      },
      {
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, avg by (route) (erp_http_request_duration_seconds))",
            "format": "table"
          }
        ]
      },
      {
        "title": "Active Users by Tenant",
        "type": "graph",
        "targets": [
          {
            "expr": "erp_active_users",
            "legendFormat": "{{tenant_id}}"
          }
        ]
      },
      {
        "title": "Cache Hit Rate (%)",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(erp_cache_hits_total[5m]) / (rate(erp_cache_hits_total[5m]) + rate(erp_cache_misses_total[5m])) * 100"
          }
        ]
      }
    ]
  }
}
```

**Key Panels:**

1. **Request Rate:** Total requests per second (by method and route)
2. **P95 Latency:** 95th percentile response time (threshold: 300ms yellow, 500ms red)
3. **Error Rate:** Percentage of failed requests (threshold: 1% yellow, 5% red)
4. **Top 10 Slowest Endpoints:** Identify performance bottlenecks
5. **Active Users by Tenant:** Real-time active user count per tenant
6. **Cache Hit Rate:** Percentage of cache hits (target: >80%)
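
The `histogram_quantile` function behind the P95 panel estimates the quantile by linear interpolation inside the bucket containing the target rank. A standalone sketch of that estimate (a simplification; Prometheus handles edge cases beyond this):

```typescript
// Estimate a quantile from cumulative bucket counts, Prometheus-style.
// bounds: bucket upper bounds; counts: cumulative counts, last slot = +Inf.
function histogramQuantile(q: number, bounds: number[], counts: number[]): number {
  const total = counts[counts.length - 1];
  const rank = q * total;
  for (let i = 0; i < bounds.length; i++) {
    if (counts[i] >= rank) {
      const lower = i === 0 ? 0 : bounds[i - 1];
      const prev = i === 0 ? 0 : counts[i - 1];
      const inBucket = counts[i] - prev;
      // Linear interpolation within [lower, bounds[i]].
      return lower + ((bounds[i] - lower) * (rank - prev)) / inBucket;
    }
  }
  return bounds[bounds.length - 1]; // rank fell in the +Inf bucket
}

// 100 requests: 60 under 0.1s, 30 between 0.1s and 0.3s, 10 between 0.3s and 0.5s.
const p95 = histogramQuantile(0.95, [0.1, 0.3, 0.5], [60, 90, 100, 100]);
console.log(p95); // ≈ 0.4 — halfway through the 0.3–0.5s bucket
```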

### 4.3 Dashboard 2: Database Performance

**Key Panels:**

1. **Database Connections:** Active vs. max connections
2. **Query Duration P95:** 95th percentile query time by table
3. **Slow Queries:** Queries taking >1 second
4. **Transactions per Second:** TPS rate
5. **Database Size:** Disk usage by schema
6. **Index Usage:** Most and least used indexes
7. **Lock Waits:** Blocking queries
8. **Replication Lag:** Lag between primary and replicas (if applicable)

**Example Queries:**

```promql
# Active connections
pg_stat_database_numbackends{datname="erp_generic"}

# Slow queries (>1s)
rate(pg_stat_statements_mean_exec_time{datname="erp_generic"}[5m]) > 1000

# Database size
pg_database_size_bytes{datname="erp_generic"}

# TPS
rate(pg_stat_database_xact_commit{datname="erp_generic"}[5m]) + rate(pg_stat_database_xact_rollback{datname="erp_generic"}[5m])
```

### 4.4 Dashboard 3: Business Metrics

**Key Panels:**

1. **Sales Orders Created (Today):** Total sales orders by status
2. **Purchase Orders Created (Today):** Total purchase orders by status
3. **Revenue Trend (Last 30 days):** Daily revenue by tenant
4. **Invoices Generated (Today):** Total invoices by type (customer/supplier)
5. **Inventory Movements (Today):** Stock in/out movements
6. **Top 10 Customers by Revenue:** Revenue breakdown
7. **Order Fulfillment Rate:** Percentage of orders fulfilled on time
8. **Average Order Value:** Mean order value by tenant

**Example Queries:**

```promql
# Sales orders created today
increase(erp_sales_orders_created_total[1d])

# Revenue trend (requires custom metric)
sum by (tenant_id) (rate(erp_sales_order_amount_sum[1d]))

# Top 10 customers by revenue
topk(10, sum by (customer_id) (erp_sales_order_amount_sum))
```
---
|
|
|
|
## 5. ALERT RULES
|
|
|
|
### 5.1 Alertmanager Configuration
|
|
|
|
**File:** `alertmanager/alertmanager.yml`
|
|
|
|
```yaml
|
|
global:
|
|
resolve_timeout: 5m
|
|
smtp_smarthost: '${SMTP_HOST}:${SMTP_PORT}'
|
|
smtp_from: 'alertmanager@erp-generic.com'
|
|
smtp_auth_username: '${SMTP_USER}'
|
|
smtp_auth_password: '${SMTP_PASSWORD}'
|
|
slack_api_url: '${SLACK_WEBHOOK_URL}'
|
|
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
|
|
|
|
# Route alerts to different receivers
|
|
route:
|
|
receiver: 'default'
|
|
group_by: ['alertname', 'cluster', 'service']
|
|
group_wait: 10s
|
|
group_interval: 10s
|
|
repeat_interval: 12h
|
|
|
|
routes:
|
|
# Critical alerts → PagerDuty (on-call)
|
|
- receiver: 'pagerduty'
|
|
match:
|
|
severity: critical
|
|
continue: true
|
|
|
|
# All alerts → Slack
|
|
- receiver: 'slack'
|
|
match_re:
|
|
severity: critical|warning
|
|
|
|
# Database alerts → DBA team
|
|
- receiver: 'dba-email'
|
|
match:
|
|
component: postgres
|
|
|
|
# Security alerts → Security team
|
|
- receiver: 'security-email'
|
|
match_re:
|
|
alertname: '.*Security.*'
|
|
|
|
# Inhibition rules (suppress alerts)
|
|
inhibit_rules:
|
|
# Suppress warning if critical already firing
|
|
- source_match:
|
|
severity: 'critical'
|
|
target_match:
|
|
severity: 'warning'
|
|
equal: ['alertname', 'instance']
|
|
|
|
receivers:
|
|
- name: 'default'
|
|
email_configs:
|
|
- to: 'devops@erp-generic.com'
|
|
headers:
|
|
Subject: '[ERP Alert] {{ .GroupLabels.alertname }}'
|
|
|
|
- name: 'pagerduty'
|
|
pagerduty_configs:
|
|
- service_key: '${PAGERDUTY_SERVICE_KEY}'
|
|
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'
|
|
|
|
- name: 'slack'
|
|
slack_configs:
|
|
- channel: '#erp-alerts'
|
|
title: '{{ .GroupLabels.alertname }}'
|
|
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
|
|
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
|
|
|
|
- name: 'dba-email'
|
|
email_configs:
|
|
- to: 'dba@erp-generic.com'
|
|
headers:
|
|
Subject: '[Database Alert] {{ .GroupLabels.alertname }}'
|
|
|
|
- name: 'security-email'
|
|
email_configs:
|
|
- to: 'security@erp-generic.com'
|
|
headers:
|
|
Subject: '[SECURITY ALERT] {{ .GroupLabels.alertname }}'
|
|
Priority: 'urgent'
|
|
```
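
The routing tree above is evaluated top to bottom: a matching route stops evaluation unless it sets `continue: true`, which is why critical alerts reach both PagerDuty and Slack. A standalone sketch of that semantics (simplified to exact-match labels only, ignoring `match_re` and nested routes):

```typescript
interface Route {
  receiver: string;
  match?: Record<string, string>;
  continue?: boolean;
}

// First-match-wins with `continue` semantics, as in Alertmanager routes.
function resolveReceivers(
  labels: Record<string, string>,
  routes: Route[],
  fallback: string,
): string[] {
  const receivers: string[] = [];
  for (const route of routes) {
    const matches = Object.entries(route.match ?? {}).every(([k, v]) => labels[k] === v);
    if (matches) {
      receivers.push(route.receiver);
      if (!route.continue) return receivers; // stop unless continue: true
    }
  }
  return receivers.length ? receivers : [fallback]; // root receiver as fallback
}

const routes: Route[] = [
  { receiver: 'pagerduty', match: { severity: 'critical' }, continue: true },
  { receiver: 'slack', match: { severity: 'critical' } },
];
console.log(resolveReceivers({ severity: 'critical' }, routes, 'default')); // ["pagerduty","slack"]
console.log(resolveReceivers({ severity: 'info' }, routes, 'default'));     // ["default"]
```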

### 5.2 Application Alert Rules

**File:** `prometheus/alerts/application.yml`

```yaml
groups:
  - name: erp_application_alerts
    interval: 30s
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (rate(erp_http_request_errors_total[5m]) / rate(erp_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "High error rate detected on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-error-rate"

      # High P95 Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(erp_http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
          component: backend
        annotations:
          summary: "High P95 latency on {{ $labels.route }}"
          description: "P95 latency is {{ $value }}s (threshold: 500ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/high-latency"

      # Service Down
      - alert: ServiceDown
        expr: up{job="erp-backend"} == 0
        for: 2m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "Backend service is down"
          description: "Backend {{ $labels.instance }} has been down for more than 2 minutes"
          runbook: "https://wiki.erp-generic.com/runbooks/service-down"

      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% (threshold: 80%)"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }} (threshold: 85%)"

      # Disk Space Low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} free"

      # Too Many Requests (DDoS protection)
      - alert: TooManyRequests
        expr: |
          rate(erp_http_requests_total[1m]) > 10000
        for: 2m
        labels:
          severity: critical
          component: security
        annotations:
          summary: "Abnormally high request rate detected"
          description: "Request rate is {{ $value }} req/s (threshold: 10000 req/s). Possible DDoS attack."
          runbook: "https://wiki.erp-generic.com/runbooks/ddos-attack"

      # Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: |
          (rate(erp_cache_hits_total[5m]) / (rate(erp_cache_hits_total[5m]) + rate(erp_cache_misses_total[5m]))) < 0.6
        for: 15m
        labels:
          severity: warning
          component: cache
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }} (threshold: 60%)"
```
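
The 5% threshold in `HighErrorRate` sits well above the 0.1% error budget from the SLOs in §1.2. A useful derived quantity is the *burn rate* — how many times faster than budget the errors are being consumed (standalone sketch):

```typescript
// Burn rate: observed error ratio relative to the SLO error budget.
// A burn rate of 1 exhausts the budget exactly at the end of the SLO window.
function burnRate(errorRatio: number, errorBudget: number): number {
  return errorRatio / errorBudget;
}

// 0.5% observed errors against a 0.1% budget burns the budget 5x too fast.
console.log(burnRate(0.005, 0.001)); // ≈ 5
```

Multi-window burn-rate alerting (e.g. paging only when both a short and a long window burn fast) is a common refinement on top of simple ratio thresholds like the one above.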

### 5.3 Database Alert Rules

**File:** `prometheus/alerts/database.yml`

```yaml
groups:
  - name: erp_database_alerts
    interval: 30s
    rules:
      # Database Down
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "PostgreSQL is down"
          description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://wiki.erp-generic.com/runbooks/database-down"

      # Connection Pool Exhausted
      - alert: ConnectionPoolExhausted
        expr: |
          # numbackends carries a per-database datname label, so aggregate
          # per instance before dividing by the instance-level setting
          (sum by (instance) (pg_stat_database_numbackends) / on(instance) pg_settings_max_connections) > 0.9
        for: 2m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database connection pool almost exhausted"
          description: "{{ $labels.instance }} is using {{ $value | humanizePercentage }} of max connections"
          runbook: "https://wiki.erp-generic.com/runbooks/connection-pool-exhausted"

      # Slow Queries
      - alert: SlowQueries
        expr: |
          rate(pg_stat_statements_mean_exec_time[5m]) > 1000
        for: 10m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Slow database queries detected"
          description: "Mean query execution time is {{ $value }}ms (threshold: 1000ms)"
          runbook: "https://wiki.erp-generic.com/runbooks/slow-queries"

      # High Number of Deadlocks
      - alert: HighDeadlocks
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 5
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "High number of database deadlocks"
          description: "Deadlock rate is {{ $value }}/s (threshold: 5/s)"

      # Replication Lag (if using replicas)
      - alert: ReplicationLag
        expr: |
          pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
          component: postgres
        annotations:
          summary: "Database replication lag is high"
          description: "Replication lag is {{ $value }}s (threshold: 60s)"

      # Disk Space Low (Database)
      - alert: DatabaseDiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) < 0.15
        for: 5m
        labels:
          severity: critical
          component: postgres
        annotations:
          summary: "Database disk space is low"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          runbook: "https://wiki.erp-generic.com/runbooks/database-disk-full"
```

### 5.4 Business Alert Rules

**File:** `prometheus/alerts/business.yml`

```yaml
groups:
  - name: erp_business_alerts
    interval: 1m
    rules:
      # No Sales Orders Created (Business Hours; note hour() returns UTC,
      # adjust the bounds for the local timezone)
      - alert: NoSalesOrdersCreated
        expr: |
          increase(erp_sales_orders_created_total[1h]) == 0
          and on() hour() >= 9 and on() hour() < 18
        for: 1h
        labels:
          severity: warning
          component: business
        annotations:
          summary: "No sales orders created in the last hour during business hours"
          description: "This might indicate a problem with the order creation system"

      # High Order Cancellation Rate
      - alert: HighOrderCancellationRate
        expr: |
          (rate(erp_sales_orders_created_total{status="cancelled"}[1h]) / rate(erp_sales_orders_created_total[1h])) > 0.2
        for: 30m
        labels:
          severity: warning
          component: business
        annotations:
          summary: "High order cancellation rate"
          description: "{{ $value | humanizePercentage }} of orders are being cancelled (threshold: 20%)"

      # Failed Login Spike
      - alert: FailedLoginSpike
        expr: |
          rate(erp_login_failures_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
          component: security
        annotations:
          summary: "Spike in failed login attempts"
          description: "{{ $value }} failed logins per second (threshold: 10/s). Possible brute-force attack."
          runbook: "https://wiki.erp-generic.com/runbooks/brute-force-attack"
```
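Ratio-style expressions like `HighOrderCancellationRate` are easy to get subtly wrong (inverted operands, division by zero when there are no orders). Mirroring the arithmetic in a small unit-testable helper before committing the PromQL is cheap insurance; a minimal sketch (function names are illustrative, not part of the codebase):

```typescript
// Mirrors the HighOrderCancellationRate expression so the threshold
// logic can be unit-tested before the PromQL rule goes live.
export function cancellationRate(cancelledPerSec: number, totalPerSec: number): number {
  // PromQL drops the sample on division by zero; represent that as NaN.
  if (totalPerSec === 0) return NaN;
  return cancelledPerSec / totalPerSec;
}

export function shouldAlert(cancelledPerSec: number, totalPerSec: number, threshold = 0.2): boolean {
  const rate = cancellationRate(cancelledPerSec, totalPerSec);
  // NaN comparisons are false, matching PromQL producing no sample when there is no data.
  return rate > threshold;
}
```

This keeps the threshold in one reviewable place; the PromQL rule and the helper can then be checked against the same fixture values.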

---

## 6. LOGGING STRATEGY

### 6.1 Winston Configuration

**File:** `backend/src/common/logger/logger.service.ts`

```typescript
import { Injectable, LoggerService as NestLoggerService } from '@nestjs/common';
import * as winston from 'winston';
import 'winston-daily-rotate-file';

@Injectable()
export class LoggerService implements NestLoggerService {
  private logger: winston.Logger;

  constructor() {
    this.logger = winston.createLogger({
      level: process.env.LOG_LEVEL || 'info',
      format: winston.format.combine(
        winston.format.timestamp({ format: 'YYYY-MM-DD HH:mm:ss' }),
        winston.format.errors({ stack: true }),
        winston.format.splat(),
        winston.format.json(),
      ),
      defaultMeta: {
        service: 'erp-generic-backend',
        environment: process.env.NODE_ENV,
      },
      transports: [
        // Console transport (for development)
        new winston.transports.Console({
          format: winston.format.combine(
            winston.format.colorize(),
            winston.format.printf(({ timestamp, level, message, context, ...meta }) => {
              return `${timestamp} [${level}] [${context || 'Application'}] ${message} ${
                Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
              }`;
            }),
          ),
        }),

        // File transport - All logs
        new winston.transports.DailyRotateFile({
          filename: 'logs/application-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '14d',
          zippedArchive: true,
        }),

        // File transport - Error logs only
        new winston.transports.DailyRotateFile({
          level: 'error',
          filename: 'logs/error-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '20m',
          maxFiles: '30d',
          zippedArchive: true,
        }),

        // File transport - Audit logs (security events)
        new winston.transports.DailyRotateFile({
          filename: 'logs/audit-%DATE%.log',
          datePattern: 'YYYY-MM-DD',
          maxSize: '50m',
          maxFiles: '90d', // Keep for 90 days (compliance)
          zippedArchive: true,
        }),
      ],
    });

    // Add Elasticsearch/Loki transport for production
    if (process.env.NODE_ENV === 'production') {
      // Example: Winston-Elasticsearch
      // this.logger.add(new WinstonElasticsearch({
      //   level: 'info',
      //   clientOpts: {
      //     node: process.env.ELASTICSEARCH_URL,
      //     auth: {
      //       username: process.env.ELASTICSEARCH_USER,
      //       password: process.env.ELASTICSEARCH_PASSWORD,
      //     },
      //   },
      //   index: 'erp-generic-logs',
      // }));
    }
  }

  log(message: string, context?: string, meta?: any) {
    this.logger.info(message, { context, ...meta });
  }

  error(message: string, trace?: string, context?: string, meta?: any) {
    this.logger.error(message, { trace, context, ...meta });
  }

  warn(message: string, context?: string, meta?: any) {
    this.logger.warn(message, { context, ...meta });
  }

  debug(message: string, context?: string, meta?: any) {
    this.logger.debug(message, { context, ...meta });
  }

  verbose(message: string, context?: string, meta?: any) {
    this.logger.verbose(message, { context, ...meta });
  }

  // Audit logging (security-sensitive events)
  audit(event: string, userId: string, tenantId: string, details: any) {
    this.logger.info('AUDIT_EVENT', {
      event,
      userId,
      tenantId,
      details,
      timestamp: new Date().toISOString(),
      ip: details.ip,
      userAgent: details.userAgent,
    });
  }
}
```
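One caveat with the transports above: winston transports do not filter by message content, so the audit file receives every log line, not just audit events. A custom format that drops non-audit entries fixes this; the predicate below is pure and testable, and the commented lines sketch the winston wiring (an assumption, not verified against this codebase):

```typescript
// Pure predicate: only entries emitted via LoggerService.audit() carry
// the message 'AUDIT_EVENT' and should reach the audit transport.
export function isAuditEntry(info: { message?: unknown }): boolean {
  return info.message === 'AUDIT_EVENT';
}

// Winston wiring (sketch — a custom format returning false drops the entry):
// const auditOnly = winston.format((info) => (isAuditEntry(info) ? info : false));
// new winston.transports.DailyRotateFile({
//   format: auditOnly(),
//   filename: 'logs/audit-%DATE%.log',
//   ...
// });
```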

### 6.2 Structured Logging Examples

```typescript
// Login attempt
logger.audit('USER_LOGIN', userId, tenantId, {
  method: 'email',
  ip: request.ip,
  userAgent: request.headers['user-agent'],
  success: true,
});

// Database query
logger.debug('DB_QUERY', 'DatabaseService', {
  operation: 'SELECT',
  table: 'auth.users',
  duration: 45, // ms
  rowCount: 1,
});

// API request (LoggerService exposes log(), not info())
logger.log('HTTP_REQUEST', 'HttpMiddleware', {
  method: 'POST',
  path: '/api/sales/orders',
  statusCode: 201,
  duration: 234, // ms
  userId: '123e4567-e89b-12d3-a456-426614174000',
  tenantId: 'tenant-abc',
});

// Error with stack trace
logger.error('ORDER_CREATION_FAILED', error.stack, 'OrderService', {
  orderId: '123',
  tenantId: 'tenant-abc',
  error: error.message,
});
```
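Structured metadata is the easiest place to leak credentials into log files. A small redaction pass applied to the `meta` object before logging keeps secrets out; a minimal sketch (the sensitive-field list is an assumption, adapt it per project):

```typescript
// Field names whose values must never reach log files (illustrative list).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'apiKey', 'secret']);

// Returns a deep copy with sensitive values masked; the original object is untouched.
export function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value)) {
      out[k] = SENSITIVE_KEYS.has(k) ? '[REDACTED]' : redact(v);
    }
    return out;
  }
  return value;
}
```

Usage would be along the lines of `logger.audit('USER_LOGIN', userId, tenantId, redact(details) as any)`, so raw request payloads never hit the audit file verbatim.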

### 6.3 Log Aggregation (ELK Stack)

**Docker Compose for ELK Stack:**

```yaml
version: '3.9'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    container_name: erp-elasticsearch
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
      - xpack.security.enabled=false  # dev/staging only; enable security in production
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - monitoring
    restart: always

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    container_name: erp-logstash
    volumes:
      - ./logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
    ports:
      - "5044:5044"
    environment:
      LS_JAVA_OPTS: "-Xmx512m -Xms512m"
    networks:
      - monitoring
    depends_on:
      - elasticsearch
    restart: always

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    container_name: erp-kibana
    ports:
      - "5601:5601"
    environment:
      # ELASTICSEARCH_URL is obsolete in Kibana 8.x; ELASTICSEARCH_HOSTS replaces it
      ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
    networks:
      - monitoring
    depends_on:
      - elasticsearch
    restart: always

volumes:
  elasticsearch_data:

networks:
  monitoring:
    external: true
    name: erp-monitoring
```

**Logstash Configuration:**

```conf
input {
  file {
    path => "/var/log/erp-generic/application-*.log"
    type => "application"
    codec => json
    start_position => "beginning"
  }

  file {
    path => "/var/log/erp-generic/error-*.log"
    type => "error"
    codec => json
    start_position => "beginning"
  }

  file {
    path => "/var/log/erp-generic/audit-*.log"
    type => "audit"
    codec => json
    start_position => "beginning"
  }
}

filter {
  # Parse timestamp (Winston emits 'YYYY-MM-DD HH:mm:ss', which is not ISO8601;
  # keep ISO8601 as a fallback for other producers)
  date {
    match => [ "timestamp", "yyyy-MM-dd HH:mm:ss", "ISO8601" ]
    target => "@timestamp"
  }

  # Add geoip for IP addresses
  if [ip] {
    geoip {
      source => "ip"
      target => "geoip"
    }
  }

  # Extract tenant_id as a field
  if [tenantId] {
    mutate {
      add_field => { "tenant" => "%{tenantId}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "erp-generic-logs-%{+YYYY.MM.dd}"
  }

  # Debug output (optional)
  stdout {
    codec => rubydebug
  }
}
```
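The `erp-generic-logs-%{+YYYY.MM.dd}` pattern creates one index per day, and nothing above ever deletes them. An ILM (Index Lifecycle Management) policy handles retention; the sketch below builds a 30-day delete policy and PUTs it via the Elasticsearch REST API (assumes Node 18+ `fetch` and the unauthenticated single-node setup from the compose file; add auth headers for a secured cluster):

```typescript
// ILM policy: delete daily log indices 30 days after creation.
export const logRetentionPolicy = {
  policy: {
    phases: {
      hot: { actions: {} },
      delete: { min_age: '30d', actions: { delete: {} } },
    },
  },
};

// Registers the policy under the name 'erp-generic-logs' (hypothetical name).
export async function applyPolicy(esUrl = 'http://localhost:9200'): Promise<void> {
  const res = await fetch(`${esUrl}/_ilm/policy/erp-generic-logs`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(logRetentionPolicy),
  });
  if (!res.ok) throw new Error(`ILM policy update failed: ${res.status}`);
}
```

The policy still has to be attached to the indices, e.g. via an index template whose settings reference `index.lifecycle.name`.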

---

## 7. APPLICATION PERFORMANCE MONITORING (APM)

### 7.1 Custom Metrics Endpoints

**File:** `backend/src/metrics/metrics.controller.ts`

```typescript
import { Controller, Get } from '@nestjs/common';
import { MetricsService } from '../common/metrics/metrics.service';
import { PrismaService } from '../common/prisma/prisma.service';

@Controller('metrics')
export class MetricsController {
  constructor(
    private metricsService: MetricsService,
    private prisma: PrismaService,
  ) {}

  @Get()
  getMetrics() {
    return this.metricsService.getMetrics();
  }

  @Get('business')
  async getBusinessMetrics() {
    // Aggregate business metrics from database
    const [salesOrders, purchaseOrders, invoices, activeUsers] = await Promise.all([
      this.prisma.salesOrder.count(),
      this.prisma.purchaseOrder.count(),
      this.prisma.invoice.count(),
      this.prisma.user.count({ where: { status: 'active' } }),
    ]);

    return {
      sales_orders_total: salesOrders,
      purchase_orders_total: purchaseOrders,
      invoices_total: invoices,
      active_users_total: activeUsers,
    };
  }
}
```
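One caveat: `/metrics/business` returns plain JSON, which Prometheus cannot scrape directly; a scrape target must speak the text exposition format. A minimal converter for flat gauge maps (a sketch — in practice a client library such as prom-client would own this):

```typescript
// Converts a flat { metric_name: value } map into the Prometheus
// text exposition format, treating every entry as a gauge.
export function toExposition(metrics: Record<string, number>): string {
  return (
    Object.entries(metrics)
      .map(([name, value]) => `# TYPE ${name} gauge\n${name} ${value}`)
      .join('\n') + '\n'
  );
}
```

Serving this string with `Content-Type: text/plain; version=0.0.4` would make the business metrics scrapeable by the Prometheus job.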

### 7.2 Performance Profiling

**Prisma Query Logging:**

```typescript
// prisma/prisma.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
import { LoggerService } from '../logger/logger.service';

@Injectable()
export class PrismaService extends PrismaClient implements OnModuleInit {
  constructor(private logger: LoggerService) {
    super({
      log: [
        { emit: 'event', level: 'query' },
        { emit: 'event', level: 'error' },
        { emit: 'event', level: 'warn' },
      ],
    });

    // Log slow queries (>100ms)
    this.$on('query' as never, (e: any) => {
      if (e.duration > 100) {
        this.logger.warn('SLOW_QUERY', 'PrismaService', {
          query: e.query,
          duration: e.duration,
          params: e.params,
        });
      }
    });

    // Log query errors
    this.$on('error' as never, (e: any) => {
      this.logger.error('DB_ERROR', e.message, 'PrismaService', {
        target: e.target,
      });
    });
  }

  async onModuleInit() {
    await this.$connect();
  }
}
```
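The fixed 100 ms threshold above flags individual slow queries but says nothing about overall latency drift. Tracking a rolling p95 of query durations gives a better signal; a naive in-memory sketch (fine for modest volume, not a substitute for a real histogram metric in production):

```typescript
// Keeps the most recent N durations and reports percentiles on demand.
export class RollingPercentile {
  private samples: number[] = [];

  constructor(private capacity = 1000) {}

  record(durationMs: number): void {
    this.samples.push(durationMs);
    if (this.samples.length > this.capacity) this.samples.shift();
  }

  // Nearest-rank percentile over the retained window (p in 0..100).
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, idx)];
  }
}
```

Calling `record(e.duration)` inside the `query` event handler above and periodically logging `percentile(95)` would surface gradual slowdowns that never cross the per-query threshold.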

---

## 8. HEALTH CHECKS

### 8.1 Health Check Endpoints

```typescript
// health/health.controller.ts
import { Controller, Get } from '@nestjs/common';
import { HealthCheck, HealthCheckService, PrismaHealthIndicator, MemoryHealthIndicator, DiskHealthIndicator } from '@nestjs/terminus';
import { PrismaService } from '../common/prisma/prisma.service';
import { RedisHealthIndicator } from './redis.health';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: PrismaHealthIndicator,
    private prisma: PrismaService,
    private redis: RedisHealthIndicator,
    private memory: MemoryHealthIndicator,
    private disk: DiskHealthIndicator,
  ) {}

  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      // PrismaHealthIndicator needs the Prisma client instance to ping
      () => this.db.pingCheck('database', this.prisma, { timeout: 3000 }),
      () => this.redis.isHealthy('redis'),
      () => this.memory.checkHeap('memory_heap', 200 * 1024 * 1024),
      () => this.disk.checkStorage('disk', { path: '/', thresholdPercent: 0.9 }),
    ]);
  }

  @Get('live')
  liveness() {
    return { status: 'ok', timestamp: new Date().toISOString() };
  }

  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database', this.prisma),
      () => this.redis.isHealthy('redis'),
    ]);
  }
}
```
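The `RedisHealthIndicator` imported above is a custom class not shown here. A minimal sketch of its logic, with the Redis client abstracted behind a ping function so the behavior can be tested without a live server (the real implementation would extend Terminus's `HealthIndicator` and inject an actual Redis client):

```typescript
// Hypothetical sketch: the real class would extend @nestjs/terminus HealthIndicator.
type PingFn = () => Promise<string>;

export class RedisHealthIndicator {
  constructor(private ping: PingFn, private timeoutMs = 3000) {}

  async isHealthy(key: string): Promise<Record<string, { status: string }>> {
    // Fail fast: a hung Redis connection must not stall the whole health check.
    const timeout = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('redis ping timed out')), this.timeoutMs),
    );
    try {
      const reply = await Promise.race([this.ping(), timeout]);
      if (reply !== 'PONG') throw new Error(`unexpected reply: ${reply}`);
      return { [key]: { status: 'up' } };
    } catch {
      return { [key]: { status: 'down' } };
    }
  }
}
```

With ioredis, `ping` would be `() => client.ping()`; the timeout guard matters because readiness probes with no deadline can wedge a rollout.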

---

## 9. DISTRIBUTED TRACING

### 9.1 OpenTelemetry Setup

```typescript
// tracing.ts (bootstrap file — must be imported before any other application
// module so the auto-instrumentations can patch them)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the last traces are not lost
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

---

## 10. ON-CALL & INCIDENT RESPONSE

### 10.1 On-Call Rotation

- **Primary On-Call:** DevOps Engineer (24/7)
- **Secondary On-Call:** Backend Lead
- **Escalation Path:** CTO → CEO

### 10.2 Incident Severity

| Severity | Response Time | Examples |
|----------|---------------|----------|
| **P0 (Critical)** | 15 min | System down, data loss |
| **P1 (High)** | 1 hour | Major feature broken |
| **P2 (Medium)** | 4 hours | Minor feature broken |
| **P3 (Low)** | 24 hours | Cosmetic issue |

---

## 11. REFERENCES

- [Deployment Guide](./DEPLOYMENT-GUIDE.md)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)

---

**Document:** MONITORING-OBSERVABILITY.md
**Version:** 1.0
**Total Pages:** ~18
**Last Updated:** 2025-11-24