# Local LLM Agent

Local LLM gateway for workspace-v2. It lets agents (Claude Code, Trae, Gemini) delegate simple tasks to save context and tokens.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        EXTERNAL AGENTS                          │
│  Claude Code (Orchestrator) │ Trae (Executor) │ Gemini (QA)     │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  LOCAL-LLM-AGENT (Port 3160)                    │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  API Gateway (NestJS) - OpenAI Compatible                 │  │
│  │  POST /v1/chat/completions │ POST /mcp/tools/:name        │  │
│  │  GET/POST /v1/lora/*                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                  │
│  ┌───────────────────────────┴───────────────────────────────┐  │
│  │  Router Service                                           │  │
│  │  - Tier Classification (small/main)                       │  │
│  │  - Project Detection with Confidence Scoring              │  │
│  │  - LoRA Adapter Mapping                                   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                  │
│  ┌───────────────────────────┴───────────────────────────────┐  │
│  │  Inference Engine (Python FastAPI)                        │  │
│  │  - Ollama Backend (CPU, development)                      │  │
│  │  - vLLM Backend (GPU, production)                         │  │
│  │  - Multi-LoRA Support                                     │  │
│  │  - Continuous Batching                                    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                  │
│  ┌───────────────────────────┴───────────────────────────────┐  │
│  │  Monitoring (Prometheus + Grafana)                        │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

## Quick Start

```bash
# Development (CPU with Ollama)
docker-compose up -d

# Production (GPU with vLLM)
./scripts/setup-wsl-gpu.sh                          # Configure the GPU (one time)
docker-compose -f docker-compose.prod.yml up -d

# vLLM only, for development
docker-compose -f docker-compose.vllm.yml up -d

# Monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d
```

## Services

| Service | Port | Description |
|---------|------|-------------|
| Gateway API | 3160 | OpenAI-compatible API gateway |
| Inference Engine | 3161 | Python inference service |
| Ollama Backend | 11434 | CPU backend (development) |
| vLLM Backend | 8000 | GPU backend (production) |
| Prometheus | 9090 | Metrics |
| Grafana | 3000 | Dashboard (admin/admin) |

## APIs

### OpenAI-Compatible

```bash
# Chat completion
curl -X POST http://localhost:3160/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List models
curl http://localhost:3160/v1/models
```

### MCP Tools

```bash
# Classify text
curl -X POST http://localhost:3160/mcp/tools/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Fix bug in login",
    "categories": ["bug", "feature", "refactor"]
  }'

# Extract structured data
curl -X POST http://localhost:3160/mcp/tools/extract \
  -H "Content-Type: application/json" \
  -d '{
    "input": "John is 30 years old and works as an engineer",
    "schema": {"name": "string", "age": "number", "job": "string"}
  }'
```

### LoRA Management

```bash
# List adapters
curl http://localhost:3160/v1/lora/adapters

# Get adapter status
curl http://localhost:3160/v1/lora/status

# View project mappings
curl http://localhost:3160/v1/lora/mappings
```

### Metrics (Prometheus)

```bash
# Get metrics
curl http://localhost:3161/metrics
```

## Configuration

See `.env.example` for the available environment variables. See `.env.production` for the production template.

## GPU Setup (WSL)

To use vLLM with a GPU under WSL:

```bash
./scripts/setup-wsl-gpu.sh
```

See [WSL-GPU-SETUP.md](docs/70-onboarding/WSL-GPU-SETUP.md) for more details.
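For agents that prefer a programmatic client over `curl`, the OpenAI-compatible endpoint can be called with a short stdlib-only sketch. The helper names `build_chat_payload` and `chat` are illustrative, not part of the gateway API:

```python
import json
import urllib.request

GATEWAY = "http://localhost:3160"  # Gateway API port from the Services table

def build_chat_payload(prompt: str, model: str = "gpt-oss-20b") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    """POST to /v1/chat/completions and return the assistant's reply."""
    req = urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the gateway speaks the OpenAI wire format, any OpenAI-compatible SDK pointed at `http://localhost:3160/v1` should work as well.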
## Documentation

- [Architecture](docs/00-vision-general/ARQUITECTURA-LOCAL-LLM.md)
- [WSL GPU Setup](docs/70-onboarding/WSL-GPU-SETUP.md)
- [ADR-001: Runtime Selection](docs/90-adr/ADR-001-runtime-selection.md)
- [ADR-002: Model Selection](docs/90-adr/ADR-002-model-selection.md)

## Version

- **Version:** 0.6.0
- **Status:** Production Ready (Phase 3 complete)
- **Priority:** P1 (supporting infrastructure)

## Changelog

### v0.6.0 (Phase 3 - Production)
- vLLM backend with GPU support
- Multi-LoRA adapters per project
- Prometheus metrics endpoint
- Grafana dashboard
- Continuous batching
- Project detection with confidence scoring
- Production docker-compose

### v0.5.0 (Phase 2 - MCP + Rate Limiting)
- MCP Tools (classify, extract, summarize, qa)
- Rate limiting per tier
- Basic project detection

### v0.1.0 (Phase 1 - MVP)
- NestJS gateway
- Python inference engine
- Ollama backend
- OpenAI-compatible API
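The small/main tier classification performed by the Router Service (and rate-limited per tier since v0.5.0) can be illustrated with a minimal sketch. The task names come from the MCP tools, while the length threshold and routing rule here are assumptions for illustration, not the gateway's actual policy:

```python
# Illustrative sketch of small/main tier routing. The real Router Service
# rules (task list, thresholds, confidence scoring) may differ.
SMALL_TIER_TASKS = {"classify", "extract", "summarize", "qa"}

def classify_tier(task: str, prompt: str, max_small_chars: int = 2000) -> str:
    """Route short, well-scoped tasks to the small tier; everything else to main."""
    if task in SMALL_TIER_TASKS and len(prompt) <= max_small_chars:
        return "small"
    return "main"
```

The idea is that cheap, bounded tasks stay on the small local model, reserving the main model (and the orchestrating agent's context) for open-ended work.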