# Local LLM Agent

Local LLM gateway for workspace-v2. It lets external agents (Claude Code, Trae, Gemini) delegate simple tasks to a local model to save context and tokens.
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         EXTERNAL AGENTS                         │
│  Claude Code (Orchestrator) │ Trae (Executor) │ Gemini (QA)     │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   LOCAL-LLM-AGENT (Port 3160)                   │
│ ┌───────────────────────────────────────────────────────────┐   │
│ │        API Gateway (NestJS) - OpenAI Compatible           │   │
│ │  POST /v1/chat/completions │ POST /mcp/tools/:name        │   │
│ │  GET/POST /v1/lora/*                                      │   │
│ └───────────────────────────────────────────────────────────┘   │
│                             │                                   │
│ ┌───────────────────────────┴───────────────────────────────┐   │
│ │                      Router Service                       │   │
│ │  - Tier Classification (small/main)                       │   │
│ │  - Project Detection with Confidence Scoring              │   │
│ │  - LoRA Adapter Mapping                                   │   │
│ └───────────────────────────────────────────────────────────┘   │
│                             │                                   │
│ ┌───────────────────────────┴───────────────────────────────┐   │
│ │            Inference Engine (Python FastAPI)              │   │
│ │  - Ollama Backend (CPU, development)                      │   │
│ │  - vLLM Backend (GPU, production)                         │   │
│ │  - Multi-LoRA Support                                     │   │
│ │  - Continuous Batching                                    │   │
│ └───────────────────────────────────────────────────────────┘   │
│                             │                                   │
│ ┌───────────────────────────┴───────────────────────────────┐   │
│ │             Monitoring (Prometheus + Grafana)             │   │
│ └───────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
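The Router Service's small/main split can be pictured with a short sketch. This is a hypothetical illustration, not the gateway's actual code: the task set, character threshold, and function name are all invented for this README.

```python
# Hypothetical sketch of small/main tier classification.
# The real heuristics live in the Router Service; thresholds here are invented.

SMALL_TIER_TASKS = {"classify", "extract", "summarize", "qa"}  # the MCP tools


def classify_tier(task: str, prompt: str, small_max_chars: int = 2000) -> str:
    """Route short, structured tasks to the small tier; everything else to main."""
    if task in SMALL_TIER_TASKS and len(prompt) <= small_max_chars:
        return "small"
    return "main"
```

The idea is simply that cheap, well-bounded tool calls go to a small model while open-ended or long prompts go to the main one.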
## Quick Start

```bash
# Development (CPU with Ollama)
docker-compose up -d

# Production (GPU with vLLM)
./scripts/setup-wsl-gpu.sh   # configure the GPU (one-time)
docker-compose -f docker-compose.prod.yml up -d

# vLLM only, for development
docker-compose -f docker-compose.vllm.yml up -d

# Monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d
```
## Services

| Service | Port | Description |
|---|---|---|
| Gateway API | 3160 | OpenAI-compatible API gateway |
| Inference Engine | 3161 | Python inference service |
| Ollama Backend | 11434 | CPU backend (development) |
| vLLM Backend | 8000 | GPU backend (production) |
| Prometheus | 9090 | Metrics |
| Grafana | 3000 | Dashboards (admin/admin) |
## APIs

### OpenAI-Compatible

```bash
# Chat completion
curl -X POST http://localhost:3160/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List models
curl http://localhost:3160/v1/models
```
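Because the gateway speaks the OpenAI wire format, any HTTP client can call it. A minimal Python sketch using only the standard library; the `choices[0].message.content` response path is the standard OpenAI layout, assumed here to match the gateway's:

```python
"""Minimal stdlib-only client for the local gateway."""
import json
import urllib.request

GATEWAY_URL = "http://localhost:3160"  # Gateway API port from the table above


def chat_payload(model, prompt, system=None):
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages}


def chat(model, prompt, system=None):
    """POST the payload to the gateway and return the assistant reply."""
    req = urllib.request.Request(
        f"{GATEWAY_URL}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt, system)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the stack to be running):
#   print(chat("gpt-oss-20b", "Hello"))
```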
### MCP Tools

```bash
# Classify text
curl -X POST http://localhost:3160/mcp/tools/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Fix bug in login",
    "categories": ["bug", "feature", "refactor"]
  }'

# Extract structured data
curl -X POST http://localhost:3160/mcp/tools/extract \
  -H "Content-Type: application/json" \
  -d '{
    "input": "John is 30 years old and works as engineer",
    "schema": {"name": "string", "age": "number", "job": "string"}
  }'
```
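All MCP endpoints share one shape: `POST /mcp/tools/:name` with a JSON body. A small stdlib-only wrapper mirroring the curl examples above; the helper names are ours, not part of the API:

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:3160"


def call_mcp_tool(name, payload):
    """POST a JSON payload to /mcp/tools/<name> and return the parsed reply."""
    req = urllib.request.Request(
        f"{GATEWAY_URL}/mcp/tools/{name}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def classify_request(text, categories):
    """Request body for the `classify` tool."""
    return {"input": text, "categories": list(categories)}


def extract_request(text, schema):
    """Request body for the `extract` tool."""
    return {"input": text, "schema": schema}

# Example (requires the stack to be running):
#   call_mcp_tool("classify",
#                 classify_request("Fix bug in login",
#                                  ["bug", "feature", "refactor"]))
```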
### LoRA Management

```bash
# List adapters
curl http://localhost:3160/v1/lora/adapters

# Get adapter status
curl http://localhost:3160/v1/lora/status

# View project mappings
curl http://localhost:3160/v1/lora/mappings
```
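How the router maps a detected project to an adapter is internal to the Router Service. As a rough illustration of "project detection with confidence scoring", here is a hypothetical keyword-based sketch; every name, map, and threshold below is invented:

```python
# Hypothetical sketch of project detection + LoRA adapter mapping.
# Not the gateway's real logic: maps and threshold are illustrative only.

def detect_project(prompt, keyword_map):
    """Score each project by keyword hits; return (best project, confidence)."""
    scores = {}
    words = prompt.lower().split()
    for project, keywords in keyword_map.items():
        hits = sum(1 for kw in keywords if kw in words)
        scores[project] = hits / max(len(keywords), 1)
    best = max(scores, key=scores.get)
    return best, scores[best]


def pick_adapter(prompt, keyword_map, adapter_map, threshold=0.5):
    """Map the detected project to a LoRA adapter, falling back to base."""
    project, confidence = detect_project(prompt, keyword_map)
    if confidence >= threshold:
        return adapter_map.get(project, "base")
    return "base"
```

The key design point is the confidence threshold: a low-confidence detection falls back to the base model rather than loading a possibly wrong adapter.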
### Metrics (Prometheus)

```bash
# Get metrics
curl http://localhost:3161/metrics
```
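The endpoint returns the Prometheus text exposition format. For quick checks without a Prometheus server, a minimal parser that handles simple `name value` and `name{labels} value` sample lines; the metric names in the sample are invented for illustration:

```python
def parse_prometheus_text(body):
    """Parse simple Prometheus exposition lines into {series: float}."""
    metrics = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        try:
            metrics[series] = float(value)
        except ValueError:
            pass  # skip lines that do not end in a number
    return metrics


sample = """\
# HELP inference_requests_total Total requests (hypothetical metric name)
# TYPE inference_requests_total counter
inference_requests_total{backend="ollama"} 42
inference_latency_seconds_sum 1.5
"""
# parse_prometheus_text(sample)["inference_latency_seconds_sum"] -> 1.5
```

Note this sketch splits on the last space, so it would mishandle label values that contain spaces; for anything serious, use the official `prometheus_client` parser.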
## Configuration

See .env.example for the available environment variables.
See .env.production for the production template.
## GPU Setup (WSL)

To use vLLM with a GPU under WSL:

```bash
./scripts/setup-wsl-gpu.sh
```

See WSL-GPU-SETUP.md for details.
## Documentation
## Version

- Version: 0.6.0
- Status: Production Ready (Phase 3 complete)
- Priority: P1 (supporting infrastructure)
## Changelog

### v0.6.0 (Phase 3 - Production)

- vLLM backend with GPU support
- Multi-LoRA adapters per project
- Prometheus metrics endpoint
- Grafana dashboard
- Continuous batching
- Project detection with confidence scoring
- Production docker-compose

### v0.5.0 (Phase 2 - MCP + Rate Limiting)

- MCP Tools (classify, extract, summarize, qa)
- Rate limiting per tier
- Basic project detection

### v0.1.0 (Phase 1 - MVP)

- NestJS gateway
- Python inference engine
- Ollama backend
- OpenAI-compatible API