local-llm-agent/README.md
Adrian Flores Cortes 3def230d58 Initial commit: local-llm-agent infrastructure project
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 16:42:45 -06:00


# Local LLM Agent
Local LLM gateway for workspace-v2. It lets the agents (Claude Code, Trae, Gemini) delegate simple tasks to a local model, saving context and tokens.
## Arquitectura
```
┌─────────────────────────────────────────────────────────────────┐
│                         EXTERNAL AGENTS                         │
│   Claude Code (Orchestrator) │ Trae (Executor) │ Gemini (QA)    │
└─────────────────────────────┬───────────────────────────────────┘
                              │
┌─────────────────────────────┴───────────────────────────────────┐
│                   LOCAL-LLM-AGENT (Port 3160)                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │         API Gateway (NestJS) - OpenAI Compatible          │  │
│  │  POST /v1/chat/completions │ POST /mcp/tools/:name        │  │
│  │  GET/POST /v1/lora/*                                      │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                │                                │
│  ┌─────────────────────────────┴─────────────────────────────┐  │
│  │                      Router Service                       │  │
│  │  - Tier Classification (small/main)                       │  │
│  │  - Project Detection with Confidence Scoring              │  │
│  │  - LoRA Adapter Mapping                                   │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                │                                │
│  ┌─────────────────────────────┴─────────────────────────────┐  │
│  │             Inference Engine (Python FastAPI)             │  │
│  │  - Ollama Backend (CPU, development)                      │  │
│  │  - vLLM Backend (GPU, production)                         │  │
│  │  - Multi-LoRA Support                                     │  │
│  │  - Continuous Batching                                    │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                │                                │
│  ┌─────────────────────────────┴─────────────────────────────┐  │
│  │             Monitoring (Prometheus + Grafana)             │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
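The Router Service's tier classification can be pictured with a minimal sketch. The hint keywords, length cutoff, and function name below are hypothetical illustrations of the idea, not the gateway's actual rules:

```python
# Hypothetical tier-classification heuristic: short, pattern-like tasks are
# routed to the "small" tier; everything else falls through to "main".
SMALL_TIER_HINTS = ("classify", "extract", "summarize", "translate")

def classify_tier(prompt, max_small_len=500):
    """Return "small" for simple delegable tasks, "main" otherwise."""
    text = prompt.lower()
    if len(prompt) <= max_small_len and any(h in text for h in SMALL_TIER_HINTS):
        return "small"
    return "main"

print(classify_tier("Classify this ticket: login page crashes"))        # small
print(classify_tier("Design a migration plan for the billing service")) # main
```

The point of the split is cost: "small"-tier requests can be served by the cheap local backend, while "main" requests stay with the larger model.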
## Quick Start
```bash
# Development (CPU with Ollama)
docker-compose up -d

# Production (GPU with vLLM)
./scripts/setup-wsl-gpu.sh   # Configure the GPU (run once)
docker-compose -f docker-compose.prod.yml up -d

# vLLM only, for development
docker-compose -f docker-compose.vllm.yml up -d

# Monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d
```
## Services
| Service | Port | Description |
|---------|------|-------------|
| Gateway API | 3160 | OpenAI-compatible API gateway |
| Inference Engine | 3161 | Python inference service |
| Ollama Backend | 11434 | CPU backend (development) |
| vLLM Backend | 8000 | GPU backend (production) |
| Prometheus | 9090 | Metrics |
| Grafana | 3000 | Dashboard (admin/admin) |
## APIs
### OpenAI-Compatible
```bash
# Chat completion
curl -X POST http://localhost:3160/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List models
curl http://localhost:3160/v1/models
```
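Because the gateway speaks the OpenAI wire format, any OpenAI-style client can target it by pointing at port 3160. A minimal Python sketch that builds the same request as the curl example above (sending it requires the gateway to be running, so the `urlopen` call is left commented):

```python
import json
from urllib import request

# Endpoint and model name taken from the curl example above.
GATEWAY_URL = "http://localhost:3160/v1/chat/completions"

payload = {
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}],
}

req = request.Request(
    GATEWAY_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # needs the gateway running on :3160
print(json.dumps(payload))
```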
### MCP Tools
```bash
# Classify text
curl -X POST http://localhost:3160/mcp/tools/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Fix bug in login",
    "categories": ["bug", "feature", "refactor"]
  }'

# Extract structured data
curl -X POST http://localhost:3160/mcp/tools/extract \
  -H "Content-Type: application/json" \
  -d '{
    "input": "John is 30 years old and works as engineer",
    "schema": {"name": "string", "age": "number", "job": "string"}
  }'
```
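On the caller's side, the extract tool's response can be checked against the declared schema. A hypothetical sketch of that validation, using the type names from the example above (the gateway's actual validation logic may differ):

```python
# Hypothetical mapping from the schema's type names to Python types.
TYPE_MAP = {"string": str, "number": (int, float)}

def matches_schema(result, schema):
    """Check that every schema field is present with the declared type."""
    return all(
        key in result and isinstance(result[key], TYPE_MAP[expected])
        for key, expected in schema.items()
    )

schema = {"name": "string", "age": "number", "job": "string"}
extracted = {"name": "John", "age": 30, "job": "engineer"}
print(matches_schema(extracted, schema))  # True
```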
### LoRA Management
```bash
# List adapters
curl http://localhost:3160/v1/lora/adapters

# Get adapter status
curl http://localhost:3160/v1/lora/status

# View project mappings
curl http://localhost:3160/v1/lora/mappings
```
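The mapping step combines the two router features: project detection produces per-project confidence scores, and the winning project selects a LoRA adapter. A sketch of that selection, where the adapter names and the 0.6 threshold are purely illustrative:

```python
# Hypothetical project -> LoRA adapter mapping with a confidence cutoff.
PROJECT_ADAPTERS = {"workspace-v2": "lora-workspace-v2", "billing": "lora-billing"}

def pick_adapter(scores, threshold=0.6):
    """Return the adapter for the highest-scoring project, or None."""
    project, score = max(scores.items(), key=lambda kv: kv[1])
    if score >= threshold:
        return PROJECT_ADAPTERS.get(project)
    return None  # low confidence: fall back to the base model

print(pick_adapter({"workspace-v2": 0.82, "billing": 0.10}))  # lora-workspace-v2
print(pick_adapter({"workspace-v2": 0.35, "billing": 0.30}))  # None
```

Falling back to `None` below the threshold means an ambiguous request is served by the base model rather than by a wrong adapter.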
### Metrics (Prometheus)
```bash
# Get metrics
curl http://localhost:3161/metrics
```
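The `/metrics` endpoint serves the Prometheus text exposition format. A minimal sketch of what one counter looks like in that format (the metric name here is illustrative, not necessarily one the engine exports):

```python
def render_counter(name, value, help_text):
    """Render a single counter in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name} {value}\n"
    )

print(render_counter("llm_requests_total", 42, "Total inference requests"))
```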
## Configuration
See `.env.example` for the available environment variables.
See `.env.production` for the production template.
## GPU Setup (WSL)
To use vLLM with a GPU under WSL:
```bash
./scripts/setup-wsl-gpu.sh
```
See [WSL-GPU-SETUP.md](docs/70-onboarding/WSL-GPU-SETUP.md) for more details.
## Documentation
- [Architecture](docs/00-vision-general/ARQUITECTURA-LOCAL-LLM.md)
- [WSL GPU Setup](docs/70-onboarding/WSL-GPU-SETUP.md)
- [ADR-001: Runtime Selection](docs/90-adr/ADR-001-runtime-selection.md)
- [ADR-002: Model Selection](docs/90-adr/ADR-002-model-selection.md)
## Version
- **Version:** 0.6.0
- **Status:** Production Ready (Phase 3 complete)
- **Priority:** P1 (supporting infrastructure)
## Changelog
### v0.6.0 (Phase 3 - Production)
- vLLM backend with GPU support
- Multi-LoRA adapters per project
- Prometheus metrics endpoint
- Grafana dashboard
- Continuous batching
- Project detection with confidence scoring
- Production docker-compose
### v0.5.0 (Phase 2 - MCP + Rate Limiting)
- MCP Tools (classify, extract, summarize, qa)
- Rate limiting per tier
- Basic project detection
### v0.1.0 (Phase 1 - MVP)
- Gateway NestJS
- Inference Engine Python
- Ollama backend
- OpenAI-compatible API