# Local LLM Agent

Local LLM gateway for workspace-v2. It lets agents (Claude Code, Trae, Gemini) delegate simple tasks to save context and tokens.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         EXTERNAL AGENTS                         │
│   Claude Code (Orchestrator) │ Trae (Executor) │ Gemini (QA)    │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   LOCAL-LLM-AGENT (Port 3160)                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │         API Gateway (NestJS) - OpenAI Compatible          │  │
│  │  POST /v1/chat/completions  │  POST /mcp/tools/:name      │  │
│  │  GET/POST /v1/lora/*                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                             │                                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │                      Router Service                       │  │
│  │  - Tier Classification (small/main)                       │  │
│  │  - Project Detection with Confidence Scoring              │  │
│  │  - LoRA Adapter Mapping                                   │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │                                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │             Inference Engine (Python FastAPI)             │  │
│  │  - Ollama Backend (CPU, development)                      │  │
│  │  - vLLM Backend (GPU, production)                         │  │
│  │  - Multi-LoRA Support                                     │  │
│  │  - Continuous Batching                                    │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │                                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │             Monitoring (Prometheus + Grafana)             │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
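
The Router Service's tier classification logic is not documented here; as a minimal sketch of the idea, assuming a simple length-and-keyword heuristic (the function name, keyword list, and threshold are all illustrative, not the actual NestJS implementation):

```python
# Illustrative sketch only: the real Router Service logic is not shown in
# this README. Assumption: short, tool-like prompts route to the "small"
# tier; long or open-ended prompts route to "main".
SMALL_TIER_HINTS = {"classify", "extract", "summarize", "translate"}

def classify_tier(prompt: str, max_small_words: int = 50) -> str:
    """Return "small" for short, mechanical prompts, else "main"."""
    words = prompt.lower().split()
    if len(words) <= max_small_words and SMALL_TIER_HINTS & set(words):
        return "small"
    return "main"
```

Under this heuristic, `classify_tier("classify this ticket as bug or feature")` routes to the small tier, while a long design question routes to the main model.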

## Quick Start

```bash
# Development (CPU with Ollama)
docker-compose up -d

# Production (GPU with vLLM)
./scripts/setup-wsl-gpu.sh   # configure GPU (one time)
docker-compose -f docker-compose.prod.yml up -d

# vLLM only, for development
docker-compose -f docker-compose.vllm.yml up -d

# Monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d
```

## Services

| Service | Port | Description |
|---------|------|-------------|
| Gateway API | 3160 | OpenAI-compatible API gateway |
| Inference Engine | 3161 | Python inference service |
| Ollama Backend | 11434 | CPU backend (development) |
| vLLM Backend | 8000 | GPU backend (production) |
| Prometheus | 9090 | Metrics |
| Grafana | 3000 | Dashboards (admin/admin) |
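
To check which of the services above are actually listening, a quick sketch using plain TCP connection attempts (ports are taken from the table; `localhost` as host is an assumption):

```python
import socket

# Service -> port mapping, taken from the table above.
SERVICES = {
    "Gateway API": 3160,
    "Inference Engine": 3161,
    "Ollama Backend": 11434,
    "vLLM Backend": 8000,
    "Prometheus": 9090,
    "Grafana": 3000,
}

def is_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        print(f"{name} ({port}): {'up' if is_listening(port) else 'down'}")
```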

## APIs

### OpenAI-Compatible

```bash
# Chat completion
curl -X POST http://localhost:3160/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List models
curl http://localhost:3160/v1/models
```
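
Since the gateway speaks the OpenAI wire format, any HTTP client works; a stdlib-only sketch of the same call (no auth header is sent, which assumes the gateway does not require an API key):

```python
import json
import urllib.request

GATEWAY = "http://localhost:3160"  # gateway port from the services table

def build_chat_request(content: str, model: str = "gpt-oss-20b") -> dict:
    """Build an OpenAI-style chat.completions payload."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

def chat(content: str, model: str = "gpt-oss-20b") -> str:
    """POST to /v1/chat/completions and return the first choice's text."""
    req = urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=json.dumps(build_chat_request(content, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```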

### MCP Tools

```bash
# Classify text
curl -X POST http://localhost:3160/mcp/tools/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Fix bug in login",
    "categories": ["bug", "feature", "refactor"]
  }'

# Extract structured data
curl -X POST http://localhost:3160/mcp/tools/extract \
  -H "Content-Type: application/json" \
  -d '{
    "input": "John is 30 years old and works as engineer",
    "schema": {"name": "string", "age": "number", "job": "string"}
  }'
```
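
The same tool endpoints can be wrapped in a small helper; a sketch assuming the `POST /mcp/tools/:name` route shown above and a JSON response body (error handling kept minimal):

```python
import json
import urllib.request

GATEWAY = "http://localhost:3160"

def classify_payload(text: str, categories: list) -> dict:
    """Build the request body shown in the classify example above."""
    return {"input": text, "categories": list(categories)}

def call_tool(name: str, payload: dict) -> dict:
    """POST a JSON payload to /mcp/tools/<name> and return the parsed reply."""
    req = urllib.request.Request(
        f"{GATEWAY}/mcp/tools/{name}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running gateway):
# call_tool("classify", classify_payload("Fix bug in login",
#                                        ["bug", "feature", "refactor"]))
```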

### LoRA Management

```bash
# List adapters
curl http://localhost:3160/v1/lora/adapters

# Get adapter status
curl http://localhost:3160/v1/lora/status

# View project mappings
curl http://localhost:3160/v1/lora/mappings
```
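
How the router turns a detected project into an adapter is not spelled out here; a sketch of the idea, combining the "Project Detection with Confidence Scoring" and "LoRA Adapter Mapping" steps from the architecture diagram (`pick_adapter`, the example mapping, and the threshold are all hypothetical):

```python
# Illustrative only: real mappings come from GET /v1/lora/mappings, and the
# real detection logic lives in the Router Service.
EXAMPLE_MAPPINGS = {"workspace-v2": "lora-workspace-v2"}

def pick_adapter(project: str, confidence: float,
                 mappings: dict, threshold: float = 0.7):
    """Return the adapter for a project, or None (fall back to the base
    model) when detection confidence is too low or no mapping exists."""
    if confidence < threshold:
        return None
    return mappings.get(project)
```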

### Metrics (Prometheus)

```bash
# Get metrics
curl http://localhost:3161/metrics
```
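
The `/metrics` endpoint serves the standard Prometheus text exposition format; a minimal parser sketch for un-labeled gauge/counter lines (the sample metric name is made up, not one of the gateway's actual metrics):

```python
def parse_metrics(text: str) -> dict:
    """Parse simple 'name value' lines of the Prometheus text format,
    skipping comments (# HELP / # TYPE) and labeled series."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

sample = """\
# HELP requests_total Total requests (hypothetical metric name).
# TYPE requests_total counter
requests_total 42
"""
```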

## Configuration

See `.env.example` for the available environment variables.
See `.env.production` for the production template.

## GPU Setup (WSL)

To use vLLM with GPU support under WSL:

```bash
./scripts/setup-wsl-gpu.sh
```

See [WSL-GPU-SETUP.md](docs/70-onboarding/WSL-GPU-SETUP.md) for details.

## Documentation

- [Architecture](docs/00-vision-general/ARQUITECTURA-LOCAL-LLM.md)
- [WSL GPU Setup](docs/70-onboarding/WSL-GPU-SETUP.md)
- [ADR-001: Runtime Selection](docs/90-adr/ADR-001-runtime-selection.md)
- [ADR-002: Model Selection](docs/90-adr/ADR-002-model-selection.md)

## Version

- **Version:** 0.6.0
- **Status:** Production Ready (Phase 3 complete)
- **Priority:** P1 (supporting infrastructure)

## Changelog

### v0.6.0 (Phase 3 - Production)
- vLLM backend with GPU support
- Multi-LoRA adapters per project
- Prometheus metrics endpoint
- Grafana dashboard
- Continuous batching
- Project detection with confidence scoring
- Production docker-compose

### v0.5.0 (Phase 2 - MCP + Rate Limiting)
- MCP Tools (classify, extract, summarize, qa)
- Rate limiting per tier
- Basic project detection

### v0.1.0 (Phase 1 - MVP)
- NestJS gateway
- Python inference engine
- Ollama backend
- OpenAI-compatible API