# Local LLM Agent

Local LLM gateway for workspace-v2. It lets agents (Claude Code, Trae, Gemini) delegate simple tasks to save context and tokens.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         EXTERNAL AGENTS                         │
│   Claude Code (Orchestrator) │ Trae (Executor) │ Gemini (QA)    │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   LOCAL-LLM-AGENT (Port 3160)                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │         API Gateway (NestJS) - OpenAI Compatible          │  │
│  │  POST /v1/chat/completions  │  POST /mcp/tools/:name      │  │
│  │  GET/POST /v1/lora/*                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                             │                                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │                      Router Service                       │  │
│  │  - Tier Classification (small/main)                       │  │
│  │  - Project Detection with Confidence Scoring              │  │
│  │  - LoRA Adapter Mapping                                   │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │                                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │             Inference Engine (Python FastAPI)             │  │
│  │  - Ollama Backend (CPU, development)                      │  │
│  │  - vLLM Backend (GPU, production)                         │  │
│  │  - Multi-LoRA Support                                     │  │
│  │  - Continuous Batching                                    │  │
│  └──────────────────────────┬────────────────────────────────┘  │
│                             │                                   │
│  ┌──────────────────────────┴────────────────────────────────┐  │
│  │             Monitoring (Prometheus + Grafana)             │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
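
The Router Service's tier classification logic is not documented here; as a minimal sketch of the idea, assuming a simple length-and-keyword heuristic (the function name, keyword list, and threshold are all illustrative, not the actual NestJS implementation):

```python
# Illustrative sketch only: the real Router Service logic is not shown in
# this README. Assumption: short, tool-like prompts route to the "small"
# tier; long or open-ended prompts route to "main".
SMALL_TIER_HINTS = {"classify", "extract", "summarize", "translate"}

def classify_tier(prompt: str, max_small_words: int = 50) -> str:
    """Return "small" for short, mechanical prompts, else "main"."""
    words = prompt.lower().split()
    if len(words) <= max_small_words and SMALL_TIER_HINTS & set(words):
        return "small"
    return "main"
```

Under this heuristic, `classify_tier("classify this ticket as bug or feature")` routes to the small tier, while a long design question routes to the main model.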

## Quick Start

```bash
# Development (CPU with Ollama)
docker-compose up -d

# Production (GPU with vLLM)
./scripts/setup-wsl-gpu.sh   # configure GPU (one time)
docker-compose -f docker-compose.prod.yml up -d

# vLLM only, for development
docker-compose -f docker-compose.vllm.yml up -d

# Monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d
```

## Services

| Service | Port | Description |
|---------|------|-------------|
| Gateway API | 3160 | OpenAI-compatible API gateway |
| Inference Engine | 3161 | Python inference service |
| Ollama Backend | 11434 | CPU backend (development) |
| vLLM Backend | 8000 | GPU backend (production) |
| Prometheus | 9090 | Metrics |
| Grafana | 3000 | Dashboards (admin/admin) |
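
To check which of the services above are actually listening, a quick sketch using plain TCP connection attempts (ports are taken from the table; `localhost` as host is an assumption):

```python
import socket

# Service -> port mapping, taken from the table above.
SERVICES = {
    "Gateway API": 3160,
    "Inference Engine": 3161,
    "Ollama Backend": 11434,
    "vLLM Backend": 8000,
    "Prometheus": 9090,
    "Grafana": 3000,
}

def is_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        print(f"{name} ({port}): {'up' if is_listening(port) else 'down'}")
```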

## APIs

### OpenAI-Compatible

```bash
# Chat completion
curl -X POST http://localhost:3160/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List models
curl http://localhost:3160/v1/models
```
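
Since the gateway speaks the OpenAI wire format, any HTTP client works; a stdlib-only sketch of the same call (no auth header is sent, which assumes the gateway does not require an API key):

```python
import json
import urllib.request

GATEWAY = "http://localhost:3160"  # gateway port from the services table

def build_chat_request(content: str, model: str = "gpt-oss-20b") -> dict:
    """Build an OpenAI-style chat.completions payload."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

def chat(content: str, model: str = "gpt-oss-20b") -> str:
    """POST to /v1/chat/completions and return the first choice's text."""
    req = urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=json.dumps(build_chat_request(content, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```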

### MCP Tools

```bash
# Classify text
curl -X POST http://localhost:3160/mcp/tools/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Fix bug in login",
    "categories": ["bug", "feature", "refactor"]
  }'

# Extract structured data
curl -X POST http://localhost:3160/mcp/tools/extract \
  -H "Content-Type: application/json" \
  -d '{
    "input": "John is 30 years old and works as engineer",
    "schema": {"name": "string", "age": "number", "job": "string"}
  }'
```
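
The same tool endpoints can be wrapped in a small helper; a sketch assuming the `POST /mcp/tools/:name` route shown above and a JSON response body (error handling kept minimal):

```python
import json
import urllib.request

GATEWAY = "http://localhost:3160"

def classify_payload(text: str, categories: list) -> dict:
    """Build the request body shown in the classify example above."""
    return {"input": text, "categories": list(categories)}

def call_tool(name: str, payload: dict) -> dict:
    """POST a JSON payload to /mcp/tools/<name> and return the parsed reply."""
    req = urllib.request.Request(
        f"{GATEWAY}/mcp/tools/{name}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running gateway):
# call_tool("classify", classify_payload("Fix bug in login",
#                                        ["bug", "feature", "refactor"]))
```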

### LoRA Management

```bash
# List adapters
curl http://localhost:3160/v1/lora/adapters

# Get adapter status
curl http://localhost:3160/v1/lora/status

# View project mappings
curl http://localhost:3160/v1/lora/mappings
```
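
How the router turns a detected project into an adapter is not spelled out here; a sketch of the idea, combining the "Project Detection with Confidence Scoring" and "LoRA Adapter Mapping" steps from the architecture diagram (`pick_adapter`, the example mapping, and the threshold are all hypothetical):

```python
# Illustrative only: real mappings come from GET /v1/lora/mappings, and the
# real detection logic lives in the Router Service.
EXAMPLE_MAPPINGS = {"workspace-v2": "lora-workspace-v2"}

def pick_adapter(project: str, confidence: float,
                 mappings: dict, threshold: float = 0.7):
    """Return the adapter for a project, or None (fall back to the base
    model) when detection confidence is too low or no mapping exists."""
    if confidence < threshold:
        return None
    return mappings.get(project)
```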

### Metrics (Prometheus)

```bash
# Get metrics
curl http://localhost:3161/metrics
```
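
The `/metrics` endpoint serves the standard Prometheus text exposition format; a minimal parser sketch for un-labeled gauge/counter lines (the sample metric name is made up, not one of the gateway's actual metrics):

```python
def parse_metrics(text: str) -> dict:
    """Parse simple 'name value' lines of the Prometheus text format,
    skipping comments (# HELP / # TYPE) and labeled series."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

sample = """\
# HELP requests_total Total requests (hypothetical metric name).
# TYPE requests_total counter
requests_total 42
"""
```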

## Configuration

See `.env.example` for the available environment variables.
See `.env.production` for the production template.

## GPU Setup (WSL)

To use vLLM with GPU support under WSL:

```bash
./scripts/setup-wsl-gpu.sh
```

See [WSL-GPU-SETUP.md](docs/70-onboarding/WSL-GPU-SETUP.md) for details.

## Documentation

- [Architecture](docs/00-vision-general/ARQUITECTURA-LOCAL-LLM.md)
- [WSL GPU Setup](docs/70-onboarding/WSL-GPU-SETUP.md)
- [ADR-001: Runtime Selection](docs/90-adr/ADR-001-runtime-selection.md)
- [ADR-002: Model Selection](docs/90-adr/ADR-002-model-selection.md)

## Version

- **Version:** 0.6.0
- **Status:** Production Ready (Phase 3 complete)
- **Priority:** P1 (supporting infrastructure)

## Changelog

### v0.6.0 (Phase 3 - Production)
- vLLM backend with GPU support
- Multi-LoRA adapters per project
- Prometheus metrics endpoint
- Grafana dashboard
- Continuous batching
- Project detection with confidence scoring
- Production docker-compose

### v0.5.0 (Phase 2 - MCP + Rate Limiting)
- MCP Tools (classify, extract, summarize, qa)
- Rate limiting per tier
- Basic project detection

### v0.1.0 (Phase 1 - MVP)
- NestJS gateway
- Python inference engine
- Ollama backend
- OpenAI-compatible API