Adrian Flores Cortes 3def230d58 Initial commit: local-llm-agent infrastructure project
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 16:42:45 -06:00


Local LLM Agent

Local LLM gateway for workspace-v2. It lets the agents (Claude Code, Trae, Gemini) delegate simple tasks locally to save context and tokens.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    EXTERNAL AGENTS                               │
│   Claude Code (Orchestrator) │ Trae (Executor) │ Gemini (QA)    │
└─────────────────────────────┬───────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│              LOCAL-LLM-AGENT (Port 3160)                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │           API Gateway (NestJS) - OpenAI Compatible        │  │
│  │  POST /v1/chat/completions  │  POST /mcp/tools/:name      │  │
│  │  GET/POST /v1/lora/*                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                   │
│  ┌───────────────────────────┴───────────────────────────────┐  │
│  │              Router Service                                │  │
│  │  - Tier Classification (small/main)                       │  │
│  │  - Project Detection with Confidence Scoring              │  │
│  │  - LoRA Adapter Mapping                                   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                   │
│  ┌───────────────────────────┴───────────────────────────────┐  │
│  │           Inference Engine (Python FastAPI)               │  │
│  │  - Ollama Backend (CPU, development)                      │  │
│  │  - vLLM Backend (GPU, production)                         │  │
│  │  - Multi-LoRA Support                                     │  │
│  │  - Continuous Batching                                    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                   │
│  ┌───────────────────────────┴───────────────────────────────┐  │
│  │           Monitoring (Prometheus + Grafana)               │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
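The Router Service's tier classification is not spelled out in this README. A minimal sketch of how a length-and-keyword heuristic could pick a tier — the keyword list and character threshold below are illustrative assumptions, not the project's actual rules:

```python
# Hypothetical tier-classification heuristic: short, formulaic requests go to
# the small model; anything long or open-ended falls through to the main model.
# SMALL_TIER_KEYWORDS and SMALL_TIER_MAX_CHARS are illustrative assumptions.

SMALL_TIER_KEYWORDS = {"classify", "extract", "summarize", "label"}
SMALL_TIER_MAX_CHARS = 400

def classify_tier(prompt: str) -> str:
    """Return 'small' or 'main' for an incoming prompt."""
    text = prompt.lower()
    if len(prompt) <= SMALL_TIER_MAX_CHARS and any(k in text for k in SMALL_TIER_KEYWORDS):
        return "small"
    return "main"

print(classify_tier("Classify this ticket: login button broken"))           # small
print(classify_tier("Refactor the auth module and explain the tradeoffs"))  # main
```

The real router also maps the detected project to a LoRA adapter, so its output would be richer than a single tier string.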

Quick Start

# Development (CPU with Ollama)
docker-compose up -d

# Production (GPU with vLLM)
./scripts/setup-wsl-gpu.sh   # Configure the GPU (one-time setup)
docker-compose -f docker-compose.prod.yml up -d

# vLLM only, for development
docker-compose -f docker-compose.vllm.yml up -d

# Monitoring stack
docker-compose -f docker-compose.monitoring.yml up -d

Services

Service           Port   Description
Gateway API       3160   OpenAI-compatible API gateway
Inference Engine  3161   Python inference service
Ollama Backend    11434  CPU backend (development)
vLLM Backend      8000   GPU backend (production)
Prometheus        9090   Metrics
Grafana           3000   Dashboard (admin/admin)

APIs

OpenAI-Compatible

# Chat completion
curl -X POST http://localhost:3160/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# List models
curl http://localhost:3160/v1/models
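Because the gateway speaks the OpenAI wire format, any OpenAI-style client can point at it. A standard-library sketch of building the same request as the curl example above (the URL and model name come from that example; sending the request of course requires the stack to be running):

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:3160/v1/chat/completions"

def build_chat_request(model: str, user_content: str) -> urllib.request.Request:
    """Build the POST request the gateway expects."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("gpt-oss-20b", "Hello")
# With the stack up: resp = urllib.request.urlopen(req)
print(req.get_full_url())
```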

MCP Tools

# Classify text
curl -X POST http://localhost:3160/mcp/tools/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Fix bug in login",
    "categories": ["bug", "feature", "refactor"]
  }'

# Extract structured data
curl -X POST http://localhost:3160/mcp/tools/extract \
  -H "Content-Type: application/json" \
  -d '{
    "input": "John is 30 years old and works as engineer",
    "schema": {"name": "string", "age": "number", "job": "string"}
  }'
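The extract tool is expected to return fields matching the declared schema. A sketch of how a caller might validate a response against that schema — the response shape is an assumption based only on the request format shown above:

```python
# Hypothetical validation of an extract-tool result against the request schema.
# Only the request format is documented above; the response envelope is assumed.
SCHEMA = {"name": "string", "age": "number", "job": "string"}
TYPE_MAP = {"string": str, "number": (int, float)}

def matches_schema(result: dict, schema: dict) -> bool:
    """Check that every schema field is present with the declared type."""
    return all(
        key in result and isinstance(result[key], TYPE_MAP[kind])
        for key, kind in schema.items()
    )

print(matches_schema({"name": "John", "age": 30, "job": "engineer"}, SCHEMA))  # True
print(matches_schema({"name": "John", "age": "30"}, SCHEMA))                   # False
```

A check like this is useful because small models occasionally return a number as a quoted string, which downstream agents would otherwise consume silently.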

LoRA Management

# List adapters
curl http://localhost:3160/v1/lora/adapters

# Get adapter status
curl http://localhost:3160/v1/lora/status

# View project mappings
curl http://localhost:3160/v1/lora/mappings

Metrics (Prometheus)

# Get metrics
curl http://localhost:3161/metrics

Configuration

See .env.example for the available environment variables, and .env.production for the production template.

GPU Setup (WSL)

To run vLLM with a GPU under WSL:

./scripts/setup-wsl-gpu.sh

See WSL-GPU-SETUP.md for more details.

Documentation

Version

  • Version: 0.6.0
  • Status: Production Ready (Phase 3 complete)
  • Priority: P1 (supporting infrastructure)

Changelog

v0.6.0 (Phase 3 - Production)

  • vLLM backend with GPU support
  • Multi-LoRA adapters per project
  • Prometheus metrics endpoint
  • Grafana dashboard
  • Continuous batching
  • Project detection with confidence scoring
  • Production docker-compose

v0.5.0 (Phase 2 - MCP + Rate Limiting)

  • MCP Tools (classify, extract, summarize, qa)
  • Rate limiting per tier
  • Basic project detection

v0.1.0 (Phase 1 - MVP)

  • Gateway NestJS
  • Inference Engine Python
  • Ollama backend
  • OpenAI-compatible API