Adrian Flores Cortes 3def230d58 Initial commit: local-llm-agent infrastructure project

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-02 16:42:45 -06:00

14 KiB

Raw Blame History

Requerimientos Funcionales - Local LLM Agent

Version: 1.0.0 Fecha: 2026-01-20 Proyecto: local-llm-agent Prioridad: P1 (Infraestructura) Status: En desarrollo

1. VISION GENERAL

1.1 Proposito del Sistema

Local LLM Agent es un gateway de LLM local que permite a los agentes del workspace (Claude Code, Trae, Gemini) delegar tareas simples para optimizar el uso de contexto y tokens en los modelos principales de pago.

1.2 Objetivos de Negocio

ID	Objetivo	Metrica de Exito
OBJ-001	Reducir consumo de tokens en modelos de pago	30% reduccion en tareas delegables
OBJ-002	Mantener latencia aceptable para tareas simples	< 2s para tier small, < 5s para tier main
OBJ-003	Proveer API compatible con estandar OpenAI	100% compatibilidad con endpoints basicos
OBJ-004	Soportar herramientas MCP especializadas	4 herramientas base implementadas

1.3 Stakeholders

Stakeholder	Rol	Interes
Agentes AI (Claude, Gemini, Trae)	Consumidores principales	API confiable y rapida
Desarrolladores workspace	Usuarios indirectos	Integracion transparente
Administrador de sistema	Operador	Monitoreo y mantenimiento

2. REQUERIMIENTOS FUNCIONALES

2.1 Modulo: API Gateway (NestJS)

RF-GW-001: Endpoint de Chat Completion OpenAI-Compatible

Atributo	Valor
ID	RF-GW-001
Nombre	Chat Completion API
Prioridad	MUST HAVE
Complejidad	Media
Dependencias	RF-IE-001

Descripcion: El sistema DEBE proveer un endpoint POST /v1/chat/completions que acepte requests en formato OpenAI y retorne respuestas en el mismo formato.

Criterios de Aceptacion:

Endpoint acepta Content-Type: application/json
Request body compatible con esquema OpenAI ChatCompletion
Response body compatible con esquema OpenAI ChatCompletionResponse
Soporta parametros: model, messages, max_tokens, temperature, top_p
Retorna usage con prompt_tokens, completion_tokens, total_tokens
Maneja errores con formato OpenAI error response

Request Schema:

interface ChatCompletionRequest {
  model: string;                    // Ej: "gpt-oss-20b"
  messages: Array<{
    role: "system" | "user" | "assistant";
    content: string;
  }>;
  max_tokens?: number;              // Default: 512
  temperature?: number;             // Default: 0.7
  top_p?: number;                   // Default: 0.9
  stream?: boolean;                 // Default: false (Fase 2)
}

Response Schema:

interface ChatCompletionResponse {
  id: string;                       // Ej: "chatcmpl-abc123"
  object: "chat.completion";
  created: number;                  // Unix timestamp
  model: string;
  choices: Array<{
    index: number;
    message: {
      role: "assistant";
      content: string;
    };
    finish_reason: "stop" | "length";
  }>;
  usage: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  };
}

RF-GW-002: Endpoint de Lista de Modelos

Atributo	Valor
ID	RF-GW-002
Nombre	List Models API
Prioridad	MUST HAVE
Complejidad	Baja
Dependencias	RF-IE-002

Descripcion: El sistema DEBE proveer un endpoint GET /v1/models que retorne la lista de modelos disponibles.

Criterios de Aceptacion:

Endpoint retorna lista de modelos en formato OpenAI
Incluye metadata: id, object, created, owned_by
Lista refleja modelos realmente disponibles en backend
Response cacheable por 60 segundos

Response Schema:

interface ModelsResponse {
  object: "list";
  data: Array<{
    id: string;
    object: "model";
    created: number;
    owned_by: string;
  }>;
}

RF-GW-003: Endpoint de Health Check

Atributo	Valor
ID	RF-GW-003
Nombre	Health Check API
Prioridad	MUST HAVE
Complejidad	Baja
Dependencias	-

Descripcion: El sistema DEBE proveer un endpoint GET /health que indique el estado del servicio.

Criterios de Aceptacion:

Retorna 200 OK cuando servicio esta saludable
Incluye estado de dependencias (inference-engine, ollama)
Retorna 503 si alguna dependencia critica no esta disponible
Tiempo de respuesta < 500ms

Response Schema:

interface HealthResponse {
  status: "healthy" | "degraded" | "unhealthy";
  timestamp: string;
  version: string;
  dependencies: {
    inference_engine: "up" | "down";
    ollama: "up" | "down";
  };
}

RF-GW-004: Router Service - Clasificacion de Tier

Atributo	Valor
ID	RF-GW-004
Nombre	Tier Classification
Prioridad	SHOULD HAVE
Complejidad	Media
Dependencias	RF-GW-001

Descripcion: El sistema DEBE clasificar cada request en un tier (small/main) basado en la complejidad estimada.

Criterios de Aceptacion:

Clasifica request como "small" si tokens estimados < 4096
Clasifica request como "main" si tokens estimados >= 4096
Respeta header X-Tier si se proporciona
Aplica limites de max_tokens segun tier
Registra clasificacion en logs para analisis

Logica de Clasificacion:

interface TierConfig {
  small: {
    max_context: 4096;
    max_tokens: 512;
    latency_target_ms: 500;
  };
  main: {
    max_context: 16384;
    max_tokens: 2048;
    latency_target_ms: 2000;
  };
}

2.2 Modulo: MCP Tools

RF-MCP-001: Endpoint de Lista de Herramientas

Atributo	Valor
ID	RF-MCP-001
Nombre	List MCP Tools
Prioridad	SHOULD HAVE
Complejidad	Baja
Dependencias	-

Descripcion: El sistema DEBE proveer un endpoint GET /mcp/tools que liste las herramientas MCP disponibles.

Criterios de Aceptacion:

Retorna lista de herramientas con nombre, descripcion, parametros
Cada herramienta incluye schema JSON de parametros
Lista refleja herramientas realmente implementadas

Response Schema:

interface MCPToolsResponse {
  tools: Array<{
    name: string;
    description: string;
    parameters: JSONSchema;
  }>;
}

RF-MCP-002: Herramienta Classify

Atributo	Valor
ID	RF-MCP-002
Nombre	MCP Tool: Classify
Prioridad	SHOULD HAVE
Complejidad	Media
Dependencias	RF-GW-001

Descripcion: El sistema DEBE proveer una herramienta MCP para clasificar texto en categorias predefinidas.

Criterios de Aceptacion:

Acepta texto y lista de categorias posibles
Retorna categoria seleccionada con confidence score
Usa tier "small" automaticamente
Latencia < 1s para textos < 500 caracteres

Request Schema:

interface ClassifyRequest {
  text: string;
  categories: string[];
  context?: string;
}

Response Schema:

interface ClassifyResponse {
  category: string;
  confidence: number;  // 0.0 - 1.0
  reasoning?: string;
}

RF-MCP-003: Herramienta Extract

Atributo	Valor
ID	RF-MCP-003
Nombre	MCP Tool: Extract
Prioridad	SHOULD HAVE
Complejidad	Media
Dependencias	RF-GW-001

Descripcion: El sistema DEBE proveer una herramienta MCP para extraer datos estructurados de texto.

Criterios de Aceptacion:

Acepta texto y schema de datos a extraer
Retorna datos estructurados segun schema
Maneja campos opcionales y requeridos
Retorna null para campos no encontrados

Request Schema:

interface ExtractRequest {
  text: string;
  schema: {
    fields: Array<{
      name: string;
      type: "string" | "number" | "date" | "boolean" | "array";
      description: string;
      required?: boolean;
    }>;
  };
}

Response Schema:

interface ExtractResponse {
  data: Record<string, any>;
  confidence: number;
  missing_fields?: string[];
}

RF-MCP-004: Herramienta Summarize

Atributo	Valor
ID	RF-MCP-004
Nombre	MCP Tool: Summarize
Prioridad	SHOULD HAVE
Complejidad	Media
Dependencias	RF-GW-001

Descripcion: El sistema DEBE proveer una herramienta MCP para resumir texto.

Criterios de Aceptacion:

Acepta texto y longitud objetivo del resumen
Retorna resumen respetando longitud especificada
Preserva puntos clave del texto original
Soporta formatos: paragraph, bullets

Request Schema:

interface SummarizeRequest {
  text: string;
  max_length?: number;      // Default: 200 palabras
  format?: "paragraph" | "bullets";
}

Response Schema:

interface SummarizeResponse {
  summary: string;
  word_count: number;
  key_points?: string[];
}

RF-MCP-005: Herramienta Rewrite

Atributo	Valor
ID	RF-MCP-005
Nombre	MCP Tool: Rewrite
Prioridad	SHOULD HAVE
Complejidad	Media
Dependencias	RF-GW-001

Descripcion: El sistema DEBE proveer una herramienta MCP para reescribir texto con un estilo especifico.

Criterios de Aceptacion:

Acepta texto y estilo objetivo
Soporta estilos: formal, casual, technical, simple
Preserva significado del texto original
Retorna texto reescrito

Request Schema:

interface RewriteRequest {
  text: string;
  style: "formal" | "casual" | "technical" | "simple";
  preserve_length?: boolean;
}

Response Schema:

interface RewriteResponse {
  rewritten: string;
  changes_made: number;
}

2.3 Modulo: Inference Engine (Python)

RF-IE-001: Chat Completion Backend

Atributo	Valor
ID	RF-IE-001
Nombre	Inference Chat Completion
Prioridad	MUST HAVE
Complejidad	Alta
Dependencias	Ollama

Descripcion: El Inference Engine DEBE procesar requests de chat completion contra el backend de inferencia (Ollama/vLLM).

Criterios de Aceptacion:

Recibe requests del Gateway via HTTP
Envia request a Ollama en formato nativo
Transforma respuesta a formato OpenAI
Calcula o estima token usage
Maneja timeouts y errores de backend
Soporta configuracion de modelo via environment

Estados:

READY: Backend disponible y modelo cargado
LOADING: Cargando modelo
ERROR: Backend no disponible
DEGRADED: Backend con alta latencia

RF-IE-002: Lista de Modelos Backend

Atributo	Valor
ID	RF-IE-002
Nombre	Backend Models List
Prioridad	MUST HAVE
Complejidad	Baja
Dependencias	Ollama

Descripcion: El Inference Engine DEBE consultar y retornar la lista de modelos disponibles en el backend.

Criterios de Aceptacion:

Consulta Ollama API para lista de modelos
Transforma a formato OpenAI models
Cachea resultado por 60 segundos
Maneja error si backend no disponible

RF-IE-003: Backend Abstraction Layer

Atributo	Valor
ID	RF-IE-003
Nombre	Backend Manager
Prioridad	MUST HAVE
Complejidad	Media
Dependencias	-

Descripcion: El Inference Engine DEBE abstraer el backend de inferencia para soportar multiples implementaciones (Ollama, vLLM).

Criterios de Aceptacion:

Interface comun para todos los backends
Seleccion de backend via environment variable
Fallback a Ollama si backend seleccionado no disponible
Health check por backend

Interface:

class InferenceBackend(ABC):
    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    async def list_models(self) -> List[Dict]: ...

    @abstractmethod
    async def chat_completion(
        self, model: str, messages: List[Dict], **kwargs
    ) -> Dict: ...

2.4 Modulo: Configuracion y Operaciones

RF-CFG-001: Configuracion via Environment

Atributo	Valor
ID	RF-CFG-001
Nombre	Environment Configuration
Prioridad	MUST HAVE
Complejidad	Baja
Dependencias	-

Descripcion: El sistema DEBE ser configurable via variables de entorno.

Variables Requeridas:

# Gateway
GATEWAY_PORT=3160
INFERENCE_HOST=localhost
INFERENCE_PORT=3161

# Inference Engine
INFERENCE_PORT=3161
INFERENCE_BACKEND=ollama        # ollama | vllm
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=gpt-oss-20b

# Opcional
LOG_LEVEL=info
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=9

RF-CFG-002: Logging Estructurado

Atributo	Valor
ID	RF-CFG-002
Nombre	Structured Logging
Prioridad	SHOULD HAVE
Complejidad	Baja
Dependencias	-

Descripcion: El sistema DEBE emitir logs estructurados en formato JSON.

Criterios de Aceptacion:

Logs en formato JSON
Incluye timestamp, level, message, context
Log level configurable via environment
Incluye request_id para trazabilidad

3. MATRIZ DE TRAZABILIDAD

3.1 Requerimientos por Fase

Fase	Requerimientos	Prioridad
MVP (Fase 1)	RF-GW-001, RF-GW-002, RF-GW-003, RF-IE-001, RF-IE-002, RF-IE-003, RF-CFG-001	MUST HAVE
Multi-Tool (Fase 2)	RF-GW-004, RF-MCP-001 a RF-MCP-005, RF-CFG-002	SHOULD HAVE
Produccion (Fase 3)	vLLM backend, Multi-LoRA, Continuous Batching	NICE TO HAVE

3.2 Dependencias entre Requerimientos

RF-GW-001 ─────┬───> RF-IE-001
               │
RF-GW-002 ─────┼───> RF-IE-002
               │
RF-GW-003 ─────┘

RF-GW-004 ────────> RF-GW-001

RF-MCP-001 ───────> RF-MCP-002, RF-MCP-003, RF-MCP-004, RF-MCP-005

RF-IE-001 ────────> RF-IE-003 ────────> Ollama (external)
RF-IE-002 ────────┘

4. METRICAS DE VERIFICACION

Requerimiento	Metrica	Objetivo
RF-GW-001	Latencia p95	< 2000ms
RF-GW-002	Latencia p95	< 100ms
RF-GW-003	Latencia p95	< 50ms
RF-GW-004	Precision clasificacion	> 95%
RF-IE-001	Throughput	> 10 req/min
RF-MCP-002	Accuracy	> 90%

5. REFERENCIAS

ADR-001: Runtime Selection
ADR-002: Model Selection
ARQUITECTURA-LOCAL-LLM.md
INVENTARIO.yml

Documento Controlado

Autor: Requirements-Analyst Agent
Revisor: Architecture-Analyst Agent
Aprobador: Tech-Leader

14 KiB Raw Blame History

Requerimientos Funcionales - Local LLM Agent

1. VISION GENERAL

1.1 Proposito del Sistema

1.2 Objetivos de Negocio

1.3 Stakeholders

2. REQUERIMIENTOS FUNCIONALES

2.1 Modulo: API Gateway (NestJS)

RF-GW-001: Endpoint de Chat Completion OpenAI-Compatible

RF-GW-002: Endpoint de Lista de Modelos

RF-GW-003: Endpoint de Health Check

RF-GW-004: Router Service - Clasificacion de Tier

2.2 Modulo: MCP Tools

RF-MCP-001: Endpoint de Lista de Herramientas

RF-MCP-002: Herramienta Classify

RF-MCP-003: Herramienta Extract

RF-MCP-004: Herramienta Summarize

RF-MCP-005: Herramienta Rewrite

2.3 Modulo: Inference Engine (Python)

RF-IE-001: Chat Completion Backend

RF-IE-002: Lista de Modelos Backend

RF-IE-003: Backend Abstraction Layer

2.4 Modulo: Configuracion y Operaciones

RF-CFG-001: Configuracion via Environment

RF-CFG-002: Logging Estructurado

3. MATRIZ DE TRAZABILIDAD

3.1 Requerimientos por Fase

3.2 Dependencias entre Requerimientos

4. METRICAS DE VERIFICACION

5. REFERENCIAS

14 KiB

Raw Blame History