# RF-MGN-018-002: Knowledge Bases

**Module:** MGN-018 - AI Agents & Chatbots
**Priority:** P1
**Story Points:** 13
**Status:** Defined
**Date:** 2025-12-05

## Description

The system must let a tenant create knowledge bases (KBs) that AI agents can query to give accurate answers grounded in tenant-specific information. Knowledge bases accept documents in multiple formats; each document is parsed, split into chunks, and vectorized for semantic search (RAG).

## Actors

- **Primary Actor:** Tenant Admin
- **Secondary Actors:**
  - System (processes and vectorizes documents)
  - AI Agent (queries the KB)
  - Embedding API (generates vectors)

## Preconditions

1. The tenant must have the `ai_agents_enabled` feature enabled
2. Sufficient AI credits for processing
3. Storage space available under the tenant's plan

## Main Flow - Create Knowledge Base

1. Admin opens "AI Agents > Knowledge Bases"
2. Admin selects "New Knowledge Base"
3. Admin enters a name and description
4. Admin configures options:
   - Primary language
   - Embedding model
   - Chunk size and overlap
5. System creates an empty KB
6. Admin uploads documents
7. System processes each document:
   - Extracts text
   - Splits it into chunks
   - Generates embeddings
   - Stores them in the vector store
8. System confirms successful processing
9. Admin assigns the KB to one or more agents

## Alternative Flow - Add Documents

1. Admin opens an existing KB
2. Admin selects "Add documents"
3. Admin drags in files or uses the file picker
4. System validates format and size
5. System queues the files for processing
6. System notifies the Admin when processing finishes
7. Documents become available for queries

## Supported Formats

| Format | Extensions | Size Limit | Notes |
|--------|------------|------------|-------|
| PDF | .pdf | 50 MB | OCR if scanned |
| Word | .docx, .doc | 25 MB | Preserves structure |
| Excel | .xlsx, .xls | 10 MB | Each sheet as a document |
| PowerPoint | .pptx, .ppt | 50 MB | Slide text |
| Text | .txt, .md | 10 MB | Plain text |
| HTML | .html, .htm | 10 MB | Extracts clean text |
| CSV | .csv | 10 MB | Rows as documents |
| JSON | .json | 10 MB | Structured |
## Document Processing

### Processing Pipeline

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Upload    │────▶│   Parser    │────▶│   Chunker   │
│   File      │     │   (Format)  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Store     │◀────│   Vector    │◀────│  Embedding  │
│   (pgvector)│     │   Index     │     │  Generator  │
└─────────────┘     └─────────────┘     └─────────────┘
```

### Chunking Strategy

```typescript
interface ChunkingConfig {
  strategy: 'fixed' | 'semantic' | 'paragraph';

  // For the fixed strategy
  chunk_size: number;        // Characters per chunk (default: 1000)
  chunk_overlap: number;     // Overlap in characters (default: 200)

  // For the semantic strategy
  max_chunk_size: number;
  min_chunk_size: number;
  separator_regex?: string;

  // Metadata to include
  include_title: boolean;
  include_page_number: boolean;
  include_source_url: boolean;
}
```
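
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of the `fixed` strategy. `chunkFixed` and `FixedChunk` are hypothetical names; the field names mirror the `kb_chunks` columns. The sketch counts UTF-16 code units (what `String.prototype.slice` uses), which is an assumption about what "characters" means here.

```typescript
interface FixedChunk {
  chunk_index: number;
  start_char: number; // inclusive offset into the source text
  end_char: number;   // exclusive offset into the source text
  content: string;
}

// Each chunk starts (chunkSize - chunkOverlap) characters after the previous one,
// so consecutive chunks share chunkOverlap characters.
function chunkFixed(text: string, chunkSize = 1000, chunkOverlap = 200): FixedChunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('chunk_overlap must be >= 0 and smaller than chunk_size');
  }
  const step = chunkSize - chunkOverlap;
  const chunks: FixedChunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      chunk_index: chunks.length,
      start_char: start,
      end_char: end,
      content: text.slice(start, end),
    });
    if (end === text.length) break; // the last chunk reached the end of the text
  }
  return chunks;
}
```

With the defaults, a 2,500-character document yields three chunks starting at offsets 0, 800, and 1600.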

### Embedding Model

```typescript
interface EmbeddingConfig {
  provider: 'openai' | 'cohere' | 'custom';
  model: string;           // "text-embedding-3-small", "text-embedding-3-large"
  dimensions: number;      // 1536 for OpenAI text-embedding-3-small; must match the kb_chunks.embedding column

  // Batch processing
  batch_size: number;      // Chunks per API call
  concurrent_batches: number;
}
```
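
One way `batch_size` and `concurrent_batches` could work together is sketched below. `toBatches`, `embedAll`, and the injected `EmbedBatch` callback are hypothetical; the callback stands in for whichever embedding provider is configured and is not a real SDK call.

```typescript
// Stand-in for a provider call: takes a batch of texts, resolves to one vector per text.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Splits items into consecutive groups of at most batchSize elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batch_size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Runs at most concurrentBatches calls at a time and returns vectors in input order.
async function embedAll(
  texts: string[],
  batchSize: number,
  concurrentBatches: number,
  embedBatch: EmbedBatch,
): Promise<number[][]> {
  if (!Number.isInteger(concurrentBatches) || concurrentBatches <= 0) {
    throw new Error('concurrent_batches must be a positive integer');
  }
  const batches = toBatches(texts, batchSize);
  const results = new Array<number[][]>(batches.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed batch index.
  async function worker(): Promise<void> {
    while (next < batches.length) {
      const index = next++;
      results[index] = await embedBatch(batches[index]);
    }
  }

  const workerCount = Math.min(concurrentBatches, batches.length);
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);

  return results.reduce<number[][]>((all, part) => all.concat(part), []);
}
```

Writing each result to its batch index keeps the output aligned with the input chunks even when batches finish out of order. Retries and rate-limit handling are left out of this sketch.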

## RAG Search

### Query Processing

```typescript
interface RAGQuery {
  query: string;
  knowledge_base_ids: string[];

  // Search options
  top_k: number;              // Number of chunks to return (default: 5)
  similarity_threshold: number; // Minimum similarity (default: 0.7)
  rerank?: boolean;           // Re-rank the results

  // Filters
  filters?: {
    document_ids?: string[];
    metadata?: Record<string, any>;
    date_range?: { from: Date; to: Date };
  };
}

interface RAGResult {
  chunks: Array<{
    content: string;
    document_id: string;
    document_name: string;
    page_number?: number;
    similarity_score: number;
    metadata: Record<string, any>;
  }>;

  // Context formatted for the LLM
  formatted_context: string;

  // Statistics
  search_time_ms: number;
  total_chunks_searched: number;
}
```
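
The semantics of `top_k` and `similarity_threshold` can be illustrated with an in-memory sketch: score every candidate by cosine similarity, drop those below the threshold, then keep the best `top_k`. `cosineSimilarity` and `selectTopK` are hypothetical helpers for illustration only; in this design the actual search runs in pgvector.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface Candidate { content: string; embedding: number[]; }
interface ScoredChunk { content: string; similarity_score: number; }

// Filters by threshold first, then sorts by descending similarity and truncates to topK.
function selectTopK(
  queryEmbedding: number[],
  candidates: Candidate[],
  topK = 5,
  similarityThreshold = 0.7,
): ScoredChunk[] {
  return candidates
    .map((c) => ({
      content: c.content,
      similarity_score: cosineSimilarity(queryEmbedding, c.embedding),
    }))
    .filter((c) => c.similarity_score >= similarityThreshold)
    .sort((x, y) => y.similarity_score - x.similarity_score)
    .slice(0, topK);
}
```

Because the threshold is applied before truncation, a query can return fewer than `top_k` chunks, or none at all.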

## Business Rules

- **RN-1:** Maximum number of KBs depends on the plan (e.g., 3 for Professional)
- **RN-2:** Maximum number of documents per KB depends on the plan
- **RN-3:** Total KB size depends on the plan (e.g., 500 MB)
- **RN-4:** Duplicate documents are detected and prevented
- **RN-5:** Re-processing consumes additional credits
- **RN-6:** KBs not assigned to any agent are not queried
- **RN-7:** Deleting a document deletes all of its chunks
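
As an illustration of how RN-2, RN-3, and RN-4 might be checked before a document is accepted, here is a hedged sketch. `PlanLimits`, `KbUsage`, `IncomingDocument`, and `firstViolatedRule` are hypothetical names, and the order in which the rules are evaluated is an assumption.

```typescript
interface PlanLimits {
  max_documents_per_kb: number; // RN-2 (actual value depends on the plan)
  max_kb_size_bytes: number;    // RN-3 (e.g. 500 MB)
}

interface KbUsage {
  total_documents: number;
  total_size_bytes: number;
  content_hashes: Set<string>; // SHA-256 hashes of documents already in the KB
}

interface IncomingDocument {
  file_size_bytes: number;
  content_hash: string;
}

// Returns the first violated rule, or null if the document can be accepted.
function firstViolatedRule(
  limits: PlanLimits,
  usage: KbUsage,
  doc: IncomingDocument,
): 'RN-2' | 'RN-3' | 'RN-4' | null {
  if (usage.content_hashes.has(doc.content_hash)) return 'RN-4'; // duplicate content
  if (usage.total_documents + 1 > limits.max_documents_per_kb) return 'RN-2';
  if (usage.total_size_bytes + doc.file_size_bytes > limits.max_kb_size_bytes) return 'RN-3';
  return null;
}
```

Checking the hash first avoids reporting a quota error for a file that would be skipped as a duplicate anyway.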

## Acceptance Criteria

- [ ] Admin can create a KB with a name and configuration
- [ ] All listed document formats are supported
- [ ] Progress bar during processing
- [ ] View of the chunks generated per document
- [ ] Test search available
- [ ] KB can be assigned to multiple agents
- [ ] KB usage statistics
- [ ] Delete documents individually or in bulk
- [ ] Re-process documents after errors
- [ ] Export the document list

## Entities Involved

### ai_agents.knowledge_bases

```sql
CREATE TABLE ai_agents.knowledge_bases (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES core_tenants.tenants(id),

    -- Basic information
    name VARCHAR(100) NOT NULL,
    description TEXT,
    language VARCHAR(10) DEFAULT 'es',

    -- Chunking configuration
    chunking_config JSONB DEFAULT '{
      "strategy": "fixed",
      "chunk_size": 1000,
      "chunk_overlap": 200
    }',

    -- Embedding configuration
    embedding_config JSONB DEFAULT '{
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dimensions": 1536
    }',

    -- Status
    status VARCHAR(20) DEFAULT 'active',

    -- Statistics
    total_documents INT DEFAULT 0,
    total_chunks INT DEFAULT 0,
    total_size_bytes BIGINT DEFAULT 0,
    total_tokens_used BIGINT DEFAULT 0,

    -- Timestamps and audit fields
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    last_queried_at TIMESTAMPTZ,
    created_by UUID,

    CONSTRAINT chk_status CHECK (status IN ('active', 'processing', 'archived'))
);

CREATE INDEX idx_kb_tenant ON ai_agents.knowledge_bases(tenant_id);
```

### ai_agents.kb_documents

```sql
CREATE TABLE ai_agents.kb_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    knowledge_base_id UUID NOT NULL REFERENCES ai_agents.knowledge_bases(id) ON DELETE CASCADE,

    -- File information
    original_filename VARCHAR(255) NOT NULL,
    file_type VARCHAR(20) NOT NULL, -- pdf, docx, txt, etc.
    file_size_bytes BIGINT NOT NULL,
    storage_path VARCHAR(500), -- Path in cloud storage

    -- Processing
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- pending, processing, completed, failed
    processing_started_at TIMESTAMPTZ,
    processing_completed_at TIMESTAMPTZ,
    error_message TEXT,

    -- Results
    chunks_count INT DEFAULT 0,
    pages_count INT,
    tokens_used INT DEFAULT 0,

    -- Extracted metadata
    title VARCHAR(500),
    author VARCHAR(200),
    created_date DATE,
    custom_metadata JSONB DEFAULT '{}',

    -- Hash for duplicate detection
    content_hash VARCHAR(64), -- SHA-256 of the content

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    created_by UUID,

    CONSTRAINT chk_doc_status CHECK (status IN ('pending', 'processing', 'completed', 'failed'))
);

CREATE INDEX idx_kb_docs_kb ON ai_agents.kb_documents(knowledge_base_id);
CREATE INDEX idx_kb_docs_status ON ai_agents.kb_documents(status);
CREATE INDEX idx_kb_docs_hash ON ai_agents.kb_documents(content_hash);
```
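
The `status` column lists four states but not which moves between them are valid. The sketch below shows one plausible transition table; it is an assumption inferred from the flows and from the "re-process documents" acceptance criterion, and `canTransition` is a hypothetical helper.

```typescript
type DocumentStatus = 'pending' | 'processing' | 'completed' | 'failed';

// Assumed transition table: the spec names the states but not the allowed moves.
const ALLOWED_TRANSITIONS: Record<DocumentStatus, DocumentStatus[]> = {
  pending: ['processing'],             // a worker picks the document off the queue
  processing: ['completed', 'failed'], // the pipeline finished or raised an error
  completed: ['pending'],              // explicit re-processing (consumes credits, RN-5)
  failed: ['pending'],                 // retry after an error
};

function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

Guarding status updates this way would, for example, keep a late worker from overwriting `completed` with `failed`.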

### ai_agents.kb_chunks

```sql
-- Requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE ai_agents.kb_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES ai_agents.kb_documents(id) ON DELETE CASCADE,
    knowledge_base_id UUID NOT NULL REFERENCES ai_agents.knowledge_bases(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL, -- For partitioning and fast queries

    -- Content
    content TEXT NOT NULL,
    content_tokens INT NOT NULL,

    -- Position in the document
    chunk_index INT NOT NULL,
    page_number INT,
    start_char INT,
    end_char INT,

    -- Vector embedding
    embedding vector(1536), -- Dimension of OpenAI text-embedding-3-small; embedding_config.dimensions must match

    -- Metadata
    metadata JSONB DEFAULT '{}',
    -- {
    --   "title": "Section 2.1",
    --   "headers": ["Chapter 2", "Introduction"],
    --   "source_url": "https://..."
    -- }

    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

-- Supporting indexes for the filters used in vector search
CREATE INDEX idx_kb_chunks_document ON ai_agents.kb_chunks(document_id);
CREATE INDEX idx_kb_chunks_kb ON ai_agents.kb_chunks(knowledge_base_id);
CREATE INDEX idx_kb_chunks_tenant ON ai_agents.kb_chunks(tenant_id);

-- HNSW index for similarity search (better performance)
CREATE INDEX idx_kb_chunks_embedding ON ai_agents.kb_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```

### ai_agents.kb_agent_assignments

```sql
CREATE TABLE ai_agents.kb_agent_assignments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID NOT NULL REFERENCES ai_agents.agents(id) ON DELETE CASCADE,
    knowledge_base_id UUID NOT NULL REFERENCES ai_agents.knowledge_bases(id) ON DELETE CASCADE,

    -- Agent-specific configuration
    priority INT DEFAULT 1, -- Search order when the agent has multiple KBs
    search_weight DECIMAL(3,2) DEFAULT 1.0, -- Weight applied to results

    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT uq_agent_kb UNIQUE (agent_id, knowledge_base_id)
);
```
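
The spec does not define how `priority` and `search_weight` combine when an agent has several KBs. One possible interpretation, shown purely as an assumption, multiplies each hit's similarity by the KB's `search_weight` and breaks ties by `priority` (lower first). `mergeHits`, `Assignment`, `KbHit`, and `RankedHit` are hypothetical names.

```typescript
interface Assignment {
  knowledge_base_id: string;
  priority: number;
  search_weight: number;
}

interface KbHit {
  knowledge_base_id: string;
  content: string;
  similarity_score: number;
}

interface RankedHit extends KbHit {
  priority: number;
  weighted_score: number;
}

// Assumed scoring: weighted_score = similarity_score * search_weight.
function mergeHits(assignments: Assignment[], hits: KbHit[], topK: number): RankedHit[] {
  const byKb = new Map<string, Assignment>();
  for (const a of assignments) byKb.set(a.knowledge_base_id, a);

  const ranked: RankedHit[] = [];
  for (const hit of hits) {
    const assignment = byKb.get(hit.knowledge_base_id);
    if (!assignment) continue; // KB not assigned to this agent: skipped, in the spirit of RN-6
    ranked.push({
      ...hit,
      priority: assignment.priority,
      weighted_score: hit.similarity_score * assignment.search_weight,
    });
  }
  // Highest weighted score first; on ties, the lower priority number wins.
  ranked.sort((x, y) => y.weighted_score - x.weighted_score || x.priority - y.priority);
  return ranked.slice(0, topK);
}
```

Other readings are equally plausible, for example searching KBs strictly in `priority` order until `top_k` is filled, so the intended behavior should be confirmed before implementation.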

## API Endpoints

### Create Knowledge Base

```typescript
// POST /api/v1/knowledge-bases
interface CreateKBRequest {
  name: string;
  description?: string;
  language?: string;
  chunking_config?: ChunkingConfig;
  embedding_config?: EmbeddingConfig;
}
```

### Upload Documents

```typescript
// POST /api/v1/knowledge-bases/{id}/documents
// Content-Type: multipart/form-data

// Response
interface UploadResponse {
  document_id: string;
  filename: string;
  status: 'pending';
  estimated_processing_time_seconds: number;
}
```

### Check Processing Status

```typescript
// GET /api/v1/knowledge-bases/{id}/documents/{docId}/status
interface ProcessingStatus {
  status: 'pending' | 'processing' | 'completed' | 'failed';
  progress_percent?: number;
  chunks_created?: number;
  error_message?: string;
}
```

### Test Search

```typescript
// POST /api/v1/knowledge-bases/{id}/search
interface SearchRequest {
  query: string;
  top_k?: number;
  similarity_threshold?: number;
}

interface SearchResponse {
  results: Array<{
    content: string;
    document_name: string;
    page_number?: number;
    similarity_score: number;
  }>;
  search_time_ms: number;
}
```
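
Since `top_k` and `similarity_threshold` are optional in the request, the handler needs to apply defaults and reject bad values. The sketch below assumes the same defaults as `RAGQuery` (5 and 0.7). `normalizeSearchRequest`, `NormalizedSearch`, the `MAX_TOP_K` cap, and the 0 to 1 threshold range are assumptions, not part of the spec.

```typescript
interface SearchRequest {
  query: string;
  top_k?: number;
  similarity_threshold?: number;
}

interface NormalizedSearch {
  query: string;
  top_k: number;
  similarity_threshold: number;
}

const MAX_TOP_K = 50; // hypothetical upper bound to keep responses small

function normalizeSearchRequest(req: SearchRequest): NormalizedSearch {
  const query = req.query.trim();
  if (query.length === 0) throw new Error('query must not be empty');

  const top_k = req.top_k ?? 5; // default taken from RAGQuery
  if (!Number.isInteger(top_k) || top_k < 1 || top_k > MAX_TOP_K) {
    throw new Error(`top_k must be an integer between 1 and ${MAX_TOP_K}`);
  }

  const similarity_threshold = req.similarity_threshold ?? 0.7; // default taken from RAGQuery
  // Assumption: thresholds are restricted to [0, 1]; NaN also fails this check.
  if (!(similarity_threshold >= 0 && similarity_threshold <= 1)) {
    throw new Error('similarity_threshold must be between 0 and 1');
  }

  return { query, top_k, similarity_threshold };
}
```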

## Vector Search Function

```sql
-- Semantic search function; similarity = 1 - cosine distance (<=>)
CREATE OR REPLACE FUNCTION ai_agents.search_knowledge_base(
    p_tenant_id UUID,
    p_kb_ids UUID[],
    p_query_embedding vector(1536),
    p_top_k INT DEFAULT 5,
    p_similarity_threshold DECIMAL DEFAULT 0.7
)
RETURNS TABLE (
    chunk_id UUID,
    document_id UUID,
    document_name VARCHAR,
    content TEXT,
    page_number INT,
    similarity_score DECIMAL
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        c.id,
        c.document_id,
        d.original_filename,
        c.content,
        c.page_number,
        (1 - (c.embedding <=> p_query_embedding))::DECIMAL AS similarity
    FROM ai_agents.kb_chunks c
    JOIN ai_agents.kb_documents d ON c.document_id = d.id
    WHERE c.tenant_id = p_tenant_id
      AND c.knowledge_base_id = ANY(p_kb_ids)
      AND d.status = 'completed'
      AND (1 - (c.embedding <=> p_query_embedding)) >= p_similarity_threshold
    ORDER BY c.embedding <=> p_query_embedding
    LIMIT p_top_k;
END;
$$ LANGUAGE plpgsql;
```
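
From application code, the function could be called with a parameterized query. The sketch below only builds the query text and parameter list; it assumes the embedding is passed as pgvector's text form (`[v1,v2,...]`) and cast with `::vector`. `toVectorLiteral` and `buildSearchCall` are hypothetical helpers, and the `{ text, values }` shape follows the convention used by common PostgreSQL clients for Node.js, which is an assumption about the stack.

```typescript
const EMBEDDING_DIMENSIONS = 1536; // must match vector(1536) in kb_chunks

// Serializes an embedding into pgvector's text input format, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`expected ${EMBEDDING_DIMENSIONS} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((v) => Number.isFinite(v))) {
    throw new Error('embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}

// Builds a parameterized call; no values are interpolated into the SQL text.
function buildSearchCall(
  tenantId: string,
  kbIds: string[],
  queryEmbedding: number[],
  topK = 5,
  similarityThreshold = 0.7,
): { text: string; values: unknown[] } {
  return {
    text:
      'SELECT * FROM ai_agents.search_knowledge_base(' +
      '$1::uuid, $2::uuid[], $3::vector, $4::int, $5::numeric)',
    values: [tenantId, kbIds, toVectorLiteral(queryEmbedding), topK, similarityThreshold],
  };
}
```

Validating the dimension on the client gives a clearer error than letting the database reject a mismatched vector.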

## User Interface

### Knowledge Base View

```
┌─────────────────────────────────────────────────────────────────┐
│  📚 Knowledge Base: Productos 2025                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Documents: 24    Chunks: 1,458    Size: 45.2 MB                │
│                                                                  │
│  [+ Add documents]  [⚙️ Settings]  [🔍 Test search]              │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ☐  Name                      Type   Size      Status    Chunks │
│  ───────────────────────────────────────────────────────────────│
│  ☐  catalogo-2025.pdf         PDF    12.4 MB   ✅ Ready    342   │
│  ☐  manual-usuario.docx       DOCX   2.1 MB    ✅ Ready    89    │
│  ☐  precios-mayoreo.xlsx      XLSX   1.2 MB    ✅ Ready    156   │
│  ☐  faq-clientes.md           MD     45 KB     ✅ Ready    23    │
│  ☐  politicas-devolucion.pdf  PDF    890 KB    ⏳ 45%      -     │
│                                                                  │
│                                [Delete selected]                 │
└─────────────────────────────────────────────────────────────────┘
```

### Test Search Modal

```
┌─────────────────────────────────────────────────────────────────┐
│  🔍 Test Search in Knowledge Base                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Query:                                                          │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ What is the return policy for electronic products?          ││
│  │                                                             ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  Top K: [5 ▼]    Similarity threshold: [0.7 ▼]   [🔍 Search]    │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Results (3 found in 45ms):                                     │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ 📄 politicas-devolucion.pdf (page 3)          Score: 0.92 │  │
│  │                                                            │  │
│  │ "Electronic products have a return period of 30 days      │  │
│  │ from the date of purchase..."                             │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ 📄 faq-clientes.md                            Score: 0.85 │  │
│  │                                                            │  │
│  │ "Q: Can I return a television? A: Yes, within..."         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## References

- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
- [pgvector Documentation](https://github.com/pgvector/pgvector)
- [RAG Best Practices](https://www.pinecone.io/learn/retrieval-augmented-generation/)

## Dependencies

- **Required RFs:** RF-018-001 (Agent Configuration)
- **Blocks:** RF-018-003 (Message Processing)