---
id: "RF-ML-004"
title: "Training Pipeline"
type: "Requirement"
status: "Done"
priority: "High"
epic: "OQI-006"
project: "trading-platform"
version: "1.0.0"
created_date: "2025-12-05"
updated_date: "2026-01-04"
---
# RF-ML-004: Training Pipeline

**Version:** 1.0.0
**Date:** 2025-12-05
**Epic:** OQI-006 - ML Signals and Predictions
**Priority:** P1
**Story Points:** 8

---
## Description

The system must provide a complete XGBoost model training pipeline covering historical data download, feature engineering, training, validation, and model persistence. The pipeline must be runnable on demand and must also support scheduled training.

---
## Functional Requirements

### RF-ML-004.1: Historical Data Download

The system must:

- Download historical data from the Binance API (5-minute candles)
- Support a configurable number of samples (default: 500, maximum: 5000)
- Store data in OHLCV format (Open, High, Low, Close, Volume)
- Validate data integrity (no temporal gaps)

**Configuration:**

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    symbol: str                    # e.g. "BTCUSDT"
    samples: int = 500             # Number of historical candles
    interval: str = "5m"           # Always 5 minutes
    test_split: float = 0.2        # 20% for testing
    validation_split: float = 0.1  # 10% for validation
```
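The "no temporal gaps" integrity check can be sketched as follows. This is a minimal illustration, not part of the spec: the `find_gaps` helper and the millisecond-timestamp representation are assumptions.

```python
INTERVAL_MS = 5 * 60 * 1000  # 5-minute candles, timestamps in milliseconds

def find_gaps(timestamps: list[int]) -> list[int]:
    """Return indices where the next candle is not exactly one interval later."""
    return [
        i for i in range(len(timestamps) - 1)
        if timestamps[i + 1] - timestamps[i] != INTERVAL_MS
    ]

# A download with a missing candle between index 2 and 3
ts = [0, 300_000, 600_000, 1_200_000]
assert find_gaps(ts) == [2]
```

A non-empty result would mean the download is incomplete and should be retried before training.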
### RF-ML-004.2: Feature Engineering

The system must:

- Compute the 30+ technical features (RF-ML-003)
- Drop rows containing NaN/Inf values
- Normalize features where necessary
- Create lag features where applicable

**Computed features:**

- Volatility: 8 features
- Momentum: 6 features
- Moving averages: 12 features
- Indicators: 4 features (RSI, MACD, BB)
- Volume: 1 feature
- High/Low: 6+ features
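The NaN/Inf cleanup and lag-feature steps can be sketched with pandas (the document lists Pandas/NumPy as dependencies). The function name and column names here are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def clean_and_lag(features: pd.DataFrame,
                  lag_cols: tuple = ("close",),
                  n_lags: int = 2) -> pd.DataFrame:
    """Replace Inf with NaN, append lag columns, then drop incomplete rows."""
    df = features.replace([np.inf, -np.inf], np.nan)
    for col in lag_cols:
        for k in range(1, n_lags + 1):
            df[f"{col}_lag_{k}"] = df[col].shift(k)
    return df.dropna()

df = pd.DataFrame({"close": [1.0, 2.0, np.inf, 4.0, 5.0, 6.0]})
out = clean_and_lag(df)
```

Note that dropping rows after shifting removes both the Inf row and every row whose lags reference it, which is the conservative behavior a training pipeline wants.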
### RF-ML-004.3: Target Generation

The system must generate targets for each horizon:

```python
# For each horizon (scalping, intraday, swing, position)
horizons = {
    'scalping': 6,    # 30 min
    'intraday': 18,   # 90 min
    'swing': 36,      # 3 hours
    'position': 72    # 6 hours
}

for horizon_name, n_candles in horizons.items():
    # Future max/min over the next n_candles (for each index i)
    future_high = max(high[i:i + n_candles])
    future_low = min(low[i:i + n_candles])

    # Ratios relative to the current close
    max_ratio = future_high / close[i] - 1
    min_ratio = 1 - future_low / close[i]

    # Assign as targets
    y_high[i] = max_ratio
    y_low[i] = min_ratio
```
### RF-ML-004.4: Data Split

The system must split the data into:

| Set | Percentage | Purpose |
|-----|------------|---------|
| **Training** | 70% | Train the model |
| **Validation** | 10% | Tune hyperparameters |
| **Test** | 20% | Evaluate final performance |

**Important:** the split is temporal (not random), to avoid look-ahead bias.

```python
# Temporal split
total_samples = len(X)
train_end = int(total_samples * 0.7)
val_end = int(total_samples * 0.8)

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
```
### RF-ML-004.5: XGBoost Training

The system must train two models per horizon:

- **xgb_high:** predicts `max_ratio`
- **xgb_low:** predicts `min_ratio`

**Model configuration:**

```python
from xgboost import XGBRegressor

xgb_params = {
    'objective': 'reg:squarederror',
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 1,
    'random_state': 42,
    'n_jobs': -1  # Use all cores
}

# Train
model_high = XGBRegressor(**xgb_params)
model_high.fit(X_train, y_high_train)

model_low = XGBRegressor(**xgb_params)
model_low.fit(X_train, y_low_train)
```
### RF-ML-004.6: Validation and Metrics

The system must compute performance metrics:

**Primary metrics:**

- **MAE (Mean Absolute Error):** average absolute error
- **RMSE (Root Mean Squared Error):** square root of the mean squared error
- **R² Score:** coefficient of determination

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict on the test set
y_pred_high = model_high.predict(X_test)
y_pred_low = model_low.predict(X_test)

# Compute metrics
metrics = {
    'high_mae': mean_absolute_error(y_test_high, y_pred_high),
    'high_rmse': np.sqrt(mean_squared_error(y_test_high, y_pred_high)),
    'high_r2': r2_score(y_test_high, y_pred_high),
    'low_mae': mean_absolute_error(y_test_low, y_pred_low),
    'low_rmse': np.sqrt(mean_squared_error(y_test_low, y_pred_low)),
    'low_r2': r2_score(y_test_low, y_pred_low),
    'train_samples': len(X_train),
    'test_samples': len(X_test)
}
```
**Acceptance thresholds:**

| Metric | Threshold | Meaning |
|--------|-----------|---------|
| high_mae | < 0.02 | Error < 2% |
| high_rmse | < 0.025 | RMSE < 2.5% |
| low_mae | < 0.02 | Error < 2% |
| low_rmse | < 0.03 | RMSE < 3% |
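The threshold table can be applied mechanically after validation. A minimal sketch (the `meets_thresholds` helper is an assumption, not part of the spec):

```python
# Acceptance thresholds from the table above
THRESHOLDS = {
    'high_mae': 0.02,
    'high_rmse': 0.025,
    'low_mae': 0.02,
    'low_rmse': 0.03,
}

def meets_thresholds(metrics: dict) -> bool:
    """True when every thresholded metric is strictly below its limit."""
    return all(metrics[name] < limit for name, limit in THRESHOLDS.items())

# The example metrics from the metadata.json sample pass comfortably
assert meets_thresholds({'high_mae': 0.00099, 'high_rmse': 0.00141,
                         'low_mae': 0.00173, 'low_rmse': 0.00284})
```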
### RF-ML-004.7: Model Persistence

The system must:

- Save trained models in JSON format (native XGBoost)
- Include metadata (symbol, date, metrics, version)
- Version the models

**File layout:**

```
apps/ml-services/trained_models/
├── BTCUSDT/
│   ├── scalping/
│   │   ├── xgb_high_v1.0.0.json
│   │   ├── xgb_low_v1.0.0.json
│   │   └── metadata.json
│   ├── intraday/
│   ├── swing/
│   └── position/
└── ETHUSDT/
    └── ...
```

**metadata.json:**

```json
{
  "model_version": "1.0.0",
  "symbol": "BTCUSDT",
  "horizon": "scalping",
  "trained_at": "2025-12-05T18:45:00.000Z",
  "samples": 500,
  "train_samples": 350,
  "test_samples": 100,
  "metrics": {
    "high_mae": 0.00099,
    "high_rmse": 0.00141,
    "low_mae": 0.00173,
    "low_rmse": 0.00284
  },
  "features": ["volatility_5", "rsi_14", ...],
  "xgb_params": { ... }
}
```
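The layout above can be produced with a small persistence helper. This is a sketch: the `save_artifacts` function and its signature are assumptions, though `XGBRegressor.save_model` is XGBoost's real native-JSON serialization call.

```python
import json
from pathlib import Path

MODELS_ROOT = Path("apps/ml-services/trained_models")  # layout from above

def save_artifacts(symbol: str, horizon: str, model_high, model_low,
                   metadata: dict, version: str = "1.0.0") -> Path:
    """Write both models (native XGBoost JSON) plus metadata.json."""
    out = MODELS_ROOT / symbol / horizon
    out.mkdir(parents=True, exist_ok=True)
    # XGBRegressor.save_model writes the native XGBoost JSON format
    model_high.save_model(out / f"xgb_high_v{version}.json")
    model_low.save_model(out / f"xgb_low_v{version}.json")
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return out
```

Because a new run overwrites files in place (business rule 4), bumping `version` is what preserves older models.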
### RF-ML-004.8: Scheduled Training

The system must support:

- Manual training via API: `POST /api/train/{symbol}`
- Scheduled training (cron job): weekly, every Sunday at 2:00 AM
- Automatic retraining when MAE > threshold (model degradation)
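The scheduling and degradation rules above can be sketched with the standard library. Both helper names and the 0.02 threshold (taken from the acceptance table) are illustrative assumptions:

```python
from datetime import datetime, timedelta

MAE_THRESHOLD = 0.02  # assumed: reuse the MAE acceptance threshold

def needs_retraining(current_mae: float, threshold: float = MAE_THRESHOLD) -> bool:
    """Degradation rule: retrain once live MAE exceeds the threshold."""
    return current_mae > threshold

def next_sunday_2am(now: datetime) -> datetime:
    """Next scheduled run, equivalent to cron '0 2 * * 0'."""
    target = now.replace(hour=2, minute=0, second=0, microsecond=0)
    days_ahead = (6 - now.weekday()) % 7  # Monday=0 ... Sunday=6
    target += timedelta(days=days_ahead)
    if target <= now:          # already past this week's slot
        target += timedelta(days=7)
    return target
```

In production the cron side would typically live in the scheduler (e.g. a Celery beat entry, per the dependencies section) rather than in application code.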
---

## Input Data

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| symbol | string | Trading pair | Yes |
| samples | number | Number of historical candles | No (default: 500) |
| horizon | enum | Specific horizon or "all" | No (default: "all") |
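A request validator for this table, also enforcing the sample limits from the business rules (min 100, max 5000), might look like this. The function and error messages are assumptions, not the actual service code:

```python
VALID_HORIZONS = {"scalping", "intraday", "swing", "position", "all"}

def validate_request(symbol: str, samples: int = 500, horizon: str = "all") -> dict:
    """Validate the training request; raise ValueError for a 400 response."""
    if not symbol or not symbol.isalnum():
        raise ValueError("symbol is required and must be alphanumeric")
    if not 100 <= samples <= 5000:
        raise ValueError("samples must be between 100 and 5000")
    if horizon not in VALID_HORIZONS:
        raise ValueError(f"horizon must be one of {sorted(VALID_HORIZONS)}")
    return {"symbol": symbol.upper(), "samples": samples, "horizon": horizon}
```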
---

## Output Data

### Training Start

```typescript
interface TrainingStartResponse {
  status: 'training_started' | 'already_training';
  symbol: string;
  samples: number;
  horizon: string;
  message: string;
}
```

**Example:**

```json
{
  "status": "training_started",
  "symbol": "BTCUSDT",
  "samples": 500,
  "horizon": "all",
  "message": "Model training started in background. Check /api/training/status for progress."
}
```
### Training Status

```typescript
interface TrainingStatus {
  training_in_progress: boolean;
  is_trained: boolean;
  current_symbol?: string;
  progress_pct?: number;
  last_training?: {
    symbol: string;
    timestamp: string;
    samples: number;
    metrics: {
      high_mae: number;
      high_rmse: number;
      low_mae: number;
      low_rmse: number;
      train_samples: number;
      test_samples: number;
    };
  };
}
```

**Example:**

```json
{
  "training_in_progress": false,
  "is_trained": true,
  "last_training": {
    "symbol": "BTCUSDT",
    "timestamp": "2025-12-05T18:45:23.123456Z",
    "samples": 500,
    "metrics": {
      "high_mae": 0.00099,
      "high_rmse": 0.00141,
      "low_mae": 0.00173,
      "low_rmse": 0.00284,
      "train_samples": 355,
      "test_samples": 89
    }
  }
}
```
---

## Business Rules

1. **One training at a time:** only one training process may be active
2. **Minimum samples:** at least 100 samples are required to train
3. **Maximum samples:** at most 5000 samples (Binance API limitation)
4. **Overwrite:** a new training run overwrites the previous model
5. **Pre-training validation:** verify data availability before starting
6. **Timeout:** training times out after 30 minutes
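Rule 1 maps directly onto a non-blocking lock: a second request is rejected with `already_training` rather than queued. A minimal sketch (the `start_training` wrapper is an assumption; the status strings match `TrainingStartResponse`):

```python
import threading

_training_lock = threading.Lock()

def start_training(run) -> str:
    """Run `run` only if no other training is active (business rule 1)."""
    if not _training_lock.acquire(blocking=False):
        return "already_training"
    try:
        run()
        return "training_started"
    finally:
        _training_lock.release()
```

In the real service `run` would be dispatched to a background task and the lock released on completion; here it runs inline to keep the sketch short.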
---

## Acceptance Criteria

```gherkin
Scenario: Start manual training
  GIVEN the user is an administrator
  WHEN they call POST /api/train/BTCUSDT?samples=500
  THEN training starts in the background
  AND they receive status "training_started"
  AND they can poll progress at /api/training/status

Scenario: Training completes successfully
  GIVEN training has finished
  WHEN they query /api/training/status
  THEN training_in_progress = false
  AND is_trained = true
  AND last_training contains metrics
  AND high_mae < 0.02 and low_mae < 0.02

Scenario: Training with insufficient data
  GIVEN only 50 historical candles exist
  WHEN a training run is attempted
  THEN a 400 Bad Request error is returned
  AND the message indicates "Insufficient data"

Scenario: Trained model is available
  GIVEN training finished successfully
  WHEN they call GET /api/predict/BTCUSDT
  THEN the freshly trained model is used
  AND is_trained = true in the response
```
---

## Dependencies

### Technical:

- **XGBoost 2.0+:** ML engine
- **scikit-learn:** metrics and validation
- **Pandas/NumPy:** data processing
- **Binance API:** historical data
- **Celery (optional):** background tasks

### Functional:

- **RF-ML-003:** technical indicators (features)
- Requires storage infrastructure for models
---

## Technical Notes

### Complete Pipeline

```python
# apps/ml-services/src/models/training_pipeline.py

class TrainingPipeline:
    """Complete training pipeline."""

    async def train(self, symbol: str, samples: int = 500) -> dict:
        # 1. Download data
        logger.info(f"Downloading {samples} candles for {symbol}")
        ohlcv = await market_data.fetch_ohlcv(symbol, limit=samples)

        # 2. Compute features
        logger.info("Calculating technical indicators")
        features = TechnicalIndicators.calculate_all(ohlcv)

        # 3. Generate targets for each horizon
        logger.info("Generating targets")
        targets = self._generate_targets(ohlcv)

        # 4. Train models for each horizon
        results = {}
        for horizon in ['scalping', 'intraday', 'swing', 'position']:
            logger.info(f"Training {horizon} models")

            # Split data
            X_train, X_test, y_train, y_test = self._split_data(
                features, targets[horizon]
            )

            # Train
            model_high = XGBRegressor(**xgb_params)
            model_low = XGBRegressor(**xgb_params)

            model_high.fit(X_train, y_train['high'])
            model_low.fit(X_train, y_train['low'])

            # Validate
            metrics = self._calculate_metrics(
                model_high, model_low, X_test, y_test
            )

            # Save
            self._save_models(symbol, horizon, model_high, model_low, metrics)

            results[horizon] = metrics

        return results
```
### Performance:

- Full training (4 horizons): ~2-5 minutes with 500 samples
- Data download: ~10 seconds
- Feature engineering: ~5 seconds
- XGBoost training: ~30 seconds per horizon
---

## References

- [XGBoost Training Guide](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
- [Scikit-learn Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Time Series Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)

---

**Created by:** Requirements-Analyst
**Date:** 2025-12-05
**Last updated:** 2025-12-05