# Analysis Report: RangePredictor Negative R^2
## Trading Platform - Sprint 1, Task S1-T1

**Date:** 2026-01-07
**Executor:** Claude Opus 4.5 (ML-SPECIALIST)
**Status:** COMPLETED

---

## 1. EXECUTIVE SUMMARY

### 1.1 Problem Identified

The `RangePredictor` model shows a **negative R^2** in every evaluation:

| Model | Symbol | Timeframe | Target | R^2 |
|--------|--------|-----------|--------|-----|
| GBPUSD_5m_high_h3 | GBPUSD | 5m | high | **-0.6309** |
| GBPUSD_5m_low_h3 | GBPUSD | 5m | low | **-0.6558** |
| GBPUSD_15m_high_h3 | GBPUSD | 15m | high | **-0.6944** |
| GBPUSD_15m_low_h3 | GBPUSD | 15m | low | **-0.7500** |

**Interpretation:** A negative R^2 means the model predicts WORSE than a constant forecast equal to the historical mean of the target.
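
As a minimal illustration of that interpretation, the sketch below computes R^2 as 1 - SS_res / SS_tot on synthetic values (not project data). The helper name `r2_score` and the sample numbers are assumptions made for the example. The mean baseline scores zero, and a constant but biased prediction scores below zero.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    return 1.0 - ss_res / ss_tot

# Synthetic range deltas in price units (illustrative only)
y_true = np.array([0.0004, 0.0006, 0.0005, 0.0007, 0.0003])

# Baseline: always predict the mean of the target
mean_baseline = np.full_like(y_true, y_true.mean())

# A constant prediction biased upward by 3 pips
biased_pred = mean_baseline + 0.0003

r2_baseline = r2_score(y_true, mean_baseline)
r2_biased = r2_score(y_true, biased_pred)
print(f"baseline R^2 = {r2_baseline:.4f}, biased R^2 = {r2_biased:.4f}")
```

A systematic bias alone is enough to push R^2 below zero, even when the predictions have the right order of magnitude.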

### 1.2 Impact

- Range predictions are unusable for trading
- The ML signal system is not operational
- Backtesting shows a low win rate (42.1%)

---

## 2. ROOT CAUSE ANALYSIS

### 2.1 Cause 1: Targets on an Incorrect Scale

**File:** `src/data/targets.py`

**Finding:**
The target is computed in absolute values (USD), but the features are normalized.
During training, the target values are very small (0.0005 - 0.001) because of implicit normalization.

```python
# Lines 206-207 of targets.py
df[f'delta_high_{horizon.name}'] = future_high - df['close']  # Values in USD
df[f'delta_low_{horizon.name}'] = df['close'] - future_low   # Values in USD
```

**Problem:**
- For GBPUSD, delta_high could be 0.0005 (5 pips)
- The XGBoost model struggles with values this small, likely because fixed regularization terms such as `gamma` (the minimum loss reduction required to split) are expressed in units of the loss, which shrinks with the square of the target scale
- The variance of the target is minimal compared with the noise

**Proposed Solution:**
1. Normalize targets by ATR before training
2. Express targets in pips or points instead of absolute price
3. Scale features and targets consistently
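
Option 2 above can be sketched as follows. The `PIP_SIZE` table and the `to_pips` helper are hypothetical names introduced for illustration, and the 0.0001 pip size assumes the usual 4-decimal convention for GBPUSD.

```python
import pandas as pd

# Assumption: 4-decimal pip convention; extend per symbol as needed
PIP_SIZE = {"GBPUSD": 0.0001}

def to_pips(delta: pd.Series, symbol: str) -> pd.Series:
    """Convert a price delta into pips for the given symbol."""
    return delta / PIP_SIZE[symbol]

# Illustrative deltas in price units
delta_high = pd.Series([0.0005, 0.0010, 0.0002])
delta_high_pips = to_pips(delta_high, "GBPUSD")
print(delta_high_pips.round(2).tolist())
```

A fixed rescaling like this only changes the units. ATR normalization additionally adapts the target to the current volatility regime, which is why the plan treats it as the primary fix.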

---

### 2.2 Cause 2: Features Not Predictive of the Target

**File:** `src/data/features.py`

**Finding:**
The features are mostly technical indicators (RSI, MACD, Bollinger), which:
- Are lagging indicators (based on past price)
- Have no direct relationship with future range
- Are designed for direction, not magnitude

**Current Features (Lines 17-27):**
```python
'minimal': [
    'rsi', 'macd', 'macd_signal', 'bb_upper', 'bb_lower',
    'atr', 'volume_zscore', 'returns', 'log_returns'
]
```

**Problem:**
- RSI signals overbought/oversold conditions, NOT future range
- MACD signals trend, NOT magnitude
- Only `atr` is related to future volatility

**Proposed Solution:**
1. Add volatility features: ATR lags, historical volatility
2. Add session features: hour, day of week (cyclically encoded)
3. Add volatility momentum features: change in ATR
4. Reduce direction features that are not relevant

---

### 2.3 Cause 3: Aggressive Sample Weighting

**File:** `src/training/sample_weighting.py`

**Finding:**
The sample weighting (softplus with beta=4.0) is too aggressive:
- It reduces the weight of "normal" moves to almost zero
- Training is effectively driven by extreme moves only
- This biases the model toward high-range predictions

**Current Configuration (Lines 66-69):**
```python
softplus_beta: float = 4.0      # VERY aggressive
softplus_w_max: float = 3.0
```

**Problem:**
- The model effectively learns from only 24-35% of the data (high flow periods)
- Predictions are biased toward high values
- Prediction variance is very low (it does not capture the real distribution)

**Proposed Solution:**
1. Reduce softplus_beta to 2.0 or less
2. Increase min_weight to include more samples
3. Consider uniform weighting as a baseline
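
One way to quantify how much data the weighting really uses is the Kish effective sample size, (sum w)^2 / sum(w^2). The sketch below is illustrative only. The weighting formula in `softplus_weights` is an assumed form, not the one in `sample_weighting.py`, and `ratio` stands for a hypothetical "move size relative to normal" input.

```python
import numpy as np

def softplus_weights(ratio, beta, w_max, min_weight=0.1):
    """Assumed weighting form: softplus(beta * (ratio - 1)) / beta, clipped."""
    ratio = np.asarray(ratio, dtype=float)
    # logaddexp(0, x) is a numerically stable log(1 + exp(x))
    raw = np.logaddexp(0.0, beta * (ratio - 1.0)) / beta
    return np.clip(raw, min_weight, w_max)

def effective_sample_fraction(weights):
    """Kish effective sample size divided by the number of samples."""
    w = np.asarray(weights, dtype=float)
    return (w.sum() ** 2) / (np.sum(w ** 2) * len(w))

# Three illustrative samples: a small, a typical, and a large move
ratio = np.array([0.5, 1.0, 2.0])

frac_beta4 = effective_sample_fraction(softplus_weights(ratio, beta=4.0, w_max=3.0))
frac_beta2 = effective_sample_fraction(softplus_weights(ratio, beta=2.0, w_max=2.0))
frac_uniform = effective_sample_fraction(np.ones_like(ratio))
print(frac_beta4, frac_beta2, frac_uniform)
```

Under this assumed form, the lower beta spreads weight more evenly, so a larger fraction of the samples contributes to training. Running the same diagnostic on the real weights would show whether the 24-35% figure improves.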

---

### 2.4 Cause 4: Potential Data Leakage

**File:** `src/training/sample_weighting.py`, `src/data/corrected_targets.py`

**Finding:**
Although `shift(1)` is applied to the rolling-median factor, leakage is still possible in:
1. Targets that include the current price when computing future values
2. Features that implicitly use future data

**Verification Required:**
```python
# Lines 126-129 of sample_weighting.py
factor = candle_range.rolling(
    window=window,
    min_periods=min_periods
).median().shift(1)  # Correct - uses shift(1)

# Lines 190-195 of targets.py - VERIFY
for i in range(start, end + 1):  # start=1, correct
    future_highs.append(df['high'].shift(-i))
```

**Result:** The target code uses `start_offset=1`, which is correct.
There is no evident data leakage in the targets, but the features still need to be verified.
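
A common way to run the pending feature check is a truncation test: compute the features on the full history and again on a history cut at some bar, then compare the overlapping rows. Any column that differs depends on data after the cut. The sketch below assumes a feature function that maps a DataFrame to a DataFrame of numeric columns. `check_no_lookahead` and `example_features` are hypothetical names, not functions from the project.

```python
import numpy as np
import pandas as pd

def check_no_lookahead(feature_fn, df, cutoff):
    """Return the feature columns whose first `cutoff` rows change
    when rows after `cutoff` are removed from the input."""
    full = feature_fn(df).iloc[:cutoff]
    truncated = feature_fn(df.iloc[:cutoff])
    mismatch = ~np.isclose(
        full.to_numpy(dtype=float),
        truncated.to_numpy(dtype=float),
        equal_nan=True,
    )
    return list(full.columns[mismatch.any(axis=0)])

def example_features(df):
    """Toy feature set: one causal feature and one deliberately leaky one."""
    out = pd.DataFrame(index=df.index)
    out["safe_ma"] = df["close"].rolling(3).mean()   # uses past bars only
    out["leaky_next"] = df["close"].shift(-1)        # peeks at the next bar
    return out

bars = pd.DataFrame({"close": np.arange(10, dtype=float)})
leaky_columns = check_no_lookahead(example_features, bars, cutoff=6)
print(leaky_columns)
```

Running the check at several cutoffs makes it more likely to catch features whose look-ahead only shows up near specific boundaries.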

---

### 2.5 Cause 5: Untuned XGBoost Hyperparameters

**File:** `src/models/range_predictor.py`

**Current Configuration (Lines 146-162):**
```python
'xgboost': {
    'n_estimators': 200,
    'max_depth': 5,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
}
```

**Problem:**
- `max_depth=5` may be too deep for noisy data
- `learning_rate=0.05` combined with `n_estimators=200` may overfit
- `min_child_weight=3` may be too low

**Proposed Solution:**
1. Reduce `max_depth` to 3
2. Increase `min_child_weight` to 10
3. Increase regularization (`reg_alpha`, `reg_lambda`)
4. Use more aggressive early stopping

---

## 3. CORRECTION PLAN

### 3.1 Phase 1: Target Correction (Priority HIGH)

**File:** `src/data/targets.py`

**Changes:**
1. Normalize targets by ATR:
```python
# Add normalization
df[f'delta_high_{horizon.name}_norm'] = (future_high - df['close']) / df['ATR']
df[f'delta_low_{horizon.name}_norm'] = (df['close'] - future_low) / df['ATR']
```

2. Use the normalized targets in training

**Expected Benefit:** Targets on a scale of roughly [-3, 3] instead of [0, 0.001]

---

### 3.2 Phase 2: Feature Correction (Priority HIGH)

**File:** `src/data/features.py`

**Changes:**
1. Add volatility features:
```python
'volatility': [
    'atr',
    'atr_ratio',  # ATR / rolling_median(ATR)
    'atr_pct_change',
    'range_pct',  # (high-low)/close
    'true_range',
    'realized_volatility_10',
    'realized_volatility_20'
]
```

2. Add session features (these already exist in `create_time_features`):
```python
# Already implemented correctly
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```

3. Use only features that are relevant for range prediction
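
The volatility feature names listed above could be computed along these lines. This is a sketch under assumptions: the input columns are named `high`, `low`, and `close`, ATR is a simple rolling mean of the true range (Wilder smoothing is another common choice), and the window lengths are placeholders. `add_volatility_features` is a hypothetical helper, not an existing function in `features.py`.

```python
import numpy as np
import pandas as pd

def add_volatility_features(df, atr_window=14, median_window=50):
    """Append range-oriented features using only current and past bars."""
    out = df.copy()
    prev_close = out["close"].shift(1)
    # True range: largest of the bar range and the gaps from the prior close
    out["true_range"] = pd.concat(
        [
            out["high"] - out["low"],
            (out["high"] - prev_close).abs(),
            (out["low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    out["atr"] = out["true_range"].rolling(atr_window).mean()
    out["atr_ratio"] = out["atr"] / out["atr"].rolling(median_window).median()
    out["atr_pct_change"] = out["atr"] / out["atr"].shift(1) - 1.0
    out["range_pct"] = (out["high"] - out["low"]) / out["close"]
    log_ret = np.log(out["close"]).diff()
    out["realized_volatility_10"] = log_ret.rolling(10).std()
    out["realized_volatility_20"] = log_ret.rolling(20).std()
    return out

# Synthetic bars for demonstration only
rng = np.random.default_rng(0)
close = 1.25 + np.cumsum(rng.normal(0.0, 0.0005, size=120))
spread = rng.uniform(0.0002, 0.0010, size=120)
bars = pd.DataFrame({"high": close + spread, "low": close - spread, "close": close})
feats = add_volatility_features(bars)
print(feats.tail(3))
```

The first rows of the rolling features are NaN until each window fills, so they need to be dropped or handled before training.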

---

### 3.3 Phase 3: Sample Weighting Adjustment (Priority MEDIUM)

**File:** `src/training/sample_weighting.py`

**Changes:**
```python
# Less aggressive configuration
SampleWeightConfig(
    softplus_beta=2.0,        # Reduced from 4.0
    softplus_w_max=2.0,       # Reduced from 3.0
    min_weight=0.3,           # Increased from 0.1
    filter_low_ratio=False    # Include all samples
)
```

---

### 3.4 Phase 4: Hyperparameter Optimization (Priority MEDIUM)

**File:** `src/models/range_predictor.py`

**Changes:**
```python
'xgboost': {
    'n_estimators': 100,      # Reduced from 200
    'max_depth': 3,           # Reduced from 5
    'learning_rate': 0.03,    # Reduced from 0.05
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 10,   # Increased from 3
    'gamma': 0.5,             # Increased from 0.1
    'reg_alpha': 1.0,         # Increased from 0.1
    'reg_lambda': 10.0,       # Increased from 1.0
}
```

---

## 4. SUCCESS CRITERIA

| Metric | Current Value | Minimum Acceptable | Target |
|--------|---------------|--------------------|--------|
| R^2 (validation) | -0.65 | > 0.05 | > 0.15 |
| MAE (normalized) | N/A | < 0.5 ATR | < 0.3 ATR |
| Direction | 98% | > 60% | > 65% |
| Backtest Win Rate | 42% | > 50% | > 55% |

---

## 5. EXECUTION ORDER

1. **S1-T2:** Implement ATR normalization of targets
2. **S1-T3:** Verify that there is no data leakage in the features
3. **S1-T4a:** Reduce the aggressiveness of sample weighting
4. **S1-T4b:** Adjust XGBoost hyperparameters
5. **S1-T5:** Retrain models with the corrections
6. **S1-T6:** Validate R^2 > 0 on OOS data

---

## 6. FILES TO MODIFY

| File | Type of Change | Estimated Lines |
|------|----------------|-----------------|
| `src/data/targets.py` | Add normalization | +20 |
| `src/data/features.py` | Add volatility features | +50 |
| `src/training/sample_weighting.py` | Reduce aggressiveness | ~10 |
| `src/models/range_predictor.py` | Adjust hyperparameters | ~15 |
| `scripts/train_symbol_timeframe_models.py` | Use normalized targets | ~20 |

---

## 7. RISKS

| Risk | Probability | Mitigation |
|------|-------------|------------|
| R^2 remains negative | Medium | Plan B: baseline model (moving average) |
| Normalization introduces leakage | Low | Use ATR shift(1) |
| Overfitting to the new hyperparameters | Medium | Walk-forward validation |
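
The walk-forward validation named as a mitigation can be sketched with an expanding training window. The function name, the fold sizing, and the `gap` parameter are assumptions for illustration. A gap equal to the prediction horizon (3 bars for the `h3` models) is one way to keep training targets from overlapping the test window.

```python
def walk_forward_splits(n_samples, n_folds, min_train, gap=0):
    """Yield (train_idx, test_idx) ranges with an expanding training window.

    Each test window starts `gap` bars after the end of its training window.
    """
    test_size = (n_samples - min_train - gap) // n_folds
    if test_size <= 0:
        raise ValueError("Not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        test_start = train_end + gap
        test_end = test_start + test_size
        yield range(0, train_end), range(test_start, test_end)

# Example: 100 bars, 3 folds, at least 40 bars of training, 3-bar gap
splits = list(walk_forward_splits(n_samples=100, n_folds=3, min_train=40, gap=3))
for train_idx, test_idx in splits:
    print(f"train [0, {train_idx.stop}) -> test [{test_idx.start}, {test_idx.stop})")
```

Hyperparameters chosen on one fold can then be checked against later folds, which gives a more honest view of overfitting than a single train/validation split.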

---

**Report completed:** 2026-01-07
**Next step:** S1-T2 - Implement target normalization