---
id: FEATURES-TARGETS-ML
title: Feature and Target Catalog - Machine Learning
type: Documentation
project: trading-platform
version: 1.0.0
updated_date: 2026-01-04
---

# Feature and Target Catalog - Machine Learning

**Version**: 1.0.0 | **Date**: 2025-12-05 | **Module**: OQI-006-ml-signals | **Author**: Trading Strategist - Trading Platform


## Table of Contents

1. Introduction
2. Base Features (21)
3. AMD Features (25)
4. ICT Features (15)
5. SMC Features (12)
6. Liquidity Features (10)
7. Microstructure Features (8)
8. Model Targets
9. Feature Engineering Pipeline
10. Technical Considerations

## Introduction

This document defines the complete catalog of features (input variables) and targets (prediction objectives) used by the Trading Platform ML models.

### Total Dimensions

| Category | Features | Models using them |
|---|---|---|
| Base Technical | 21 | All |
| AMD | 25 | AMDDetector, Range, TPSL |
| ICT | 15 | Range, TPSL, Orchestrator |
| SMC | 12 | Range, TPSL, Orchestrator |
| Liquidity | 10 | LiquidityHunter, TPSL |
| Microstructure | 8 | OrderFlow (optional) |
| **Total** | **~91 features** | - |

## Base Features (21)

### Category: Volatility (8)

| Feature | Formula | Range | Description |
|---|---|---|---|
| volatility_5 | `close.pct_change().rolling(5).std()` | [0, ∞) | 5-period volatility |
| volatility_10 | `close.pct_change().rolling(10).std()` | [0, ∞) | 10-period volatility |
| volatility_20 | `close.pct_change().rolling(20).std()` | [0, ∞) | 20-period volatility |
| volatility_50 | `close.pct_change().rolling(50).std()` | [0, ∞) | 50-period volatility |
| atr_5 | `TrueRange.rolling(5).mean()` | [0, ∞) | 5-period Average True Range |
| atr_10 | `TrueRange.rolling(10).mean()` | [0, ∞) | 10-period Average True Range |
| atr_14 | `TrueRange.rolling(14).mean()` | [0, ∞) | 14-period Average True Range (standard) |
| atr_ratio | `atr_14 / atr_14.rolling(50).mean()` | [0, ∞) | Current ATR vs. its average |
```python
import numpy as np
import pandas as pd

def calculate_volatility_features(df):
    features = {}
    for period in [5, 10, 20, 50]:
        features[f'volatility_{period}'] = df['close'].pct_change().rolling(period).std()

    # ATR: true range is the max of the three candidate ranges
    high_low = df['high'] - df['low']
    high_close = np.abs(df['high'] - df['close'].shift())
    low_close = np.abs(df['low'] - df['close'].shift())
    true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)

    for period in [5, 10, 14]:
        features[f'atr_{period}'] = true_range.rolling(period).mean()

    features['atr_ratio'] = features['atr_14'] / features['atr_14'].rolling(50).mean()

    return features
```

### Category: Momentum (6)

| Feature | Formula | Range | Description |
|---|---|---|---|
| momentum_5 | `close - close.shift(5)` | (-∞, ∞) | 5-period momentum |
| momentum_10 | `close - close.shift(10)` | (-∞, ∞) | 10-period momentum |
| momentum_20 | `close - close.shift(20)` | (-∞, ∞) | 20-period momentum |
| roc_5 | `(close / close.shift(5) - 1) * 100` | (-100, ∞) | 5-period Rate of Change |
| roc_10 | `(close / close.shift(10) - 1) * 100` | (-100, ∞) | 10-period Rate of Change |
| rsi_14 | See RSI formula | [0, 100] | Relative Strength Index |
```python
def calculate_momentum_features(df):
    features = {}

    # Momentum (absolute price change)
    for period in [5, 10, 20]:
        features[f'momentum_{period}'] = df['close'] - df['close'].shift(period)

    # Rate of Change (percentage change); only 5 and 10 are in the catalog
    for period in [5, 10]:
        features[f'roc_{period}'] = (df['close'] / df['close'].shift(period) - 1) * 100

    # RSI (Wilder smoothing approximated with a simple rolling mean)
    delta = df['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    features['rsi_14'] = 100 - (100 / (1 + rs))

    return features
```

### Category: Moving Averages (7)

| Feature | Formula | Range | Description |
|---|---|---|---|
| sma_10 | `close.rolling(10).mean()` | [0, ∞) | Simple Moving Average 10 |
| sma_20 | `close.rolling(20).mean()` | [0, ∞) | Simple Moving Average 20 |
| sma_50 | `close.rolling(50).mean()` | [0, ∞) | Simple Moving Average 50 |
| sma_ratio_10 | `close / sma_10` | [0, ∞) | Price/SMA10 ratio |
| sma_ratio_20 | `close / sma_20` | [0, ∞) | Price/SMA20 ratio |
| sma_ratio_50 | `close / sma_50` | [0, ∞) | Price/SMA50 ratio |
| sma_slope_20 | `sma_20.diff(5) / 5` | (-∞, ∞) | SMA20 slope |
```python
def calculate_ma_features(df):
    features = {}

    for period in [10, 20, 50]:
        features[f'sma_{period}'] = df['close'].rolling(period).mean()
        features[f'sma_ratio_{period}'] = df['close'] / features[f'sma_{period}']

    # Average per-bar slope over the last 5 bars
    features['sma_slope_20'] = features['sma_20'].diff(5) / 5

    return features
```

## AMD Features (25)

### Category: Price Action (10)

| Feature | Calculation | Range | Use |
|---|---|---|---|
| range_ratio | `(high - low) / high.rolling(20).mean()` | [0, ∞) | Range compression |
| range_ma | `(high - low).rolling(20).mean()` | [0, ∞) | Average range |
| hl_range_pct | `(high - low) / close` | [0, 1] | Range as % of price |
| body_size | `abs(close - open) / (high - low)` | [0, 1] | Candle body size |
| upper_wick | `(high - max(close, open)) / (high - low)` | [0, 1] | Upper wick |
| lower_wick | `(min(close, open) - low) / (high - low)` | [0, 1] | Lower wick |
| buying_pressure | `(close - low) / (high - low)` | [0, 1] | Buying pressure |
| selling_pressure | `(high - close) / (high - low)` | [0, 1] | Selling pressure |
| close_position | `(close - low) / (high - low)` | [0, 1] | Position of the close |
| range_expansion | `(high - low) / (high - low).shift(1)` | [0, ∞) | Range expansion |
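Unlike the other categories, no reference implementation accompanies this table. A minimal sketch of the formulas above (the function name `calculate_price_action_features` is ours, not the platform's) might look like:

```python
import pandas as pd

def calculate_price_action_features(df):
    # Sketch of the Price Action features from the table above.
    features = {}
    # Guard against zero-range bars before dividing
    candle_range = (df['high'] - df['low']).replace(0, float('nan'))

    features['range_ma'] = (df['high'] - df['low']).rolling(20).mean()
    features['range_ratio'] = (df['high'] - df['low']) / df['high'].rolling(20).mean()
    features['hl_range_pct'] = (df['high'] - df['low']) / df['close']
    features['body_size'] = (df['close'] - df['open']).abs() / candle_range
    features['upper_wick'] = (df['high'] - df[['close', 'open']].max(axis=1)) / candle_range
    features['lower_wick'] = (df[['close', 'open']].min(axis=1) - df['low']) / candle_range
    features['buying_pressure'] = (df['close'] - df['low']) / candle_range
    features['selling_pressure'] = (df['high'] - df['close']) / candle_range
    features['close_position'] = features['buying_pressure']  # same formula in the table
    features['range_expansion'] = (df['high'] - df['low']) / (df['high'] - df['low']).shift(1)

    return features
```

Note that `close_position` and `buying_pressure` share the same formula in the catalog; keeping both preserves the documented 10-feature count.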

### Category: Volume (8)

| Feature | Calculation | Description |
|---|---|---|
| volume_ratio | `volume / volume.rolling(20).mean()` | Volume vs. average |
| volume_trend | `volume.rolling(10).mean() - volume.rolling(30).mean()` | Volume trend |
| volume_ma | `volume.rolling(20).mean()` | Average volume |
| volume_spike_count | `(volume > volume_ma * 2).rolling(30).sum()` | Recent spikes |
| obv | See OBV calculation | On-Balance Volume |
| obv_slope | `obv.diff(5) / 5` | OBV trend |
| vwap_distance | `(close - vwap) / close` | Distance to VWAP |
| volume_on_up | See calculation | Volume on up moves |
```python
def calculate_volume_features(df):
    features = {}

    features['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    features['volume_trend'] = df['volume'].rolling(10).mean() - df['volume'].rolling(30).mean()
    features['volume_ma'] = df['volume'].rolling(20).mean()
    features['volume_spike_count'] = (df['volume'] > features['volume_ma'] * 2).rolling(30).sum()

    # OBV: add volume on up closes, subtract it on down closes
    obv = (df['volume'] * ((df['close'] > df['close'].shift(1)).astype(int) * 2 - 1)).cumsum()
    features['obv'] = obv
    features['obv_slope'] = obv.diff(5) / 5

    # VWAP (cumulative; session resets not shown here)
    vwap = (df['close'] * df['volume']).cumsum() / df['volume'].cumsum()
    features['vwap_distance'] = (df['close'] - vwap) / df['close']

    return features
```

### Category: Market Structure (7)

| Feature | Calculation | Use |
|---|---|---|
| higher_highs_count | `(high > high.shift(1)).rolling(10).sum()` | HH count |
| higher_lows_count | `(low > low.shift(1)).rolling(10).sum()` | HL count |
| lower_highs_count | `(high < high.shift(1)).rolling(10).sum()` | LH count |
| lower_lows_count | `(low < low.shift(1)).rolling(10).sum()` | LL count |
| swing_high_distance | `(swing_high_20 - close) / close` | Distance to swing high |
| swing_low_distance | `(close - swing_low_20) / close` | Distance to swing low |
| market_structure_score | See calculation | Structure score |
```python
def calculate_market_structure_features(df):
    features = {}

    features['higher_highs_count'] = (df['high'] > df['high'].shift(1)).rolling(10).sum()
    features['higher_lows_count'] = (df['low'] > df['low'].shift(1)).rolling(10).sum()
    features['lower_highs_count'] = (df['high'] < df['high'].shift(1)).rolling(10).sum()
    features['lower_lows_count'] = (df['low'] < df['low'].shift(1)).rolling(10).sum()

    swing_high = df['high'].rolling(20).max()
    swing_low = df['low'].rolling(20).min()

    features['swing_high_distance'] = (swing_high - df['close']) / df['close']
    features['swing_low_distance'] = (df['close'] - swing_low) / df['close']

    # Market structure score (-1 bearish, +1 bullish)
    bullish_score = (features['higher_highs_count'] + features['higher_lows_count']) / 20
    bearish_score = (features['lower_highs_count'] + features['lower_lows_count']) / 20
    features['market_structure_score'] = bullish_score - bearish_score

    return features
```

## ICT Features (15)

### Category: OTE & Fibonacci (5)

| Feature | Calculation | Range | Description |
|---|---|---|---|
| ote_position | `(close - swing_low) / (swing_high - swing_low)` | [0, 1] | Position in range |
| in_discount_zone | `1 if ote_position < 0.38 else 0` | {0, 1} | In discount zone |
| in_premium_zone | `1 if ote_position > 0.62 else 0` | {0, 1} | In premium zone |
| in_ote_buy_zone | `1 if 0.62 <= ote_position <= 0.79 else 0` | {0, 1} | In buy-side OTE |
| fib_distance_50 | `abs(ote_position - 0.5)` | [0, 0.5] | Distance to equilibrium |

### Category: Killzones & Timing (5)

| Feature | Calculation | Description |
|---|---|---|
| is_london_kz | Based on EST hour | London killzone |
| is_ny_kz | Based on EST hour | New York killzone |
| is_asian_kz | Based on EST hour | Asian killzone |
| session_strength | 0-1 per killzone | Session strength |
| session_overlap | Overlap detection | London/NY overlap |
```python
def calculate_ict_features(df):
    features = {}

    # OTE position within the 50-bar swing range
    swing_high = df['high'].rolling(50).max()
    swing_low = df['low'].rolling(50).min()
    range_size = swing_high - swing_low

    features['ote_position'] = (df['close'] - swing_low) / (range_size + 1e-8)
    features['in_discount_zone'] = (features['ote_position'] < 0.38).astype(int)
    features['in_premium_zone'] = (features['ote_position'] > 0.62).astype(int)
    features['in_ote_buy_zone'] = (
        (features['ote_position'] >= 0.62) & (features['ote_position'] <= 0.79)
    ).astype(int)
    features['fib_distance_50'] = np.abs(features['ote_position'] - 0.5)

    # Killzones (wrap the hour index in a Series so it aligns with df)
    hour_est = pd.Series(df.index.tz_convert('America/New_York').hour, index=df.index)
    features['is_london_kz'] = ((hour_est >= 2) & (hour_est < 5)).astype(int)
    features['is_ny_kz'] = ((hour_est >= 8) & (hour_est < 11)).astype(int)
    features['is_asian_kz'] = (hour_est >= 20).astype(int)  # 20:00 EST to midnight

    # Session strength (features is a dict, so build the Series explicitly)
    session_strength = pd.Series(0.1, index=df.index)  # default
    session_strength[features['is_asian_kz'] == 1] = 0.3
    session_strength[features['is_london_kz'] == 1] = 0.9
    session_strength[features['is_ny_kz'] == 1] = 1.0
    features['session_strength'] = session_strength

    # Session overlap: London close + NY open
    features['session_overlap'] = ((hour_est >= 10) & (hour_est < 12)).astype(int)

    return features
```

### Category: Ranges (5)

| Feature | Calculation | Description |
|---|---|---|
| weekly_range_position | Position within weekly range | 0-1 |
| daily_range_position | Position within daily range | 0-1 |
| weekly_range_size | Weekly high - low | Absolute |
| daily_range_size | Daily high - low | Absolute |
| range_expansion_daily | Current/average range ratio | >1 = expansion |
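This table ships without a reference implementation. A sketch under two assumptions (a tz-aware `DatetimeIndex`, and running day/week extremes so no future bars leak in; the function name `calculate_range_features` is ours):

```python
import pandas as pd

def calculate_range_features(df):
    # Illustrative sketch for the Ranges table. Uses the expanding high/low of
    # the current day/week up to the current bar, keeping the feature causal.
    features = {}

    day = df.index.normalize()
    day_high = df['high'].groupby(day).cummax()
    day_low = df['low'].groupby(day).cummin()
    features['daily_range_size'] = day_high - day_low
    features['daily_range_position'] = (
        (df['close'] - day_low) / (features['daily_range_size'] + 1e-8)
    )

    week = df.index.to_period('W')
    week_high = df['high'].groupby(week).cummax()
    week_low = df['low'].groupby(week).cummin()
    features['weekly_range_size'] = week_high - week_low
    features['weekly_range_position'] = (
        (df['close'] - week_low) / (features['weekly_range_size'] + 1e-8)
    )

    # Expansion: today's running range vs. its 20-bar average
    features['range_expansion_daily'] = (
        features['daily_range_size'] / features['daily_range_size'].rolling(20).mean()
    )

    return features
```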

## SMC Features (12)

### Category: Structure Breaks (6)

| Feature | Calculation | Use |
|---|---|---|
| choch_bullish_count | Count over 30-bar window | Bullish CHoCHs |
| choch_bearish_count | Count over 30-bar window | Bearish CHoCHs |
| bos_bullish_count | Count over 30-bar window | Bullish BOS |
| bos_bearish_count | Count over 30-bar window | Bearish BOS |
| choch_recency | Bars since last CHoCH | 0 = very recent |
| bos_recency | Bars since last BOS | 0 = very recent |
```python
def calculate_smc_features(df):
    features = {}

    # Detect CHoCHs and BOS (helpers provided by the SMC module)
    choch_signals = detect_choch(df, window=20)
    bos_signals = detect_bos(df, window=20)

    # Count by type
    features['choch_bullish_count'] = count_signals_in_window(
        choch_signals, 'bullish_choch', window=30
    )
    features['choch_bearish_count'] = count_signals_in_window(
        choch_signals, 'bearish_choch', window=30
    )
    features['bos_bullish_count'] = count_signals_in_window(
        bos_signals, 'bullish_bos', window=30
    )
    features['bos_bearish_count'] = count_signals_in_window(
        bos_signals, 'bearish_bos', window=30
    )

    # Recency
    features['choch_recency'] = bars_since_last_signal(choch_signals)
    features['bos_recency'] = bars_since_last_signal(bos_signals)

    return features
```
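The helpers used above (`detect_choch`, `detect_bos`, `count_signals_in_window`, `bars_since_last_signal`) live in the SMC module and operate on its signal objects. As an illustration of the recency logic only, a minimal `bars_since_last_signal` over a plain boolean signal series could look like this (an assumption-laden sketch, not the module's implementation):

```python
import numpy as np
import pandas as pd

def bars_since_last_signal(signal_mask):
    # signal_mask: boolean Series, True where a signal fired.
    # Returns bars elapsed since the most recent signal (NaN before the first).
    bar_idx = pd.Series(np.arange(len(signal_mask)), index=signal_mask.index, dtype=float)
    last_signal = bar_idx.where(signal_mask).ffill()
    return bar_idx - last_signal

mask = pd.Series([False, True, False, False, True, False])
print(bars_since_last_signal(mask).tolist())  # [nan, 0.0, 1.0, 2.0, 0.0, 1.0]
```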

### Category: Displacement & Flow (6)

| Feature | Calculation | Description |
|---|---|---|
| displacement_strength | Move / ATR | Displacement strength |
| displacement_direction | 1=bullish, -1=bearish, 0=neutral | Direction |
| displacement_recency | Bars since last displacement | Recency |
| inducement_count | Count over 20-bar window | Detected inducements |
| inducement_bullish | Bullish inducement count | Bullish traps |
| inducement_bearish | Bearish inducement count | Bearish traps |
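No code accompanies this table either. A sketch of the two simplest columns, under the assumption that a "displacement" is a candle whose body exceeds some ATR multiple (the threshold, the definition, and the function name are ours, not the platform's canonical detector):

```python
import numpy as np
import pandas as pd

def calculate_displacement_features(df, atr, threshold=1.5):
    # Illustrative sketch: a bar is a displacement when its body exceeds
    # `threshold` ATRs.
    features = {}
    body = df['close'] - df['open']

    features['displacement_strength'] = body.abs() / (atr + 1e-8)
    is_displacement = features['displacement_strength'] > threshold
    # +1 bullish, -1 bearish on displacement bars, 0 elsewhere
    features['displacement_direction'] = np.sign(body).where(is_displacement, 0.0)

    return features
```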

## Liquidity Features (10)

| Feature | Calculation | Range | Description |
|---|---|---|---|
| bsl_distance | `(bsl_level - close) / close` | [0, ∞) | Distance to BSL |
| ssl_distance | `(close - ssl_level) / close` | [0, ∞) | Distance to SSL |
| bsl_density | Count of nearby BSL levels | [0, ∞) | BSL density |
| ssl_density | Count of nearby SSL levels | [0, ∞) | SSL density |
| bsl_strength | Volume at BSL level | [0, ∞) | BSL strength |
| ssl_strength | Volume at SSL level | [0, ∞) | SSL strength |
| liquidity_grab_count | Recent sweep count | [0, ∞) | Recent sweeps |
| bsl_sweep_recent | 1 if recent sweep | {0, 1} | BSL swept |
| ssl_sweep_recent | 1 if recent sweep | {0, 1} | SSL swept |
| near_liquidity | 1 if within 1% of a level | {0, 1} | Near liquidity |
```python
def calculate_liquidity_features(df, lookback=20):
    features = {}

    # BSL (Buy Side Liquidity): resting liquidity above swing highs
    bsl_levels = find_liquidity_levels(df, 'high', lookback)
    features['bsl_distance'] = (bsl_levels['nearest'] - df['close']) / df['close']
    features['bsl_density'] = bsl_levels['density']
    features['bsl_strength'] = bsl_levels['strength']

    # SSL (Sell Side Liquidity): resting liquidity below swing lows
    ssl_levels = find_liquidity_levels(df, 'low', lookback)
    features['ssl_distance'] = (df['close'] - ssl_levels['nearest']) / df['close']
    features['ssl_density'] = ssl_levels['density']
    features['ssl_strength'] = ssl_levels['strength']

    # Sweeps
    sweeps = detect_liquidity_sweeps(df, window=30)
    features['liquidity_grab_count'] = len(sweeps)
    features['bsl_sweep_recent'] = int(any(s['type'] == 'bsl' for s in sweeps[-5:]))
    features['ssl_sweep_recent'] = int(any(s['type'] == 'ssl' for s in sweeps[-5:]))

    # Proximity
    features['near_liquidity'] = (
        (features['bsl_distance'] < 0.01) | (features['ssl_distance'] < 0.01)
    ).astype(int)

    return features
```

## Microstructure Features (8)

Note: requires granular volume data or tick data.

| Feature | Calculation | Description |
|---|---|---|
| volume_delta | `buy_volume - sell_volume` | Volume delta |
| cumulative_volume_delta | Cumulative CVD | CVD |
| cvd_slope | `cvd.diff(5) / 5` | CVD trend |
| tick_imbalance | `(upticks - downticks) / total_ticks` | Tick imbalance |
| large_orders_count | Count of large orders | Institutional activity |
| order_flow_imbalance | Buy/sell ratio | -1 to +1 |
| poc_distance | Distance to Point of Control | Distance to POC |
| hvn_proximity | Distance to High Volume Node | High-volume zone |
```python
def calculate_microstructure_features(df):
    """
    Requires extended data: buy_volume, sell_volume, tick data.
    """
    features = {}

    if 'buy_volume' in df.columns and 'sell_volume' in df.columns:
        features['volume_delta'] = df['buy_volume'] - df['sell_volume']
        features['cumulative_volume_delta'] = features['volume_delta'].cumsum()
        features['cvd_slope'] = features['cumulative_volume_delta'].diff(5) / 5

        total_volume = df['buy_volume'] + df['sell_volume']
        features['order_flow_imbalance'] = features['volume_delta'] / (total_volume + 1e-8)

        # Large orders: volume above twice its 20-bar average
        threshold = df['volume'].rolling(20).mean() * 2
        features['large_orders_count'] = (df['volume'] > threshold).rolling(30).sum()

    # Volume profile (helper provided by the volume-profile module)
    volume_profile = calculate_volume_profile(df, bins=50)
    features['poc_distance'] = (df['close'] - volume_profile['poc']) / df['close']

    return features
```

## Model Targets

### Target 1: AMD Phase (AMDDetector)

```python
TARGET_AMD_PHASE = {
    0: 'neutral',
    1: 'accumulation',
    2: 'manipulation',
    3: 'distribution'
}

def label_amd_phase(df, i, forward_window=20):
    """
    See ESTRATEGIA-AMD-COMPLETA.md for the labeling rules.
    """
    # Full implementation lives in the AMD document
    pass
```

### Target 2: Delta High/Low (RangePredictor)

```python
# Regression targets
TARGETS_RANGE = {
    'delta_high_15m': float,   # Continuous prediction
    'delta_low_15m': float,
    'delta_high_1h': float,
    'delta_low_1h': float,

    # Classification targets (bins)
    'bin_high_15m': int,      # 0-3
    'bin_low_15m': int,
    'bin_high_1h': int,
    'bin_low_1h': int
}
```

```python
def calculate_range_targets(df, horizons=None):
    horizons = horizons or {'15m': 3, '1h': 12}
    targets = {}
    atr = calculate_atr(df, 14)

    for name, periods in horizons.items():
        # Delta high: how far above the current close the future high reaches
        targets[f'delta_high_{name}'] = (
            df['high'].rolling(periods).max().shift(-periods) - df['close']
        ) / df['close']

        # Delta low: how far below the current close the future low reaches
        targets[f'delta_low_{name}'] = (
            df['close'] - df['low'].rolling(periods).min().shift(-periods)
        ) / df['close']

        # Bins (delta normalized by ATR)
        def to_bin(delta_series):
            ratio = delta_series / atr
            bins = pd.cut(
                ratio,
                bins=[-np.inf, 0.3, 0.7, 1.2, np.inf],
                labels=[0, 1, 2, 3]
            )
            return bins.astype(float)

        targets[f'bin_high_{name}'] = to_bin(targets[f'delta_high_{name}'])
        targets[f'bin_low_{name}'] = to_bin(targets[f'delta_low_{name}'])

    return pd.DataFrame(targets)
```

### Target 3: TP vs SL (TPSLClassifier)

```python
TARGETS_TPSL = {
    'tp_first_15m_rr_2_1': int,  # 0 or 1
    'tp_first_15m_rr_3_1': int,
    'tp_first_1h_rr_2_1': int,
    'tp_first_1h_rr_3_1': int
}
```

```python
def calculate_tpsl_targets(df, rr_configs):
    """
    Simulates whether TP is reached before SL (long-side entries).
    """
    targets = {}
    atr = calculate_atr(df, 14)

    for rr in rr_configs:
        sl_dist = atr * rr['sl_atr_multiple']
        tp_dist = atr * rr['tp_atr_multiple']

        def check_tp_first(i, horizon_bars):
            if i + horizon_bars >= len(df):
                return np.nan

            entry_price = df['close'].iloc[i]
            sl_price = entry_price - sl_dist.iloc[i]
            tp_price = entry_price + tp_dist.iloc[i]

            future = df.iloc[i+1:i+horizon_bars+1]

            for _, row in future.iterrows():
                if row['low'] <= sl_price:
                    return 0  # SL hit first
                elif row['high'] >= tp_price:
                    return 1  # TP hit first

            return np.nan  # Neither hit

        for horizon_name, horizon_bars in [('15m', 3), ('1h', 12)]:
            target_name = f'tp_first_{horizon_name}_{rr["name"]}'
            targets[target_name] = [
                check_tp_first(i, horizon_bars) for i in range(len(df))
            ]

    return pd.DataFrame(targets)
```

### Target 4: Liquidity Sweep (LiquidityHunter)

```python
TARGETS_LIQUIDITY = {
    'bsl_sweep': int,      # 0 or 1
    'ssl_sweep': int,
    'any_sweep': int,
    'sweep_timing': int    # Bars until sweep
}
```

```python
def label_liquidity_sweep(df, i, forward_window=10):
    """
    Labels whether a liquidity sweep occurs within the forward window.
    """
    if i + forward_window >= len(df):
        return {'bsl_sweep': np.nan, 'ssl_sweep': np.nan,
                'any_sweep': np.nan, 'sweep_timing': np.nan}

    swing_high = df['high'].iloc[max(0, i-20):i].max()
    swing_low = df['low'].iloc[max(0, i-20):i].min()

    future = df.iloc[i:i+forward_window]

    # BSL sweep (sweep of the highs, with a 0.5% buffer)
    bsl_hits = future['high'] >= swing_high * 1.005
    bsl_swept = bsl_hits.any()

    # SSL sweep (sweep of the lows)
    ssl_hits = future['low'] <= swing_low * 0.995
    ssl_swept = ssl_hits.any()

    # Timing in bars from i: position of the first True in the mask
    if bsl_swept:
        sweep_timing = int(bsl_hits.values.argmax())
    elif ssl_swept:
        sweep_timing = int(ssl_hits.values.argmax())
    else:
        sweep_timing = np.nan

    return {
        'bsl_sweep': 1 if bsl_swept else 0,
        'ssl_sweep': 1 if ssl_swept else 0,
        'any_sweep': 1 if (bsl_swept or ssl_swept) else 0,
        'sweep_timing': sweep_timing
    }
```

### Target 5: Order Flow (OrderFlowAnalyzer)

```python
TARGETS_ORDER_FLOW = {
    'flow_type': int,          # 0=neutral, 1=accumulation, 2=distribution
    'institutional_activity': float  # 0-1 score
}
```

```python
def label_order_flow(df, i, forward_window=50):
    """
    Based on CVD and large orders.
    """
    if ('cumulative_volume_delta' not in df.columns
            or i + forward_window >= len(df)):
        return {'flow_type': 0, 'institutional_activity': 0.0}

    current_cvd = df['cumulative_volume_delta'].iloc[i]
    future_cvd = df['cumulative_volume_delta'].iloc[i + forward_window]

    cvd_change = future_cvd - current_cvd

    # Large orders in the window
    large_orders = df['large_orders_count'].iloc[i:i+forward_window].sum()

    if cvd_change > 0 and large_orders > 5:
        flow_type = 1  # accumulation
    elif cvd_change < 0 and large_orders > 5:
        flow_type = 2  # distribution
    else:
        flow_type = 0  # neutral

    institutional_activity = min(1.0, large_orders / 10)

    return {
        'flow_type': flow_type,
        'institutional_activity': institutional_activity
    }
```

## Feature Engineering Pipeline

### Full Pipeline

```python
class FeatureEngineeringPipeline:
    """
    End-to-end feature engineering pipeline.
    """

    def __init__(self, config=None):
        self.config = config or {}
        self.scalers = {}

    def transform(self, df):
        """
        Transforms raw OHLCV into the full feature set.
        """
        features = pd.DataFrame(index=df.index)

        # 1. Base features
        print("Extracting base features...")
        base = self._extract_base_features(df)
        features = pd.concat([features, base], axis=1)

        # 2. AMD features
        print("Extracting AMD features...")
        amd = self._extract_amd_features(df)
        features = pd.concat([features, amd], axis=1)

        # 3. ICT features
        print("Extracting ICT features...")
        ict = self._extract_ict_features(df)
        features = pd.concat([features, ict], axis=1)

        # 4. SMC features
        print("Extracting SMC features...")
        smc = self._extract_smc_features(df)
        features = pd.concat([features, smc], axis=1)

        # 5. Liquidity features
        print("Extracting liquidity features...")
        liquidity = self._extract_liquidity_features(df)
        features = pd.concat([features, liquidity], axis=1)

        # 6. Microstructure (if available)
        if 'buy_volume' in df.columns:
            print("Extracting microstructure features...")
            micro = self._extract_microstructure_features(df)
            features = pd.concat([features, micro], axis=1)

        # 7. Scaling
        print("Scaling features...")
        features_scaled = self._scale_features(features)

        # 8. Handle missing values
        features_scaled = features_scaled.ffill().fillna(0)

        return features_scaled

    def _extract_base_features(self, df):
        """Extracts the 21 base features."""
        features = {}

        # Volatility
        features.update(calculate_volatility_features(df))

        # Momentum
        features.update(calculate_momentum_features(df))

        # Moving averages
        features.update(calculate_ma_features(df))

        return pd.DataFrame(features)

    def _scale_features(self, features):
        """Scales features with RobustScaler."""
        from sklearn.preprocessing import RobustScaler

        if not self.scalers:
            # Fit scalers
            for col in features.columns:
                self.scalers[col] = RobustScaler()
                features[col] = self.scalers[col].fit_transform(
                    features[col].values.reshape(-1, 1)
                )
        else:
            # Transform with fitted scalers
            for col in features.columns:
                if col in self.scalers:
                    features[col] = self.scalers[col].transform(
                        features[col].values.reshape(-1, 1)
                    )

        return features
```

### Pipeline Usage

```python
# Initialize
pipeline = FeatureEngineeringPipeline()

# Transform data
df_raw = load_ohlcv_data('BTCUSDT', '5m')
features = pipeline.transform(df_raw)

print(f"Features shape: {features.shape}")
print(f"Features: {features.columns.tolist()}")

# Features ready for ML models
X = features.values
```

## Technical Considerations

### 1. Look-Ahead Bias Prevention

IMPORTANT: never use future data to compute features.

```python
# ✅ CORRECT
sma_20 = df['close'].rolling(20).mean()

# ❌ INCORRECT
sma_20 = df['close'].rolling(20, center=True).mean()  # Uses future data!
```
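A cheap way to catch such bugs automatically (a sketch, not part of the platform's test suite): recompute the feature on a truncated copy of the data and assert the overlapping values are unchanged. A causal feature passes; a centered window does not.

```python
import pandas as pd

def assert_no_lookahead(feature_fn, df, cut=50):
    """If a feature uses only past data, computing it on a truncated frame
    must reproduce the same values on the shared prefix."""
    full = feature_fn(df)
    truncated = feature_fn(df.iloc[:-cut])
    pd.testing.assert_series_equal(full.iloc[:-cut], truncated)

df = pd.DataFrame({'close': range(200)}, dtype=float)

# Causal rolling mean passes
assert_no_lookahead(lambda d: d['close'].rolling(20).mean(), df)

# Centered rolling window leaks future bars and fails the same check
try:
    assert_no_lookahead(lambda d: d['close'].rolling(20, center=True).mean(), df)
    raise SystemExit("centered window went undetected")
except AssertionError:
    pass  # expected
```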

### 2. Handling Missing Values

```python
def handle_missing(features):
    """
    Imputation strategy.
    """
    # 1. Forward fill (carry the last known value)
    features = features.ffill()

    # 2. Any NaNs left at the start become 0
    features = features.fillna(0)

    # 3. Alternative: use the median
    # features = features.fillna(features.median())

    return features
```

### 3. Feature Scaling

```python
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

# Price-based features → RobustScaler (handles outliers)
price_scaler = RobustScaler()

# Indicators → StandardScaler
indicator_scaler = StandardScaler()

# Ratios/percentages → MinMaxScaler
ratio_scaler = MinMaxScaler(feature_range=(0, 1))
```
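The three scalers above can be applied per column group in one step with scikit-learn's `ColumnTransformer`. A sketch (the column groupings here are illustrative placeholders, to be mapped onto the catalog's actual features):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

# Illustrative column groups
price_cols = ['sma_20', 'atr_14']
indicator_cols = ['rsi_14']
ratio_cols = ['close_position']

scaler = ColumnTransformer([
    ('price', RobustScaler(), price_cols),
    ('indicator', StandardScaler(), indicator_cols),
    ('ratio', MinMaxScaler(feature_range=(0, 1)), ratio_cols),
])

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=price_cols + indicator_cols + ratio_cols)

# Fit on training data only, then reuse the fitted scaler on validation/test
X_scaled = scaler.fit_transform(X)
print(X_scaled.shape)
```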

### 4. Feature Selection

```python
def select_important_features(X, y, model, feature_names, top_n=50):
    """
    Selects the most important features.
    """
    # Train model
    model.fit(X, y)

    # Get importances (requires a model exposing feature_importances_)
    importance = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    # Select top N
    selected_features = importance.head(top_n)['feature'].tolist()

    return selected_features
```

### 5. Temporal Validation

```python
def temporal_validation_split(df, train_pct=0.7, val_pct=0.15):
    """
    Strict temporal split (no shuffling).
    """
    n = len(df)
    train_end = int(n * train_pct)
    val_end = int(n * (train_pct + val_pct))

    df_train = df.iloc[:train_end]
    df_val = df.iloc[train_end:val_end]
    df_test = df.iloc[val_end:]

    # Verify there is no overlap
    assert df_train.index[-1] < df_val.index[0]
    assert df_val.index[-1] < df_test.index[0]

    return df_train, df_val, df_test
```
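A single split can also be generalized to walk-forward validation; scikit-learn's `TimeSeriesSplit` produces expanding-window folds that preserve temporal order (shown here as an option, not the platform's mandated procedure):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Each fold trains on an expanding past window and tests on the next block
    assert train_idx.max() < test_idx.min()  # temporal order preserved
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```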

## Dimension Summary

| Category | Features | Models |
|---|---|---|
| Base Technical | 21 | All |
| AMD | 25 | AMD, Range, TPSL |
| ICT | 15 | Range, TPSL |
| SMC | 12 | Range, TPSL |
| Liquidity | 10 | Liquidity, TPSL |
| Microstructure | 8 | OrderFlow |
| **TOTAL** | **91 features** | - |

**Document generated**: 2025-12-05 | **Next review**: 2025-Q1 | **Contact**: ml-engineering@trading.ai