# Feature and Target Catalog - Machine Learning

**Version:** 1.0.0
**Date:** 2025-12-05
**Module:** OQI-006-ml-signals
**Author:** Trading Strategist - OrbiQuant IA

---

## Table of Contents

1. [Introduction](#introduction)
2. [Base Features (21)](#base-features-21)
3. [AMD Features (25)](#amd-features-25)
4. [ICT Features (15)](#ict-features-15)
5. [SMC Features (12)](#smc-features-12)
6. [Liquidity Features (10)](#liquidity-features-10)
7. [Microstructure Features (8)](#microstructure-features-8)
8. [Model Targets](#model-targets)
9. [Feature Engineering Pipeline](#feature-engineering-pipeline)
10. [Technical Considerations](#technical-considerations)

---

## Introduction

This document defines the complete catalog of features (input variables) and targets (prediction variables) used by the OrbiQuant IA ML models.

### Total Dimensions

| Category | Features | Models that use them |
|----------|----------|----------------------|
| **Base Technical** | 21 | All |
| **AMD** | 25 | AMDDetector, Range, TPSL |
| **ICT** | 15 | Range, TPSL, Orchestrator |
| **SMC** | 12 | Range, TPSL, Orchestrator |
| **Liquidity** | 10 | LiquidityHunter, TPSL |
| **Microstructure** | 8 | OrderFlow (optional) |
| **Total** | 91 features | - |

---

## Base Features (21)

### Category: Volatility (8)

| Feature | Formula | Range | Description |
|---------|---------|-------|-------------|
| `volatility_5` | `close.pct_change().rolling(5).std()` | [0, ∞) | 5-period volatility |
| `volatility_10` | `close.pct_change().rolling(10).std()` | [0, ∞) | 10-period volatility |
| `volatility_20` | `close.pct_change().rolling(20).std()` | [0, ∞) | 20-period volatility |
| `volatility_50` | `close.pct_change().rolling(50).std()` | [0, ∞) | 50-period volatility |
| `atr_5` | `TrueRange.rolling(5).mean()` | [0, ∞) | 5-period Average True Range |
| `atr_10` | `TrueRange.rolling(10).mean()` | [0, ∞) | 10-period Average True Range |
| `atr_14` | `TrueRange.rolling(14).mean()` | [0, ∞) | 14-period Average True Range (standard) |
| `atr_ratio` | `atr_14 / atr_14.rolling(50).mean()` | [0, ∞) | Current ATR vs its average |
```python
import numpy as np
import pandas as pd

def calculate_volatility_features(df):
    features = {}
    for period in [5, 10, 20, 50]:
        features[f'volatility_{period}'] = df['close'].pct_change().rolling(period).std()

    # ATR: True Range is the largest of the three classic components
    high_low = df['high'] - df['low']
    high_close = np.abs(df['high'] - df['close'].shift())
    low_close = np.abs(df['low'] - df['close'].shift())
    true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)

    for period in [5, 10, 14]:
        features[f'atr_{period}'] = true_range.rolling(period).mean()

    features['atr_ratio'] = features['atr_14'] / features['atr_14'].rolling(50).mean()
    return features
```

### Category: Momentum (6)

| Feature | Formula | Range | Description |
|---------|---------|-------|-------------|
| `momentum_5` | `close - close.shift(5)` | (-∞, ∞) | 5-period momentum |
| `momentum_10` | `close - close.shift(10)` | (-∞, ∞) | 10-period momentum |
| `momentum_20` | `close - close.shift(20)` | (-∞, ∞) | 20-period momentum |
| `roc_5` | `(close / close.shift(5) - 1) * 100` | (-100, ∞) | 5-period Rate of Change |
| `roc_10` | `(close / close.shift(10) - 1) * 100` | (-100, ∞) | 10-period Rate of Change |
| `rsi_14` | See RSI formula | [0, 100] | Relative Strength Index |

```python
def calculate_momentum_features(df):
    features = {}

    # Momentum and Rate of Change
    for period in [5, 10, 20]:
        features[f'momentum_{period}'] = df['close'] - df['close'].shift(period)
    for period in [5, 10]:
        features[f'roc_{period}'] = (df['close'] / df['close'].shift(period) - 1) * 100

    # RSI (simple-moving-average approximation of Wilder's smoothing)
    delta = df['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    features['rsi_14'] = 100 - (100 / (1 + rs))
    return features
```
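The per-category helpers in this section each return a plain dict of Series. A minimal sketch of assembling such dicts into one feature frame; `build_feature_frame` and the `momentum_only` stand-in are illustrative, not part of the catalog:

```python
import numpy as np
import pandas as pd

def build_feature_frame(df, helpers):
    """Concatenate the dicts of Series returned by per-category helpers."""
    frames = [pd.DataFrame(h(df), index=df.index) for h in helpers]
    return pd.concat(frames, axis=1)

# Illustrative stand-in for calculate_momentum_features and friends
def momentum_only(df):
    return {f'momentum_{p}': df['close'] - df['close'].shift(p) for p in (5, 10, 20)}

rng = np.random.default_rng(0)
df = pd.DataFrame({'close': pd.Series(100 + rng.normal(0, 1, 60).cumsum())})
features = build_feature_frame(df, [momentum_only])
```

Keeping helpers as dict-returning functions makes each category independently testable before concatenation.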
### Category: Moving Averages (7)

| Feature | Formula | Range | Description |
|---------|---------|-------|-------------|
| `sma_10` | `close.rolling(10).mean()` | [0, ∞) | 10-period Simple Moving Average |
| `sma_20` | `close.rolling(20).mean()` | [0, ∞) | 20-period Simple Moving Average |
| `sma_50` | `close.rolling(50).mean()` | [0, ∞) | 50-period Simple Moving Average |
| `sma_ratio_10` | `close / sma_10` | [0, ∞) | Price/SMA10 ratio |
| `sma_ratio_20` | `close / sma_20` | [0, ∞) | Price/SMA20 ratio |
| `sma_ratio_50` | `close / sma_50` | [0, ∞) | Price/SMA50 ratio |
| `sma_slope_20` | `sma_20.diff(5) / 5` | (-∞, ∞) | SMA20 slope |

```python
def calculate_ma_features(df):
    features = {}
    for period in [10, 20, 50]:
        features[f'sma_{period}'] = df['close'].rolling(period).mean()
        features[f'sma_ratio_{period}'] = df['close'] / features[f'sma_{period}']
    features['sma_slope_20'] = features['sma_20'].diff(5) / 5
    return features
```

---

## AMD Features (25)

### Category: Price Action (10)

| Feature | Calculation | Range | Use |
|---------|-------------|-------|-----|
| `range_ratio` | `(high - low) / (high - low).rolling(20).mean()` | [0, ∞) | Range compression |
| `range_ma` | `(high - low).rolling(20).mean()` | [0, ∞) | Average range |
| `hl_range_pct` | `(high - low) / close` | [0, 1] | Range as % of price |
| `body_size` | `abs(close - open) / (high - low)` | [0, 1] | Candle body size |
| `upper_wick` | `(high - max(close, open)) / (high - low)` | [0, 1] | Upper wick |
| `lower_wick` | `(min(close, open) - low) / (high - low)` | [0, 1] | Lower wick |
| `buying_pressure` | `(close - low) / (high - low)` | [0, 1] | Buying pressure |
| `selling_pressure` | `(high - close) / (high - low)` | [0, 1] | Selling pressure |
| `close_position` | `(close - low) / (high - low)` | [0, 1] | Position of the close |
| `range_expansion` | `(high - low) / (high - low).shift(1)` | [0, ∞) | Range expansion |
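Unlike the other categories, the Price Action table has no accompanying code. A sketch of a subset of these formulas, assuming a small epsilon to guard against zero-range (doji) bars; the epsilon is an assumption, not in the table:

```python
import pandas as pd

def calculate_price_action_features(df, eps=1e-8):
    """Candle-anatomy subset of the table above; eps guards zero-range (doji) bars."""
    rng = df['high'] - df['low'] + eps
    f = {}
    f['hl_range_pct'] = (df['high'] - df['low']) / df['close']
    f['body_size'] = (df['close'] - df['open']).abs() / rng
    f['upper_wick'] = (df['high'] - df[['close', 'open']].max(axis=1)) / rng
    f['lower_wick'] = (df[['close', 'open']].min(axis=1) - df['low']) / rng
    f['buying_pressure'] = (df['close'] - df['low']) / rng
    f['selling_pressure'] = (df['high'] - df['close']) / rng
    f['close_position'] = (df['close'] - df['low']) / rng
    f['range_expansion'] = (df['high'] - df['low']) / (df['high'] - df['low']).shift(1)
    return f

# One bullish candle: body 1/3 of the range, one-point wicks on each side
candle = pd.DataFrame({'open': [10.0], 'high': [12.0], 'low': [9.0], 'close': [11.0]})
pa = calculate_price_action_features(candle)
```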
### Category: Volume (8)

| Feature | Calculation | Description |
|---------|-------------|-------------|
| `volume_ratio` | `volume / volume.rolling(20).mean()` | Volume vs average |
| `volume_trend` | `volume.rolling(10).mean() - volume.rolling(30).mean()` | Volume trend |
| `volume_ma` | `volume.rolling(20).mean()` | Average volume |
| `volume_spike_count` | `(volume > volume_ma * 2).rolling(30).sum()` | Recent spikes |
| `obv` | See OBV calculation | On-Balance Volume |
| `obv_slope` | `obv.diff(5) / 5` | OBV trend |
| `vwap_distance` | `(close - vwap) / close` | Distance to VWAP |
| `volume_on_up` | See calculation | Volume on up bars |

```python
def calculate_volume_features(df):
    features = {}
    features['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    features['volume_trend'] = df['volume'].rolling(10).mean() - df['volume'].rolling(30).mean()
    features['volume_ma'] = df['volume'].rolling(20).mean()
    features['volume_spike_count'] = (df['volume'] > features['volume_ma'] * 2).rolling(30).sum()

    # OBV (sign of the close-to-close change; flat closes contribute 0)
    obv = (df['volume'] * np.sign(df['close'].diff().fillna(0))).cumsum()
    features['obv'] = obv
    features['obv_slope'] = obv.diff(5) / 5

    # VWAP
    vwap = (df['close'] * df['volume']).cumsum() / df['volume'].cumsum()
    features['vwap_distance'] = (df['close'] - vwap) / df['close']

    # Volume on up bars (one common definition: share of recent volume on rising closes)
    up = (df['close'] > df['close'].shift(1)).astype(int)
    features['volume_on_up'] = (df['volume'] * up).rolling(20).sum() / df['volume'].rolling(20).sum()
    return features
```

### Category: Market Structure (7)

| Feature | Calculation | Use |
|---------|-------------|-----|
| `higher_highs_count` | `(high > high.shift(1)).rolling(10).sum()` | HH count |
| `higher_lows_count` | `(low > low.shift(1)).rolling(10).sum()` | HL count |
| `lower_highs_count` | `(high < high.shift(1)).rolling(10).sum()` | LH count |
| `lower_lows_count` | `(low < low.shift(1)).rolling(10).sum()` | LL count |
| `swing_high_distance` | `(swing_high_20 - close) / close` | Distance to swing high |
| `swing_low_distance` | `(close - swing_low_20) / close` | Distance to swing low |
| `market_structure_score` | See calculation | Structure score |
```python
def calculate_market_structure_features(df):
    features = {}
    features['higher_highs_count'] = (df['high'] > df['high'].shift(1)).rolling(10).sum()
    features['higher_lows_count'] = (df['low'] > df['low'].shift(1)).rolling(10).sum()
    features['lower_highs_count'] = (df['high'] < df['high'].shift(1)).rolling(10).sum()
    features['lower_lows_count'] = (df['low'] < df['low'].shift(1)).rolling(10).sum()

    swing_high = df['high'].rolling(20).max()
    swing_low = df['low'].rolling(20).min()
    features['swing_high_distance'] = (swing_high - df['close']) / df['close']
    features['swing_low_distance'] = (df['close'] - swing_low) / df['close']

    # Market structure score (-1 bearish, +1 bullish)
    bullish_score = (features['higher_highs_count'] + features['higher_lows_count']) / 20
    bearish_score = (features['lower_highs_count'] + features['lower_lows_count']) / 20
    features['market_structure_score'] = bullish_score - bearish_score
    return features
```

---

## ICT Features (15)

### Category: OTE & Fibonacci (5)

| Feature | Calculation | Range | Description |
|---------|-------------|-------|-------------|
| `ote_position` | `(close - swing_low) / (swing_high - swing_low)` | [0, 1] | Position within the range |
| `in_discount_zone` | `1 if ote_position < 0.38 else 0` | {0, 1} | In the discount zone |
| `in_premium_zone` | `1 if ote_position > 0.62 else 0` | {0, 1} | In the premium zone |
| `in_ote_buy_zone` | `1 if 0.62 <= ote_position <= 0.79 else 0` | {0, 1} | In the buy-side OTE |
| `fib_distance_50` | `abs(ote_position - 0.5)` | [0, 0.5] | Distance to equilibrium |

### Category: Killzones & Timing (5)

| Feature | Calculation | Description |
|---------|-------------|-------------|
| `is_london_kz` | Based on EST hour | London killzone |
| `is_ny_kz` | Based on EST hour | NY killzone |
| `is_asian_kz` | Based on EST hour | Asian killzone |
| `session_strength` | 0-1 by killzone | Session strength |
| `session_overlap` | Overlap detection | London/NY overlap |
```python
def calculate_ict_features(df):
    features = {}

    # OTE position within the 50-bar swing range
    swing_high = df['high'].rolling(50).max()
    swing_low = df['low'].rolling(50).min()
    range_size = swing_high - swing_low
    features['ote_position'] = (df['close'] - swing_low) / (range_size + 1e-8)
    features['in_discount_zone'] = (features['ote_position'] < 0.38).astype(int)
    features['in_premium_zone'] = (features['ote_position'] > 0.62).astype(int)
    features['in_ote_buy_zone'] = (
        (features['ote_position'] >= 0.62) & (features['ote_position'] <= 0.79)
    ).astype(int)
    features['fib_distance_50'] = np.abs(features['ote_position'] - 0.5)

    # Killzones (requires a tz-aware DatetimeIndex)
    hour_est = pd.Series(df.index.tz_convert('America/New_York').hour, index=df.index)
    features['is_london_kz'] = ((hour_est >= 2) & (hour_est < 5)).astype(int)
    features['is_ny_kz'] = ((hour_est >= 8) & (hour_est < 11)).astype(int)
    features['is_asian_kz'] = (hour_est >= 20).astype(int)  # 20:00 EST to midnight

    # Session strength (features is a plain dict, so map with np.select rather than .loc)
    conditions = [
        features['is_ny_kz'] == 1,
        features['is_london_kz'] == 1,
        features['is_asian_kz'] == 1,
    ]
    features['session_strength'] = pd.Series(
        np.select(conditions, [1.0, 0.9, 0.3], default=0.1), index=df.index
    )

    # Session overlap (London close + NY open)
    features['session_overlap'] = ((hour_est >= 10) & (hour_est < 12)).astype(int)
    return features
```

### Category: Ranges (5)

| Feature | Calculation | Notes |
|---------|-------------|-------|
| `weekly_range_position` | Position within the weekly range | 0-1 |
| `daily_range_position` | Position within the daily range | 0-1 |
| `weekly_range_size` | Weekly high - low | Absolute |
| `daily_range_size` | Daily high - low | Absolute |
| `range_expansion_daily` | Current range / average range | >1 = expansion |

---

## SMC Features (12)

### Category: Structure Breaks (6)

| Feature | Calculation | Use |
|---------|-------------|-----|
| `choch_bullish_count` | Count over a 30-bar window | Bullish CHoCHs |
| `choch_bearish_count` | Count over a 30-bar window | Bearish CHoCHs |
| `bos_bullish_count` | Count over a 30-bar window | Bullish BOS |
| `bos_bearish_count` | Count over a 30-bar window | Bearish BOS |
| `choch_recency` | Bars since last CHoCH | 0 = very recent |
| `bos_recency` | Bars since last BOS | 0 = very recent |

```python
def calculate_smc_features(df):
    features = {}

    # Detect CHoCHs and BOS (detector helpers provided by the SMC module)
    choch_signals = detect_choch(df, window=20)
    bos_signals = detect_bos(df, window=20)

    # Count by type
    features['choch_bullish_count'] = count_signals_in_window(
        choch_signals, 'bullish_choch', window=30
    )
    features['choch_bearish_count'] = count_signals_in_window(
        choch_signals, 'bearish_choch', window=30
    )
    features['bos_bullish_count'] = count_signals_in_window(
        bos_signals, 'bullish_bos', window=30
    )
    features['bos_bearish_count'] = count_signals_in_window(
        bos_signals, 'bearish_bos', window=30
    )

    # Recency
    features['choch_recency'] = bars_since_last_signal(choch_signals)
    features['bos_recency'] = bars_since_last_signal(bos_signals)
    return features
```

### Category: Displacement & Flow (6)

| Feature | Calculation | Description |
|---------|-------------|-------------|
| `displacement_strength` | Move / ATR | Displacement strength |
| `displacement_direction` | 1 = bullish, -1 = bearish, 0 = neutral | Direction |
| `displacement_recency` | Bars since last displacement | Recency |
| `inducement_count` | Count over a 20-bar window | Detected inducements |
| `inducement_bullish` | Count of bullish inducements | Bullish traps |
| `inducement_bearish` | Count of bearish inducements | Bearish traps |
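The Displacement & Flow features have no reference implementation in this document. A hedged sketch of the first three, assuming "displacement" means a close-to-close move over a few bars measured in ATRs; the `window` and `threshold` values are illustrative, and the inducement counts would additionally need a detector from the SMC module:

```python
import numpy as np
import pandas as pd

def calculate_displacement_features(df, window=3, atr_period=14, threshold=1.5):
    """Displacement = |close move over `window` bars| in ATRs (assumed definition)."""
    high_low = df['high'] - df['low']
    high_close = (df['high'] - df['close'].shift()).abs()
    low_close = (df['low'] - df['close'].shift()).abs()
    atr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1).rolling(atr_period).mean()

    move = df['close'].diff(window)
    strength = move.abs() / (atr + 1e-8)
    is_disp = strength >= threshold
    direction = np.sign(move).where(is_disp, 0.0)

    # Bars since the last bar that cleared the threshold
    idx = pd.Series(np.arange(len(df)), index=df.index, dtype=float)
    recency = idx - idx.where(is_disp).ffill()

    return {'displacement_strength': strength,
            'displacement_direction': direction,
            'displacement_recency': recency}

# Flat tape, then a 10-point jump at bar 20
close = pd.Series([100.0] * 20 + [110.0] * 10)
df = pd.DataFrame({'close': close, 'high': close + 0.5, 'low': close - 0.5})
disp = calculate_displacement_features(df)
```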
---

## Liquidity Features (10)

| Feature | Calculation | Range | Description |
|---------|-------------|-------|-------------|
| `bsl_distance` | `(bsl_level - close) / close` | [0, ∞) | Distance to BSL |
| `ssl_distance` | `(close - ssl_level) / close` | [0, ∞) | Distance to SSL |
| `bsl_density` | Count of nearby BSL levels | [0, ∞) | BSL density |
| `ssl_density` | Count of nearby SSL levels | [0, ∞) | SSL density |
| `bsl_strength` | Volume at the BSL level | [0, ∞) | BSL strength |
| `ssl_strength` | Volume at the SSL level | [0, ∞) | SSL strength |
| `liquidity_grab_count` | Count of recent sweeps | [0, ∞) | Recent sweeps |
| `bsl_sweep_recent` | 1 if a recent sweep | {0, 1} | BSL swept |
| `ssl_sweep_recent` | 1 if a recent sweep | {0, 1} | SSL swept |
| `near_liquidity` | 1 if within 1% of a level | {0, 1} | Near liquidity |

```python
def calculate_liquidity_features(df, lookback=20):
    features = {}

    # BSL (Buy Side Liquidity) above price
    bsl_levels = find_liquidity_levels(df, 'high', lookback)
    features['bsl_distance'] = (bsl_levels['nearest'] - df['close']) / df['close']
    features['bsl_density'] = bsl_levels['density']
    features['bsl_strength'] = bsl_levels['strength']

    # SSL (Sell Side Liquidity) below price
    ssl_levels = find_liquidity_levels(df, 'low', lookback)
    features['ssl_distance'] = (df['close'] - ssl_levels['nearest']) / df['close']
    features['ssl_density'] = ssl_levels['density']
    features['ssl_strength'] = ssl_levels['strength']

    # Sweeps
    sweeps = detect_liquidity_sweeps(df, window=30)
    features['liquidity_grab_count'] = len(sweeps)
    features['bsl_sweep_recent'] = int(any(s['type'] == 'bsl' for s in sweeps[-5:]))
    features['ssl_sweep_recent'] = int(any(s['type'] == 'ssl' for s in sweeps[-5:]))

    # Proximity
    features['near_liquidity'] = (
        (features['bsl_distance'] < 0.01) | (features['ssl_distance'] < 0.01)
    ).astype(int)
    return features
```

---

## Microstructure Features (8)

**Note:** Requires granular volume data or tick data.

| Feature | Calculation | Description |
|---------|-------------|-------------|
| `volume_delta` | `buy_volume - sell_volume` | Volume delta |
| `cumulative_volume_delta` | Accumulated CVD | CVD |
| `cvd_slope` | `cvd.diff(5) / 5` | CVD trend |
| `tick_imbalance` | `(upticks - downticks) / total_ticks` | Tick imbalance |
| `large_orders_count` | Count of large orders | Institutional activity |
| `order_flow_imbalance` | Buy/sell ratio | -1 to +1 |
| `poc_distance` | Distance to the Point of Control | Distance to POC |
| `hvn_proximity` | Distance to a High Volume Node | High-volume zone |

```python
def calculate_microstructure_features(df):
    """Requires extended data: buy_volume, sell_volume, tick_data."""
    features = {}

    if 'buy_volume' in df.columns and 'sell_volume' in df.columns:
        features['volume_delta'] = df['buy_volume'] - df['sell_volume']
        features['cumulative_volume_delta'] = features['volume_delta'].cumsum()
        features['cvd_slope'] = features['cumulative_volume_delta'].diff(5) / 5

        total_volume = df['buy_volume'] + df['sell_volume']
        features['order_flow_imbalance'] = features['volume_delta'] / (total_volume + 1e-8)

    # Large orders
    threshold = df['volume'].rolling(20).mean() * 2
    features['large_orders_count'] = (df['volume'] > threshold).rolling(30).sum()

    # Volume profile (helper provided by the order-flow module)
    volume_profile = calculate_volume_profile(df, bins=50)
    features['poc_distance'] = (df['close'] - volume_profile['poc']) / df['close']
    return features
```

---

## Model Targets

### Target 1: AMD Phase (AMDDetector)

```python
TARGET_AMD_PHASE = {
    0: 'neutral',
    1: 'accumulation',
    2: 'manipulation',
    3: 'distribution'
}

def label_amd_phase(df, i, forward_window=20):
    """See the ESTRATEGIA-AMD-COMPLETA.md documentation."""
    # Full implementation in the AMD document
    pass
```

### Target 2: Delta High/Low (RangePredictor)

```python
# Regression targets
TARGETS_RANGE = {
    'delta_high_15m': float,  # Continuous prediction
    'delta_low_15m': float,
    'delta_high_1h': float,
    'delta_low_1h': float,
    # Classification targets (bins)
    'bin_high_15m': int,  # 0-3
    'bin_low_15m': int,
    'bin_high_1h': int,
    'bin_low_1h': int
}

def calculate_range_targets(df, horizons={'15m': 3, '1h': 12}):
    targets = {}
    atr = calculate_atr(df, 14)

    for name, periods in horizons.items():
        # Delta high
        targets[f'delta_high_{name}'] = (
            df['high'].rolling(periods).max().shift(-periods) - df['close']
        ) / df['close']

        # Delta low
        targets[f'delta_low_{name}'] = (
            df['close'] - df['low'].rolling(periods).min().shift(-periods)
        ) / df['close']

        # Bins (volatility normalized by ATR; the deltas are fractions of price,
        # so express ATR as a fraction of price before taking the ratio)
        def to_bin(delta_series):
            ratio = delta_series / (atr / df['close'])
            bins = pd.cut(
                ratio,
                bins=[-np.inf, 0.3, 0.7, 1.2, np.inf],
                labels=[0, 1, 2, 3]
            )
            return bins.astype(float)

        targets[f'bin_high_{name}'] = to_bin(targets[f'delta_high_{name}'])
        targets[f'bin_low_{name}'] = to_bin(targets[f'delta_low_{name}'])

    return pd.DataFrame(targets)
```

### Target 3: TP vs SL (TPSLClassifier)

```python
TARGETS_TPSL = {
    'tp_first_15m_rr_2_1': int,  # 0 or 1
    'tp_first_15m_rr_3_1': int,
    'tp_first_1h_rr_2_1': int,
    'tp_first_1h_rr_3_1': int
}

def calculate_tpsl_targets(df, rr_configs):
    """Simulates whether TP is reached before SL (long side)."""
    targets = {}
    atr = calculate_atr(df, 14)

    for rr in rr_configs:
        sl_dist = atr * rr['sl_atr_multiple']
        tp_dist = atr * rr['tp_atr_multiple']

        def check_tp_first(i, horizon_bars):
            if i + horizon_bars >= len(df):
                return np.nan
            entry_price = df['close'].iloc[i]
            sl_price = entry_price - sl_dist.iloc[i]
            tp_price = entry_price + tp_dist.iloc[i]
            future = df.iloc[i + 1:i + horizon_bars + 1]
            # SL is checked first within a bar: conservative when both are hit
            for _, row in future.iterrows():
                if row['low'] <= sl_price:
                    return 0  # SL hit first
                elif row['high'] >= tp_price:
                    return 1  # TP hit first
            return np.nan  # Neither hit

        for horizon_name, horizon_bars in [('15m', 3), ('1h', 12)]:
            target_name = f'tp_first_{horizon_name}_{rr["name"]}'
            targets[target_name] = [
                check_tp_first(i, horizon_bars) for i in range(len(df))
            ]

    return pd.DataFrame(targets)
```

### Target 4: Liquidity Sweep (LiquidityHunter)

```python
TARGETS_LIQUIDITY = {
    'bsl_sweep': int,  # 0 or 1
    'ssl_sweep': int,
    'any_sweep': int,
    'sweep_timing': int  # Bars until the sweep
}

def label_liquidity_sweep(df, i, forward_window=10):
    """Labels whether a liquidity sweep occurs within the forward window."""
    if i + forward_window >= len(df):
        return {'bsl_sweep': np.nan, 'ssl_sweep': np.nan,
                'any_sweep': np.nan, 'sweep_timing': np.nan}

    swing_high = df['high'].iloc[max(0, i - 20):i].max()
    swing_low = df['low'].iloc[max(0, i - 20):i].min()
    future = df.iloc[i:i + forward_window]

    # BSL sweep (sweep of highs)
    bsl_swept = (future['high'] >= swing_high * 1.005).any()
    # SSL sweep (sweep of lows)
    ssl_swept = (future['low'] <= swing_low * 0.995).any()

    # Timing (bars until the first sweep, not an index label)
    if bsl_swept:
        sweep_timing = int((future['high'] >= swing_high * 1.005).to_numpy().argmax())
    elif ssl_swept:
        sweep_timing = int((future['low'] <= swing_low * 0.995).to_numpy().argmax())
    else:
        sweep_timing = np.nan

    return {
        'bsl_sweep': 1 if bsl_swept else 0,
        'ssl_sweep': 1 if ssl_swept else 0,
        'any_sweep': 1 if (bsl_swept or ssl_swept) else 0,
        'sweep_timing': sweep_timing
    }
```

### Target 5: Order Flow (OrderFlowAnalyzer)

```python
TARGETS_ORDER_FLOW = {
    'flow_type': int,  # 0 = neutral, 1 = accumulation, 2 = distribution
    'institutional_activity': float  # 0-1 score
}

def label_order_flow(df, i, forward_window=50):
    """Based on CVD and large orders."""
    if 'cumulative_volume_delta' not in df.columns or i + forward_window >= len(df):
        return {'flow_type': 0, 'institutional_activity': 0.0}

    current_cvd = df['cumulative_volume_delta'].iloc[i]
    future_cvd = df['cumulative_volume_delta'].iloc[i + forward_window]
    cvd_change = future_cvd - current_cvd

    # Large orders in the window
    large_orders = df['large_orders_count'].iloc[i:i + forward_window].sum()

    if cvd_change > 0 and large_orders > 5:
        flow_type = 1  # accumulation
    elif cvd_change < 0 and large_orders > 5:
        flow_type = 2  # distribution
    else:
        flow_type = 0  # neutral

    institutional_activity = min(1.0, large_orders / 10)
    return {'flow_type': flow_type, 'institutional_activity': institutional_activity}
```

---

## Feature Engineering Pipeline

### Complete Pipeline

```python
class FeatureEngineeringPipeline:
    """Complete feature engineering pipeline."""

    def __init__(self, config=None):
        self.config = config or {}
        self.scalers = {}

    def transform(self, df):
        """Transforms raw OHLCV into the full feature set."""
        features = pd.DataFrame(index=df.index)

        # 1. Base features
        print("Extracting base features...")
        base = self._extract_base_features(df)
        features = pd.concat([features, base], axis=1)

        # 2. AMD features
        print("Extracting AMD features...")
        amd = self._extract_amd_features(df)
        features = pd.concat([features, amd], axis=1)

        # 3. ICT features
        print("Extracting ICT features...")
        ict = self._extract_ict_features(df)
        features = pd.concat([features, ict], axis=1)

        # 4. SMC features
        print("Extracting SMC features...")
        smc = self._extract_smc_features(df)
        features = pd.concat([features, smc], axis=1)

        # 5. Liquidity features
        print("Extracting liquidity features...")
        liquidity = self._extract_liquidity_features(df)
        features = pd.concat([features, liquidity], axis=1)

        # 6. Microstructure (if available)
        if 'buy_volume' in df.columns:
            print("Extracting microstructure features...")
            micro = self._extract_microstructure_features(df)
            features = pd.concat([features, micro], axis=1)

        # 7. Scaling
        print("Scaling features...")
        features_scaled = self._scale_features(features)

        # 8. Handle missing values
        features_scaled = features_scaled.ffill().fillna(0)
        return features_scaled

    def _extract_base_features(self, df):
        """Extracts the base features (21)."""
        features = {}
        # Volatility
        features.update(calculate_volatility_features(df))
        # Momentum
        features.update(calculate_momentum_features(df))
        # Moving averages
        features.update(calculate_ma_features(df))
        return pd.DataFrame(features)

    def _scale_features(self, features):
        """Scales features with a per-column RobustScaler."""
        from sklearn.preprocessing import RobustScaler

        if not self.scalers:
            # Fit scalers
            for col in features.columns:
                self.scalers[col] = RobustScaler()
                features[col] = self.scalers[col].fit_transform(
                    features[col].values.reshape(-1, 1)
                )
        else:
            # Transform with the fitted scalers
            for col in features.columns:
                if col in self.scalers:
                    features[col] = self.scalers[col].transform(
                        features[col].values.reshape(-1, 1)
                    )
        return features
```

### Using the Pipeline

```python
# Initialize
pipeline = FeatureEngineeringPipeline()

# Transform data
df_raw = load_ohlcv_data('BTCUSDT', '5m')
features = pipeline.transform(df_raw)

print(f"Features shape: {features.shape}")
print(f"Features: {features.columns.tolist()}")

# Features ready for the ML models
X = features.values
```

---

## Technical Considerations

### 1. Preventing Look-Ahead Bias

**IMPORTANT:** Never use future data to compute features.

```python
# ✅ CORRECT
sma_20 = df['close'].rolling(20).mean()

# ❌ WRONG
sma_20 = df['close'].rolling(20, center=True).mean()  # Uses future data!
```

### 2. Handling Missing Values

```python
def handle_missing(features):
    """Imputation strategy."""
    # 1. Forward fill (use the last known value)
    features = features.ffill()

    # 2. If NaNs remain at the start, use 0
    features = features.fillna(0)

    # 3. Alternative: use the median
    # features = features.fillna(features.median())
    return features
```
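The look-ahead rule above can be turned into a quick property check: a causal feature recomputed on data truncated at bar *t* must match its full-sample values up to *t*. A minimal sketch; `has_lookahead` is a hypothetical helper, not part of the pipeline:

```python
import numpy as np
import pandas as pd

def has_lookahead(feature_fn, df, t):
    """True if values up to bar t change once future bars are removed."""
    full = feature_fn(df).iloc[: t + 1]
    truncated = feature_fn(df.iloc[: t + 1])
    return not np.allclose(full.fillna(0), truncated.fillna(0))

rng = np.random.default_rng(1)
df = pd.DataFrame({'close': 100 + rng.normal(0, 1, 200).cumsum()})

causal = lambda d: d['close'].rolling(20).mean()                 # past data only
peeking = lambda d: d['close'].rolling(20, center=True).mean()   # uses future bars
```

Running this check over every feature function before training is a cheap guard against bias silently introduced by a refactor.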
### 3. Feature Scaling

```python
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler

# Price-based features → RobustScaler (handles outliers)
price_scaler = RobustScaler()

# Indicators → StandardScaler
indicator_scaler = StandardScaler()

# Ratios/percentages → MinMaxScaler
ratio_scaler = MinMaxScaler(feature_range=(0, 1))
```

### 4. Feature Selection

```python
def select_important_features(X, y, feature_names, model, top_n=50):
    """Selects the most important features."""
    # Train the model
    model.fit(X, y)

    # Get importances
    importance = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    # Select the top N
    selected_features = importance.head(top_n)['feature'].tolist()
    return selected_features
```

### 5. Temporal Validation

```python
def temporal_validation_split(df, train_pct=0.7, val_pct=0.15):
    """Strict temporal split (no shuffling)."""
    n = len(df)
    train_end = int(n * train_pct)
    val_end = int(n * (train_pct + val_pct))

    df_train = df.iloc[:train_end]
    df_val = df.iloc[train_end:val_end]
    df_test = df.iloc[val_end:]

    # Verify there is no overlap
    assert df_train.index[-1] < df_val.index[0]
    assert df_val.index[-1] < df_test.index[0]

    return df_train, df_val, df_test
```

---

## Dimension Summary

| Category | Features | Models |
|----------|----------|--------|
| **Base Technical** | 21 | All |
| **AMD** | 25 | AMD, Range, TPSL |
| **ICT** | 15 | Range, TPSL |
| **SMC** | 12 | Range, TPSL |
| **Liquidity** | 10 | Liquidity, TPSL |
| **Microstructure** | 8 | OrderFlow |
| **TOTAL** | **91 features** | - |

---

**Document Generated:** 2025-12-05
**Next Review:** 2025-Q1
**Contact:** ml-engineering@orbiquant.ai