---
id: "EPIC-IAI-007"
title: "EPIC IAI-007: Web Scraping and ETL System"
type: "EPIC"
epic: "IAI-007"
status: "Draft"
project: "inmobiliaria-analytics"
version: "1.0.0"
story_points: 55
created_date: "2026-01-04"
updated_date: "2026-01-04"
---

# EPIC IAI-007: Web Scraping and ETL System

---

## Executive Summary

This EPIC implements the automated collection of real-estate data from multiple portals (Inmuebles24, Vivanuncios, etc.), including anti-detection strategies, data normalization, and an ETL pipeline that feeds the analytics platform.

---

## Objective

Build a robust web scraping system capable of:
1. Extracting property data from protected real-estate portals
2. Avoiding blocks through anti-detection techniques
3. Normalizing and validating data from multiple sources
4. Maintaining efficient incremental updates
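
One possible way to approach the incremental updates in goal 4 is to fingerprint each listing and reprocess only those whose fingerprint changed since the last run. The sketch below is illustrative only: the `Listing` shape, the choice of fields, and the use of an FNV-1a hash are assumptions, not part of this EPIC's definition.

```typescript
// Hypothetical listing shape; field names are assumptions for illustration.
interface Listing {
  sourceId: string; // ID of the listing on the source portal
  price: number;
  areaM2: number;
  title: string;
}

// 32-bit FNV-1a hash over the UTF-16 code units of a string,
// returned as 8 hexadecimal characters.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Fingerprint built from the fields whose change should trigger reprocessing.
function fingerprint(l: Listing): string {
  return fnv1a([l.price, l.areaM2, l.title.trim().toLowerCase()].join("|"));
}

// Returns listings that are new or changed relative to the stored fingerprints
// (keyed by sourceId).
function selectChanged(scraped: Listing[], stored: Map<string, string>): Listing[] {
  return scraped.filter((l) => stored.get(l.sourceId) !== fingerprint(l));
}
```

In a daily run, only the output of `selectChanged` would need to go through normalization and storage; unchanged listings can be skipped.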

---

## Scope

### Included

- Scraping engine built on Playwright/Puppeteer
- Residential proxy management
- Cloudflare bypass and rate limiting
- ETL pipeline for normalization
- Scheduling with Bull Queue
- Monitoring and metrics
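
For the rate-limiting item above, a minimal sketch of per-host request pacing could look like the following. The `PacingPolicy` and `HostPacer` names and the idea of a base delay plus random jitter are assumptions for illustration; actual values would need tuning per portal.

```typescript
// Hypothetical pacing policy; any concrete numbers are placeholders, not tuned values.
interface PacingPolicy {
  baseDelayMs: number; // minimum wait between two requests to the same host
  jitterMs: number;    // upper bound of the random extra wait added on top
}

// Computes the gap before the next request. `random` is injectable so the
// function is deterministic in tests; it must return a value in [0, 1).
function nextDelayMs(policy: PacingPolicy, random: () => number = Math.random): number {
  return policy.baseDelayMs + Math.floor(random() * policy.jitterMs);
}

// Tracks, per host, the earliest time at which the next request may start.
class HostPacer {
  private nextAllowedAt = new Map<string, number>();
  private policy: PacingPolicy;
  private random: () => number;

  constructor(policy: PacingPolicy, random: () => number = Math.random) {
    this.policy = policy;
    this.random = random;
  }

  // Returns how long (ms) the caller should wait before requesting `host`
  // at time `nowMs`, and reserves the slot after that request.
  reserve(host: string, nowMs: number): number {
    const startAt = Math.max(nowMs, this.nextAllowedAt.get(host) ?? 0);
    this.nextAllowedAt.set(host, startAt + nextDelayMs(this.policy, this.random));
    return startAt - nowMs;
  }
}
```

A scraper worker would call `reserve(host, Date.now())` and sleep for the returned number of milliseconds before navigating.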

### Excluded

- Mobile admin app
- Image scraping (phase 2)
- Third-party APIs (Apify, etc.)
- ML-based extraction (future phase)

---

## Target Data Sources

| Source | Priority | Protection | Status |
|--------|----------|------------|--------|
| Inmuebles24 | P1 | Cloudflare | Target |
| Vivanuncios | P1 | Cloudflare | Target |
| Segundamano | P2 | Basic | Backlog |
| Metros Cubicos | P2 | Cloudflare | Backlog |
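
The table above could be mirrored in code as a typed registry, so that the scheduler only picks up sources marked as targets and knows which ones need the stealth browser. This is a sketch under assumptions: the `Source` type and the helper names are hypothetical.

```typescript
type Priority = "P1" | "P2";
type Protection = "Cloudflare" | "Basic";
type Status = "Target" | "Backlog";

interface Source {
  name: string;
  priority: Priority;
  protection: Protection;
  status: Status;
}

// Values copied from the target data sources table.
const SOURCES: Source[] = [
  { name: "Inmuebles24", priority: "P1", protection: "Cloudflare", status: "Target" },
  { name: "Vivanuncios", priority: "P1", protection: "Cloudflare", status: "Target" },
  { name: "Segundamano", priority: "P2", protection: "Basic", status: "Backlog" },
  { name: "Metros Cubicos", priority: "P2", protection: "Cloudflare", status: "Backlog" },
];

// Sources that should currently be scraped.
function activeSources(sources: Source[]): Source[] {
  return sources.filter((s) => s.status === "Target");
}

// Assumption: only Cloudflare-protected sources require the stealth setup.
function needsStealth(source: Source): boolean {
  return source.protection === "Cloudflare";
}
```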

---

## Technical Stack

```yaml
Scraping:
  browser: Playwright
  stealth: playwright-extra-stealth
  fallback: Puppeteer + undetected-chrome

Proxies:
  type: Residential
  rotation: Per session
  provider: Bright Data / IPRoyal

ETL:
  queue: Bull (Redis)
  parser: Cheerio
  geocoding: Google Maps API

Storage:
  raw: S3/MinIO (JSON)
  normalized: PostgreSQL
```
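
The per-session proxy rotation listed in the stack could be sketched as a pool that assigns proxies round-robin but keeps the same proxy for the lifetime of a session. The `SessionProxyPool` class and the proxy URL format are assumptions; a real provider integration would likely differ.

```typescript
// Hands out proxies round-robin, keeping one proxy per session until released.
class SessionProxyPool {
  private assigned = new Map<string, string>();
  private cursor = 0;
  private proxies: string[];

  constructor(proxies: string[]) {
    if (proxies.length === 0) {
      throw new Error("proxy list must not be empty");
    }
    this.proxies = proxies;
  }

  // Returns the proxy bound to `sessionId`, assigning the next one if needed.
  proxyFor(sessionId: string): string {
    const existing = this.assigned.get(sessionId);
    if (existing !== undefined) {
      return existing;
    }
    const proxy = this.proxies[this.cursor % this.proxies.length];
    this.cursor++;
    this.assigned.set(sessionId, proxy);
    return proxy;
  }

  // Ends a session so its next request gets a fresh assignment.
  release(sessionId: string): void {
    this.assigned.delete(sessionId);
  }
}
```

Keeping one IP per browser session avoids the mid-session IP changes that protection layers tend to flag, while `release` allows rotation after a block or at session end.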

---

## High-Level Architecture

```
              +----------------+
              |   Scheduler    |
              |  (Bull Queue)  |
              +-------+--------+
                      |
        +-------------+-------------+
        |                           |
+-------v-------+         +---------v---------+
| Scraper Pool  |         |   ETL Pipeline    |
| (Playwright)  |         |  (Normalization)  |
+-------+-------+         +---------+---------+
        |                           |
        |   +---------------+       |
        +-->|  Proxy Pool   |<------+
            +---------------+
                    |
        +-----------+-----------+
        |                       |
+-------v-------+       +-------v-------+
|  Raw Storage  |       |  PostgreSQL   |
|   (S3/JSON)   |       | (properties)  |
+---------------+       +---------------+
```
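
As a rough illustration of how one job might flow through the components in the diagram, the sketch below wires the stages together through injected functions. All type and function names are hypothetical, and the stages are synchronous here only to keep the example short; real scraping and storage calls would be asynchronous.

```typescript
interface ScrapeJob { source: string; url: string; }
interface RawPage { url: string; html: string; }
interface PropertyRow { url: string; title: string; }

// Each dependency stands in for one box of the diagram.
interface PipelineDeps {
  scrape: (job: ScrapeJob) => RawPage;        // Scraper Pool
  saveRaw: (page: RawPage) => void;           // Raw Storage
  normalize: (page: RawPage) => PropertyRow;  // ETL Pipeline
  saveRow: (row: PropertyRow) => void;        // PostgreSQL
}

// Runs a single job: scrape, persist the raw page, normalize, persist the row.
function runJob(job: ScrapeJob, deps: PipelineDeps): PropertyRow {
  const page = deps.scrape(job);
  deps.saveRaw(page);
  const row = deps.normalize(page);
  deps.saveRow(row);
  return row;
}
```

Storing the raw page before normalizing means a parser fix can be replayed against saved data without scraping the portal again.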

---

## Work Breakdown

### Phase 1: MVP Scraper (2-3 sprints)

| Task | SP | Priority |
|------|----|----------|
| Playwright + stealth setup | 3 | High |
| Basic Inmuebles24 scraper | 8 | High |
| Proxy pool integration | 5 | High |
| Basic normalization | 5 | High |
| Simple job scheduling | 3 | Medium |
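
The "basic normalization" task might start with parsing raw price and area strings into typed values. The input formats handled below (for example `$2,500,000 MXN` or `120 m²`, with commas as thousands separators and MXN as the default currency) are assumptions about what the portals return, not confirmed formats.

```typescript
interface NormalizedPrice {
  amount: number;
  currency: "MXN" | "USD";
}

// Parses strings such as "$2,500,000 MXN" or "USD 150,000".
// Returns null when no usable amount is found.
function parsePrice(raw: string): NormalizedPrice | null {
  const digits = raw.replace(/[^0-9.]/g, ""); // drops "$", commas, letters, spaces
  if (digits === "") {
    return null;
  }
  const amount = Number(digits);
  if (!isFinite(amount)) {
    return null; // e.g. "2.500.000" is not a valid number in this format
  }
  const currency = /USD|US\$/i.test(raw) ? "USD" : "MXN";
  return { amount, currency };
}

// Parses strings such as "120 m²" or "85.5m2" into square meters.
function parseAreaM2(raw: string): number | null {
  const match = /(\d+(?:\.\d+)?)\s*m/i.exec(raw);
  return match ? Number(match[1]) : null;
}
```

Returning `null` rather than throwing lets the pipeline count and report unparseable values instead of failing the whole batch.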

### Phase 2: Multi-source (1-2 sprints)

| Task | SP | Priority |
|------|----|----------|
| Vivanuncios scraper | 5 | High |
| Segundamano scraper | 3 | Medium |
| Cross-source deduplication | 5 | Medium |
| Geocoding integration | 3 | Medium |
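
Cross-source deduplication could begin with a coarse matching key built from geocoded coordinates, price, and area. The sketch below is one simple heuristic under assumptions: the `PropertyRecord` fields and the rounding precision are hypothetical, and a production matcher would probably need fuzzier comparison.

```typescript
interface PropertyRecord {
  source: string;
  lat: number;
  lng: number;
  price: number;
  areaM2: number;
}

// Coarse key: coordinates rounded to 4 decimals (roughly 11 m of latitude),
// plus price and area rounded to whole units.
function dedupKey(p: PropertyRecord): string {
  return [p.lat.toFixed(4), p.lng.toFixed(4), Math.round(p.price), Math.round(p.areaM2)].join(":");
}

// Keeps the first record seen for each key and drops later matches.
function dedupe(records: PropertyRecord[]): PropertyRecord[] {
  const seen = new Set<string>();
  const unique: PropertyRecord[] = [];
  for (const record of records) {
    const key = dedupKey(record);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(record);
    }
  }
  return unique;
}
```

Because the key depends on coordinates, this approach assumes geocoding has already run for every record being compared.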

### Phase 3: Production (1 sprint)

| Task | SP | Priority |
|------|----|----------|
| Monitoring and alerts | 5 | Medium |
| Retry logic + error handling | 3 | Medium |
| Admin dashboard | 5 | Low |
| Documentation | 2 | Low |
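
For the retry logic task, a common pattern is exponential backoff with a capped delay. The following is a minimal sketch rather than the planned implementation; the default values (1 s base, 30 s cap, 3 attempts) and the function names are assumptions.

```typescript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
// Attempt numbering starts at 0.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Runs `task` up to `maxAttempts` times, sleeping between failed attempts.
// `sleep` is injectable so callers can skip real waiting in tests.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffMs(attempt));
      }
    }
  }
  throw lastError;
}
```

A scraper call would be wrapped as `withRetry(() => fetchListing(url))`, where `fetchListing` is a hypothetical function that rejects on a blocked or failed request.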

---

## Cost Estimate

```yaml
Monthly_infrastructure:
  residential_proxies: $50-100 USD
  captcha_solving: $10-20 USD
  geocoding_api: $0-50 USD
  cloud_compute: $50-100 USD

Total: $110-270 USD/month
```
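
As a quick arithmetic check on the monthly total, the low and high ends of each line item can be summed separately. The snippet below uses the four ranges from the estimate; the variable and function names are illustrative.

```typescript
type CostRange = [number, number]; // [low, high] in USD per month

// Ranges taken from the monthly infrastructure estimate.
const monthlyCosts: Record<string, CostRange> = {
  residentialProxies: [50, 100],
  captchaSolving: [10, 20],
  geocodingApi: [0, 50],
  cloudCompute: [50, 100],
};

// Sums the low ends and the high ends independently.
function totalRange(items: Record<string, CostRange>): CostRange {
  let low = 0;
  let high = 0;
  for (const key of Object.keys(items)) {
    low += items[key][0];
    high += items[key][1];
  }
  return [low, high];
}

const [totalLow, totalHigh] = totalRange(monthlyCosts);
console.log(`Total: $${totalLow}-${totalHigh} USD/month`); // low = 110, high = 270
```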

---

## Risks and Mitigations

| Risk | Prob. | Impact | Mitigation |
|------|-------|--------|------------|
| Cloudflare blocking | High | High | Stealth browser, proxies, rate limiting |
| HTML changes | Medium | Medium | Robust selectors, alerts |
| Legal | Low | High | Comply with ToS, add value |
| Cost escalation | Medium | Low | Optimize, limit scope |

---

## EPIC Acceptance Criteria

- [ ] Scraper extracts 10,000+ properties from Inmuebles24
- [ ] Success rate >= 85%
- [ ] Data normalized correctly
- [ ] Zero permanent IP blocks
- [ ] Pipeline runs daily incremental updates
- [ ] Metrics available in the dashboard
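
The quantitative criteria above (10,000+ properties, success rate of at least 85%, zero permanent IP blocks) could be checked automatically from run statistics. The sketch assumes a hypothetical `RunStats` record and defines success rate as succeeded requests over attempted requests, which this EPIC does not specify.

```typescript
// Hypothetical aggregate statistics for a scraping run.
interface RunStats {
  attempted: number;         // listing pages requested
  succeeded: number;         // listing pages extracted successfully
  permanentIpBlocks: number; // proxies permanently blocked during the run
}

// Assumed definition: succeeded / attempted, or 0 when nothing was attempted.
function successRate(stats: RunStats): number {
  return stats.attempted === 0 ? 0 : stats.succeeded / stats.attempted;
}

// Checks only the three measurable criteria; the remaining ones need manual review.
function meetsAcceptance(stats: RunStats): boolean {
  return (
    stats.succeeded >= 10000 &&
    successRate(stats) >= 0.85 &&
    stats.permanentIpBlocks === 0
  );
}
```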

---

## Related Documentation

- [IA-007-WEBSCRAPER.md](../../02-definicion-modulos/IA-007-WEBSCRAPER.md) - Module definition
- [Webscraper_Politics.md](../../00-vision-general/Webscraper_Politics.md) - Anti-blocking policies

---

**EPIC Owner:** Tech Lead
**Created:** 2026-01-04
**Status:** Draft