---
id: "RF-SCR-004"
title: "Scheduling and Job Management"
type: "Functional Requirement"
epic: "IAI-007"
priority: "Medium"
status: "Draft"
project: "inmobiliaria-analytics"
created_date: "2026-01-04"
updated_date: "2026-01-04"
---

# RF-IA-007-004: Scheduling and Job Management

---

## Description

The system must schedule and manage scraping jobs, supporting scheduled, incremental, and on-demand runs, with the ability to pause, resume, and monitor progress.

---

## Justification

Data collection must be automated and efficient. Scheduled jobs keep the data up to date, while incremental syncs conserve resources by processing only what has changed.

---

## Functional Requirements

### RF-004.1: Job Types

| ID | Requirement | Priority |
|----|-------------|----------|
| RF-004.1.1 | The system must support jobs of type "full_scan" | High |
| RF-004.1.2 | The system must support jobs of type "incremental" | High |
| RF-004.1.3 | The system must support jobs of type "targeted" (specific URLs) | Medium |
| RF-004.1.4 | The system must support jobs of type "refresh" (verify active listings) | Medium |

### RF-004.2: Scheduling

| ID | Requirement | Priority |
|----|-------------|----------|
| RF-004.2.1 | The system must allow jobs to be scheduled with cron expressions | High |
| RF-004.2.2 | The system must run jobs on demand via the API | High |
| RF-004.2.3 | The system must respect low-demand hours | Medium |
| RF-004.2.4 | The system must distribute load across workers | Medium |

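
RF-004.2.3 and RF-004.2.4 do not say how low-demand hours are configured or how load is balanced. The sketch below is one possible reading, not the specified design: `OffPeakWindow`, `isOffPeak`, `WorkerLoad`, and `pickLeastLoaded` are hypothetical names, the window is assumed to be a pair of hours in the schedule's timezone, and "distribute load" is assumed to mean "pick the worker with the fewest active jobs".

```typescript
// Hypothetical shape: the requirement does not define how low-demand hours are stored.
interface OffPeakWindow {
  startHour: number; // inclusive, 0-23
  endHour: number;   // exclusive, 0-23
}

// Returns true when `hour` falls inside the window.
// Windows that wrap past midnight (e.g. 22 -> 6) are supported.
function isOffPeak(hour: number, win: OffPeakWindow): boolean {
  if (win.startHour === win.endHour) {
    return false; // treated as an empty window (assumption)
  }
  if (win.startHour < win.endHour) {
    return hour >= win.startHour && hour < win.endHour;
  }
  return hour >= win.startHour || hour < win.endHour;
}

// Hypothetical worker descriptor for RF-004.2.4.
interface WorkerLoad {
  id: string;
  activeJobs: number;
}

// Picks the worker with the fewest active jobs; ties go to the first one listed.
function pickLeastLoaded(workers: WorkerLoad[]): WorkerLoad | undefined {
  let best: WorkerLoad | undefined = undefined;
  for (const worker of workers) {
    if (best === undefined || worker.activeJobs < best.activeJobs) {
      best = worker;
    }
  }
  return best;
}
```

A scheduler could call `isOffPeak` before dequeuing a `full_scan` job and defer the job when it returns `false`. On-demand jobs (RF-004.2.2) would presumably bypass the check.
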
### RF-004.3: Job Lifecycle

| ID | Requirement | Priority |
|----|-------------|----------|
| RF-004.3.1 | The system must allow running jobs to be paused | High |
| RF-004.3.2 | The system must allow paused jobs to be resumed | High |
| RF-004.3.3 | The system must cancel jobs with appropriate cleanup | High |
| RF-004.3.4 | The system must retry failed jobs with backoff | High |

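
RF-004.3.4 asks for retries with backoff, and the acceptance criteria name exponential backoff. A minimal sketch of the delay calculation follows; the base delay, the cap, and the names `BackoffOptions`, `backoffDelayMs`, and `shouldRetry` are assumptions for illustration, since this requirement does not fix any of them.

```typescript
// Hypothetical tuning knobs; the requirement only says "backoff".
interface BackoffOptions {
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound so delays do not grow without limit
}

// Exponential backoff: the delay doubles with each attempt (1-based), capped at maxDelayMs.
function backoffDelayMs(attempt: number, opts: BackoffOptions): number {
  const exponent = Math.max(0, attempt - 1);
  const delay = opts.baseDelayMs * Math.pow(2, exponent);
  return Math.min(delay, opts.maxDelayMs);
}

// A job may be retried while it has attempts left.
// Field names mirror the `retry` block of the ScrapingJob model.
function shouldRetry(retry: { attempts: number; max_attempts: number }): boolean {
  return retry.attempts < retry.max_attempts;
}
```

With `baseDelayMs: 1000` and `maxDelayMs: 60000`, attempts 1 to 4 wait 1 s, 2 s, 4 s, and 8 s. If Bull is used as the queue, its job options also include `attempts` and `backoff` settings, which may make a hand-written version unnecessary.
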
### RF-004.4: Monitoring

| ID | Requirement | Priority |
|----|-------------|----------|
| RF-004.4.1 | The system must report progress in real time | High |
| RF-004.4.2 | The system must record statistics per job | High |
| RF-004.4.3 | The system must raise alerts on failures | High |
| RF-004.4.4 | The system must keep a history of executions | Medium |

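
RF-004.4.2 requires per-job statistics but does not define how they are derived. The sketch below shows one possible calculation. It assumes that the success rate is the share of processed properties among all processed and errored ones, which this document does not confirm; `successRate` and `durationMs` are hypothetical helper names.

```typescript
// Subset of the job's progress counters used for statistics.
interface JobProgress {
  properties_processed: number;
  errors: number;
}

// Assumed definition: processed / (processed + errors), or 0 when nothing was attempted.
function successRate(progress: JobProgress): number {
  const attempted = progress.properties_processed + progress.errors;
  if (attempted === 0) {
    return 0;
  }
  return progress.properties_processed / attempted;
}

// Elapsed wall-clock time between start and completion, in milliseconds.
function durationMs(startedAt: Date, completedAt: Date): number {
  return completedAt.getTime() - startedAt.getTime();
}
```

Whatever definition is chosen, writing it down next to the requirement makes the acceptance criterion about statistics being computed correctly testable.
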
---

## Data Model

```yaml
ScrapingJob:
  id: UUID
  type: enum [full_scan, incremental, targeted, refresh]
  source: string  # inmuebles24, vivanuncios, all
  status: enum [pending, queued, running, paused, completed, failed, cancelled]

  config:
    target_cities: string[]
    property_types: string[]
    max_pages: integer
    max_properties: integer
    delay_ms:
      min: integer
      max: integer

  schedule:
    cron_expression: string (nullable)
    next_run_at: timestamp (nullable)
    timezone: string

  progress:
    pages_scraped: integer
    properties_found: integer
    properties_processed: integer
    errors: integer
    current_page: string

  stats:
    started_at: timestamp
    completed_at: timestamp
    duration_ms: integer
    success_rate: decimal

  retry:
    attempts: integer
    max_attempts: integer
    last_error: string

  created_at: timestamp
  updated_at: timestamp
  created_by: UUID
```

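
The model lists the `status` values but not which moves between them are legal. The sketch below shows one possible transition table. The specific transitions (for example, that a failed job returns to `queued` for a retry, or that a resumed job goes straight back to `running`) are assumptions, and `allowedTransitions` and `canTransition` are hypothetical names.

```typescript
// Status values copied from the ScrapingJob model.
type JobStatus =
  | "pending"
  | "queued"
  | "running"
  | "paused"
  | "completed"
  | "failed"
  | "cancelled";

// Assumed transition table; the requirement only enumerates the states.
const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  pending: ["queued", "cancelled"],
  queued: ["running", "cancelled"],
  running: ["paused", "completed", "failed", "cancelled"],
  paused: ["running", "cancelled"], // resume or cancel
  completed: [],
  failed: ["queued"],               // re-queued for a retry
  cancelled: [],
};

// Returns true when moving a job from `from` to `to` is allowed by the table above.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}
```

Guarding every status update with a check like `canTransition` keeps the pause, resume, and cancel endpoints from acting on jobs that are already finished.
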
---

## Schedule Configuration

```yaml
schedules:
  full_scan:
    inmuebles24:
      cron: "0 2 * * 0"  # Sundays at 2 am
      config:
        max_pages: 100
        cities: [guadalajara, monterrey, cdmx]

    vivanuncios:
      cron: "0 3 * * 0"  # Sundays at 3 am
      config:
        max_pages: 100

  incremental:
    all_sources:
      cron: "0 4 * * *"  # Daily at 4 am
      config:
        max_pages: 20
        only_new: true

  refresh:
    active_properties:
      cron: "0 */6 * * *"  # Every 6 hours
      config:
        check_active: true
        mark_inactive_after_days: 7
```

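
To make the cron expressions above concrete, here is a small illustrative matcher for five-field expressions (minute, hour, day of month, month, day of week). It is a sketch for reading the schedules, not a replacement for a library such as node-cron: it only handles `*`, `*/n`, single values, ranges, and comma lists, it evaluates in UTC and ignores the `timezone` field, and it does not accept `7` for Sunday. `fieldMatches` and `cronMatches` are hypothetical names.

```typescript
// Checks one cron field against a value. `min` is the field's lowest legal value,
// so that "*/n" steps start from it (0 for minutes and hours, 1 for days and months).
function fieldMatches(field: string, value: number, min: number): boolean {
  for (const part of field.split(",")) {
    if (part === "*") {
      return true;
    }
    if (part.indexOf("*/") === 0) {
      const step = parseInt(part.slice(2), 10);
      if (step > 0 && (value - min) % step === 0) {
        return true;
      }
      continue;
    }
    // Single value ("4") or inclusive range ("1-5").
    const [loText = "", hiText = loText] = part.split("-");
    const lo = parseInt(loText, 10);
    const hi = parseInt(hiText, 10);
    if (value >= lo && value <= hi) {
      return true;
    }
  }
  return false;
}

// Returns true when `date` (read in UTC) satisfies the five-field expression.
function cronMatches(expr: string, date: Date): boolean {
  const fields = expr.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error("expected 5 cron fields, got " + fields.length);
  }
  const [minute = "", hour = "", dom = "", month = "", dow = ""] = fields;
  const timeOk =
    fieldMatches(minute, date.getUTCMinutes(), 0) &&
    fieldMatches(hour, date.getUTCHours(), 0) &&
    fieldMatches(month, date.getUTCMonth() + 1, 1);
  const domOk = fieldMatches(dom, date.getUTCDate(), 1);
  const dowOk = fieldMatches(dow, date.getUTCDay(), 0);
  // Classic cron rule: when both day fields are restricted, either one may match.
  const dayOk = dom !== "*" && dow !== "*" ? domOk || dowOk : domOk && dowOk;
  return timeOk && dayOk;
}
```

For example, `cronMatches("0 */6 * * *", date)` is true at 00:00, 06:00, 12:00, and 18:00 UTC, which is the "every 6 hours" refresh schedule.
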
---

## API Endpoints

```yaml
POST /api/v1/scraper/jobs:
  description: Create a new job
  body:
    type: string
    source: string
    config: object
  response: 201 Created

GET /api/v1/scraper/jobs:
  description: List jobs
  query:
    status: string
    source: string
    limit: integer
    offset: integer
  response: 200 OK (paginated)

GET /api/v1/scraper/jobs/:id:
  description: Get a job with its progress
  response: 200 OK

POST /api/v1/scraper/jobs/:id/pause:
  description: Pause a job
  response: 200 OK

POST /api/v1/scraper/jobs/:id/resume:
  description: Resume a job
  response: 200 OK

DELETE /api/v1/scraper/jobs/:id:
  description: Cancel a job
  response: 204 No Content

GET /api/v1/scraper/stats:
  description: Global statistics
  response: 200 OK
```

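
As an illustration of how a client might use the first two endpoints, the sketch below validates a create-job body and builds the path for a filtered, paginated listing. The paths, body fields, and query parameters come from the endpoint list above; `validateCreateJob`, `buildListJobsPath`, and the error messages are hypothetical, and the allowed `type` values are taken from the data model.

```typescript
// Job types from the ScrapingJob model.
const JOB_TYPES = ["full_scan", "incremental", "targeted", "refresh"];

// Body of POST /api/v1/scraper/jobs.
interface CreateJobRequest {
  type: string;
  source: string;
  config: Record<string, unknown>;
}

// Returns a list of problems; an empty list means the body looks acceptable.
function validateCreateJob(body: CreateJobRequest): string[] {
  const problems: string[] = [];
  if (JOB_TYPES.indexOf(body.type) === -1) {
    problems.push("unknown job type: " + body.type);
  }
  if (body.source.trim() === "") {
    problems.push("source must not be empty");
  }
  return problems;
}

// Query parameters of GET /api/v1/scraper/jobs; all are optional here (assumption).
interface ListJobsQuery {
  status?: string;
  source?: string;
  limit?: number;
  offset?: number;
}

// Builds the request path, URL-encoding each parameter that is present.
function buildListJobsPath(query: ListJobsQuery): string {
  const pairs: string[] = [];
  if (query.status !== undefined) pairs.push("status=" + encodeURIComponent(query.status));
  if (query.source !== undefined) pairs.push("source=" + encodeURIComponent(query.source));
  if (query.limit !== undefined) pairs.push("limit=" + String(query.limit));
  if (query.offset !== undefined) pairs.push("offset=" + String(query.offset));
  const base = "/api/v1/scraper/jobs";
  return pairs.length > 0 ? base + "?" + pairs.join("&") : base;
}
```

For instance, `buildListJobsPath({ status: "running", limit: 20 })` yields `/api/v1/scraper/jobs?status=running&limit=20`.
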
---

## Acceptance Criteria

- [ ] Jobs are scheduled correctly with cron expressions
- [ ] Jobs can be run on demand via the API
- [ ] Pause/resume works without losing progress
- [ ] Retries use exponential backoff
- [ ] Progress is updated in real time
- [ ] Statistics are computed correctly
- [ ] Alerts are sent on failures

---

## Dependencies

- Bull Queue (Redis)
- node-cron or similar
- Redis for job state

---

## Related User Stories

- US-SCR-004: Job scheduling

---

**Author:** Tech Lead
**Date:** 2026-01-04