---
id: ADR-0007
type: ADR
title: "Webhook Retry Strategy"
status: Accepted
decision_date: 2026-01-10
updated_at: 2026-01-10
simco_version: "4.0.1"
stakeholders:
  - "MiChangarrito Team"
tags:
  - webhooks
  - retry
  - bullmq
  - resilience
---

# ADR-0007: Webhook Retry Strategy

## Metadata

| Field | Value |
|-------|-------|
| **ID** | ADR-0007 |
| **Status** | Accepted |
| **Date** | 2026-01-10 |
| **Author** | Architecture Team |
| **Supersedes** | - |

---

## Context

MiChangarrito offers outbound webhooks to notify external systems about events. Destination endpoints can fail temporarily, so we need a retry strategy that:

1. Is resilient to transient failures
2. Does not overload the destination
3. Eventually gives up after a reasonable number of attempts
4. Provides visibility into delivery state

---

## Decision

**We adopt exponential backoff with jitter using BullMQ, with a maximum of 6 attempts and a 30-second timeout per request.**

```
Attempt 1: Immediate
Attempt 2: 1s + jitter
Attempt 3: 2s + jitter
Attempt 4: 4s + jitter
Attempt 5: 8s + jitter
Attempt 6: 16s + jitter
```

After attempt 6, the webhook is marked as failed and recorded in the logs.

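
The schedule above follows from a 1-second base delay that doubles on every retry. A minimal sketch of that arithmetic, ignoring jitter; the name `retrySchedule` is illustrative and not part of the codebase:

```typescript
// Delay (ms) waited before each attempt, without jitter.
// Attempt 1 runs immediately; attempt n (n >= 2) waits baseDelayMs * 2^(n - 2).
function retrySchedule(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= attempts; attempt++) {
    delays.push(attempt === 1 ? 0 : baseDelayMs * 2 ** (attempt - 2));
  }
  return delays;
}

const schedule = retrySchedule(6, 1000);
// [0, 1000, 2000, 4000, 8000, 16000]

const totalBackoffMs = schedule.reduce((sum, delay) => sum + delay, 0);
// 31000
```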

---

## Alternatives Considered

### Option 1: Immediate retry
- **Pros:**
  - Simple
- **Cons:**
  - Can overload the destination
  - Cascading failures

### Option 2: Fixed interval
- **Pros:**
  - Predictable
- **Cons:**
  - Does not adapt to how long the destination has been failing
  - Thundering herd problem

### Option 3: Exponential backoff with jitter (Chosen)
- **Pros:**
  - Reduces load on the destination
  - Avoids thundering herd
  - Industry standard
- **Cons:**
  - Longer total time before definitive failure

---

## Consequences

### Positive

1. **Resilience:** Tolerates transient failures
2. **Courtesy:** Does not overload destinations
3. **Predictability:** Known, bounded behavior

### Negative

1. **Latency:** Definitive failure takes about 31 seconds of backoff delay alone (1 + 2 + 4 + 8 + 16 s, before jitter); if every attempt also runs into the 30-second timeout, the worst case is roughly 3.5 minutes
2. **Complexity:** Delivery state must be tracked

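
The latency figure can be checked with a little arithmetic. The sketch below assumes every attempt runs until the request timeout, and it ignores jitter and time spent waiting in the queue; `timeToDefinitiveFailureMs` is a hypothetical helper name:

```typescript
// Time until the last attempt has failed:
// sum of backoff delays between attempts + time spent in each attempt.
function timeToDefinitiveFailureMs(
  attempts: number,
  baseDelayMs: number,
  perAttemptMs: number,
): number {
  let backoffMs = 0;
  for (let retry = 0; retry < attempts - 1; retry++) {
    backoffMs += baseDelayMs * 2 ** retry; // 1s, 2s, 4s, 8s, 16s
  }
  return backoffMs + attempts * perAttemptMs;
}

// Destination rejects instantly on every attempt: backoff only.
const backoffOnlyMs = timeToDefinitiveFailureMs(6, 1000, 0); // 31000

// Destination hangs until the 30-second timeout on every attempt.
const worstCaseMs = timeToDefinitiveFailureMs(6, 1000, 30000); // 211000
```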

---

## Implementation

### BullMQ Configuration

```typescript
await this.webhookQueue.add('deliver', payload, {
  attempts: 6, // 1 initial attempt + 5 retries
  backoff: {
    type: 'exponential',
    delay: 1000, // Base delay: 1 second
  },
  removeOnComplete: {
    age: 86400, // 24 hours
    count: 1000,
  },
  removeOnFail: false, // Keep failed jobs for inspection
});
```

### Retry Logic

```typescript
@Process('deliver')
async handleDelivery(job: Job<WebhookPayload>) {
  try {
    const response = await this.httpService.axiosRef.post(
      job.data.url,
      job.data.payload,
      {
        timeout: 30000, // 30-second timeout per request
        headers: this.buildHeaders(job.data),
      }
    );

    if (response.status >= 200 && response.status < 300) {
      return { success: true, status: response.status };
    }

    throw new Error(`Unexpected status: ${response.status}`);
  } catch (error: any) {
    const shouldRetry = this.shouldRetry(error);
    const attempt = job.attemptsMade + 1;
    const maxAttempts = job.opts.attempts ?? 1;

    this.logger.warn('Webhook delivery failed', {
      attempt,
      maxAttempts,
      // A retryable error on the last attempt will not be retried
      willRetry: shouldRetry && attempt < maxAttempts,
      error: error.message,
    });

    if (!shouldRetry) {
      // Do not retry; mark as permanently failed.
      // Returning instead of throwing completes the job, so BullMQ stops retrying.
      await this.markAsFailed(job.data.deliveryId, error.message);
      return { success: false, permanent: true };
    }

    throw error; // BullMQ will retry while attempts remain
  }
}

private shouldRetry(error: any): boolean {
  // Do not retry client errors (4xx) except 429
  if (error.response) {
    const status = error.response.status;
    if (status === 429) return true; // Rate limited
    if (status >= 400 && status < 500) return false; // Client error
    if (status >= 500) return true; // Server error
  }

  // Retry network errors and timeouts
  return true;
}
```

### Jitter

BullMQ's built-in `exponential` backoff does not add jitter on its own; it has to be enabled explicitly or supplied through a custom backoff strategy. For manual control:

```typescript
function getBackoffDelay(attemptsMade: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 16000; // Cap at 16 seconds
  // attemptsMade counts failed attempts so far: 1 -> 1s, 2 -> 2s, ..., 5 -> 16s
  const exponentialDelay = Math.min(baseDelay * Math.pow(2, attemptsMade - 1), maxDelay);
  const jitter = Math.random() * 1000; // 0-1 second of jitter
  return exponentialDelay + jitter;
}
```

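
To make the delay calculation testable, the random source can be injected. The sketch below is one possible shape, not the project's implementation: `BackoffConfig` and `makeBackoffStrategy` are hypothetical names. The returned function takes the number of failed attempts and returns a delay in milliseconds, which is the shape BullMQ documents for custom backoff strategies (a `backoffStrategy` function in the worker `settings`, selected per job with `backoff: { type: 'custom' }`). That wiring is not shown here and should be checked against the installed BullMQ version.

```typescript
interface BackoffConfig {
  baseDelayMs: number;   // delay before the first retry
  maxDelayMs: number;    // cap on the exponential part
  maxJitterMs: number;   // upper bound of the random extra delay
  random?: () => number; // returns a value in [0, 1); injectable for tests
}

// Builds a function that maps "failed attempts so far" to a delay in ms.
function makeBackoffStrategy(config: BackoffConfig): (attemptsMade: number) => number {
  const random = config.random ?? Math.random;
  return (attemptsMade: number): number => {
    const exponent = Math.max(attemptsMade - 1, 0);
    const exponential = Math.min(config.baseDelayMs * 2 ** exponent, config.maxDelayMs);
    return Math.round(exponential + random() * config.maxJitterMs);
  };
}

// Values taken from this ADR: 1s base, 16s cap, up to 1s of jitter.
const backoffStrategy = makeBackoffStrategy({
  baseDelayMs: 1000,
  maxDelayMs: 16000,
  maxJitterMs: 1000,
});
```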

---

## Response Codes and Actions

| Code | Category | Action | Retry |
|------|----------|--------|-------|
| 200-299 | Success | Mark as delivered | No |
| 301-308 | Redirect | Follow redirect | - |
| 400 | Bad Request | Mark as failed | No |
| 401 | Unauthorized | Mark as failed | No |
| 403 | Forbidden | Mark as failed | No |
| 404 | Not Found | Mark as failed | No |
| 429 | Rate Limited | Retry with delay | Yes |
| 500-599 | Server Error | Retry | Yes |
| Timeout | Network | Retry | Yes |
| ECONNREFUSED | Network | Retry | Yes |

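
The table can also be read as a single classification function. The sketch below is illustrative only: the type and function names are hypothetical. It assumes that 4xx codes not listed in the table are treated like the listed ones (no retry), which matches the `shouldRetry` logic, and that anything else unexpected is marked as failed.

```typescript
type DeliveryAction = 'mark_delivered' | 'follow_redirect' | 'mark_failed' | 'retry';

// Outcome of one delivery attempt: an HTTP status, or a network-level error
// code such as 'ECONNREFUSED' when no response was received.
type DeliveryOutcome =
  | { kind: 'response'; status: number }
  | { kind: 'network_error'; code: string };

function classifyOutcome(outcome: DeliveryOutcome): DeliveryAction {
  if (outcome.kind === 'network_error') return 'retry'; // timeouts, refused connections
  const status = outcome.status;
  if (status >= 200 && status < 300) return 'mark_delivered';
  if (status >= 301 && status <= 308) return 'follow_redirect';
  if (status === 429) return 'retry'; // rate limited
  if (status >= 500 && status < 600) return 'retry'; // server error
  return 'mark_failed'; // ASSUMPTION: remaining 4xx and any unexpected status
}
```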

---

## Monitoring

### Metrics

```typescript
// Prometheus metrics
webhook_delivery_attempts_total{status="success|retry|failed"}
webhook_delivery_duration_seconds
webhook_delivery_retries_total
```

### Alerts

- `webhook_delivery_failure_rate > 0.1` - More than 10% of deliveries failing
- `webhook_queue_length > 100` - Queue backing up
- `webhook_delivery_duration_seconds_p99 > 25` - High latency, approaching the 30-second timeout

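
As a rough illustration of how the three thresholds combine, the sketch below evaluates them against a snapshot of metric values. All names are hypothetical, and the failure-rate definition (failed attempts divided by all attempts) is an assumption, since this ADR does not define how `webhook_delivery_failure_rate` is derived.

```typescript
interface AttemptCounts {
  success: number;
  retry: number;
  failed: number;
}

interface WebhookSnapshot {
  attempts: AttemptCounts;    // webhook_delivery_attempts_total by status
  queueLength: number;        // webhook_queue_length
  p99DurationSeconds: number; // webhook_delivery_duration_seconds_p99
}

// ASSUMPTION: failure rate = failed attempts / all attempts.
function failureRate(counts: AttemptCounts): number {
  const total = counts.success + counts.retry + counts.failed;
  return total === 0 ? 0 : counts.failed / total;
}

// Returns the names of the alerts whose thresholds are exceeded.
function firingAlerts(snapshot: WebhookSnapshot): string[] {
  const alerts: string[] = [];
  if (failureRate(snapshot.attempts) > 0.1) alerts.push('webhook_delivery_failure_rate');
  if (snapshot.queueLength > 100) alerts.push('webhook_queue_length');
  if (snapshot.p99DurationSeconds > 25) alerts.push('webhook_delivery_duration_seconds_p99');
  return alerts;
}
```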

---

## References

- [Exponential Backoff](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
- [BullMQ Retries](https://docs.bullmq.io/guide/retrying-failing-jobs)
- [Stripe Webhooks](https://stripe.com/docs/webhooks/best-practices)
- [INT-014: Webhooks Outbound](../02-integraciones/INT-014-webhooks-outbound.md)

---

**Decision date:** 2026-01-10
**Authors:** Architecture Team