michangarrito/docs/97-adr/ADR-0007-webhook-retry-strategy.md

---
id: ADR-0007
type: ADR
title: "Webhook Retry Strategy"
status: Accepted
decision_date: 2026-01-10
updated_at: 2026-01-10
simco_version: "4.0.1"
stakeholders:
  - "Equipo MiChangarrito"
tags:
  - webhooks
  - retry
  - bullmq
  - resilience
---

# ADR-0007: Webhook Retry Strategy

## Metadata

| Field | Value |
|-------|-------|
| **ID** | ADR-0007 |
| **Status** | Accepted |
| **Date** | 2026-01-10 |
| **Author** | Architecture Team |
| **Supersedes** | - |

---

## Context

MiChangarrito offers outbound webhooks to notify external systems about events. Destination endpoints can fail temporarily, so we need a retry strategy that:

1. Is resilient to transient failures
2. Does not overload the destination
3. Gives up after a reasonable number of attempts
4. Provides visibility into delivery status

---

## Decision

**We adopt exponential backoff with jitter using BullMQ, with a maximum of 6 attempts and a 30-second timeout per request.**

```
Attempt 1: immediate
Attempt 2: 1s + jitter
Attempt 3: 2s + jitter
Attempt 4: 4s + jitter
Attempt 5: 8s + jitter
Attempt 6: 16s + jitter
```

After attempt 6, the webhook is marked as failed and recorded in the logs.
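
The schedule above can be checked with a few lines of TypeScript. This is a minimal sketch: `backoffScheduleMs` is a hypothetical helper, not part of the codebase, and it assumes the delay before retry k is `base * 2^(k - 1)` with jitter left out.

```typescript
// Hypothetical helper for illustration: reproduces the schedule above
// without jitter. Assumes the delay before retry k is baseMs * 2^(k - 1).
function backoffScheduleMs(maxAttempts: number, baseMs: number): number[] {
  const delays: number[] = [0]; // attempt 1 runs immediately
  for (let retry = 1; retry < maxAttempts; retry++) {
    delays.push(baseMs * 2 ** (retry - 1));
  }
  return delays;
}

const schedule = backoffScheduleMs(6, 1000);
// schedule is [0, 1000, 2000, 4000, 8000, 16000]

const totalBackoffMs = schedule.reduce((sum, delay) => sum + delay, 0);
// totalBackoffMs is 31000, i.e. 31 seconds of waiting between attempts
```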

---

## Alternatives Considered

### Option 1: Immediate retry
- **Pros:**
  - Simple
- **Cons:**
  - Can overload the destination
  - Cascading failures

### Option 2: Fixed interval
- **Pros:**
  - Predictable
- **Cons:**
  - Does not adapt to the destination's condition
  - Thundering herd problem

### Option 3: Exponential backoff with jitter (Chosen)
- **Pros:**
  - Reduces load on the destination
  - Avoids thundering herd
  - Industry standard
- **Cons:**
  - Longer total time before definitive failure

---

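## Consequences

### Positive

1. **Resilience:** Tolerates transient failures
2. **Courtesy:** Does not overload destinations
3. **Predictability:** Retry behavior is known in advance

### Negative

1. **Latency:** Definitive failure takes about 31 seconds of cumulative backoff when the destination fails fast. If every one of the 6 attempts runs into the 30-second timeout, the worst case is roughly 3.5 minutes (6 x 30s + 31s, plus jitter)
2. **Complexity:** Delivery states have to be tracked and managed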
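
To make the latency figure concrete, the sketch below estimates the time until definitive failure from the values in this ADR (6 attempts, 1-second base delay, 30-second timeout). `worstCaseFailureMs` is a hypothetical helper, and the 1-second jitter ceiling is an assumption taken from the manual jitter example in this document.

```typescript
// Hypothetical helper for illustration only.
// Worst case: every attempt runs until the request timeout, and every retry
// waits its full backoff delay plus the maximum jitter.
function worstCaseFailureMs(
  maxAttempts: number,
  baseMs: number,
  timeoutMs: number,
  maxJitterMs: number,
): number {
  let total = maxAttempts * timeoutMs; // each attempt may hang until timeout
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += baseMs * 2 ** (retry - 1) + maxJitterMs;
  }
  return total;
}

// Destination fails instantly, no jitter: only the backoff delays count.
const fastFailMs = worstCaseFailureMs(6, 1000, 0, 0); // 31000

// Destination hangs on every attempt, with up to 1 second of jitter per retry.
const slowFailMs = worstCaseFailureMs(6, 1000, 30000, 1000); // 216000
```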

---

## Implementation

### BullMQ Configuration

```typescript
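await this.webhookQueue.add('deliver', payload, {
  attempts: 6, // 1 initial attempt + 5 retries
  backoff: {
    type: 'exponential',
    delay: 1000, // Base delay: 1 second
  },
  removeOnComplete: {
    age: 86400, // Keep completed jobs for 24 hours (age is in seconds)
    count: 1000, // Keep at most 1000 completed jobs
  },
  removeOnFail: false, // Keep failed jobs for inspection
});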
```

### Retry Logic

```typescript
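// Note: @Process is the @nestjs/bull decorator. With @nestjs/bullmq the
// processor class extends WorkerHost and implements process(job) instead.
@Process('deliver')
async handleDelivery(job: Job<WebhookPayload>) {
  try {
    const response = await this.httpService.axiosRef.post(
      job.data.url,
      job.data.payload,
      {
        timeout: 30000, // 30-second timeout per request
        headers: this.buildHeaders(job.data),
      }
    );

    // axios rejects non-2xx responses by default, so this is a defensive check
    if (response.status >= 200 && response.status < 300) {
      return { success: true, status: response.status };
    }

    throw new Error(`Unexpected status: ${response.status}`);
  } catch (error) {
    const shouldRetry = this.shouldRetry(error);

    this.logger.warn('Webhook delivery failed', {
      attempt: job.attemptsMade + 1,
      maxAttempts: job.opts.attempts,
      willRetry: shouldRetry,
      error: error.message,
    });

    if (!shouldRetry) {
      // Do not retry: mark the delivery as permanently failed.
      // Returning here completes the BullMQ job, so it will not be retried.
      // Throwing UnrecoverableError (from 'bullmq') would instead move the
      // job to the failed set without further attempts.
      await this.markAsFailed(job.data.deliveryId, error.message);
      return { success: false, permanent: true };
    }

    throw error; // BullMQ will retry according to the backoff settings
  }
}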

private shouldRetry(error: any): boolean {
  // Do not retry client errors (4xx), except 429
  if (error.response) {
    const status = error.response.status;
    if (status === 429) return true; // Rate limited
    if (status >= 400 && status < 500) return false; // Client error
    if (status >= 500) return true; // Server error
  }

  // Retry network errors and timeouts
  return true;
}
```
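
The handler above retries a 429 with the normal backoff. If the destination sends a `Retry-After` header, its value could inform the delay instead. The sketch below only parses the header. `retryAfterToMs` is a hypothetical helper, the cap is an assumed safeguard, and how the resulting delay is fed back into the queue is left open.

```typescript
// Hypothetical helper: converts a Retry-After header value into milliseconds.
// The header carries either a number of seconds or an HTTP date.
// Returns null when the header is missing or cannot be parsed.
function retryAfterToMs(
  headerValue: string | undefined,
  nowMs: number,
  capMs: number,
): number | null {
  if (!headerValue) return null;
  const trimmed = headerValue.trim();

  // Form 1: delay in seconds, e.g. "120"
  if (/^\d+$/.test(trimmed)) {
    return Math.min(Number(trimmed) * 1000, capMs);
  }

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return null;

  // Dates in the past yield 0; far-future dates are capped.
  return Math.min(Math.max(dateMs - nowMs, 0), capMs);
}

const waitMs = retryAfterToMs("5", Date.now(), 60000); // 5000
```

Capping the value keeps a misconfigured destination from parking a delivery for hours.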

### Jitter

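BullMQ's built-in `exponential` backoff does not add jitter by default. Depending on the BullMQ version, jitter has to be enabled through the backoff options or supplied by a custom backoff strategy. For manual control: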

```typescript
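// attemptsMade: attempts already made (1 before the first retry),
// so the delays are 1s, 2s, 4s, 8s, 16s as in the schedule above.
function getBackoffDelay(attemptsMade: number): number {
  const baseDelay = 1000; // 1 second
  const maxDelay = 16000; // Cap at 16 seconds
  const exponentialDelay = Math.min(baseDelay * Math.pow(2, attemptsMade - 1), maxDelay);
  const jitter = Math.random() * 1000; // 0 to 1 second of jitter
  return exponentialDelay + jitter;
}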
```
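
The function above adds a small random offset on top of the exponential delay. The AWS article listed in the references also describes "full jitter", where the whole delay is drawn at random between zero and the capped exponential value. The following is a minimal sketch of that variant, not the strategy this ADR adopted. `fullJitterDelayMs` and its parameters are hypothetical names, and the random source is injectable only so the function can be tested deterministically.

```typescript
// Sketch of "full jitter": the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^(attemptsMade - 1))).
function fullJitterDelayMs(
  attemptsMade: number, // 1 before the first retry
  baseMs: number = 1000,
  capMs: number = 16000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const exponential = Math.min(baseMs * 2 ** (attemptsMade - 1), capMs);
  return Math.floor(random() * exponential);
}

// With a fixed random source of 0.5 the delays are half the exponential value.
const firstRetryMs = fullJitterDelayMs(1, 1000, 16000, () => 0.5); // 500
const fifthRetryMs = fullJitterDelayMs(5, 1000, 16000, () => 0.5); // 8000
```

In BullMQ, a function of this shape can be registered as a custom strategy through the worker option `settings.backoffStrategy`, with jobs opting in via `backoff: { type: 'custom' }`. The exact wiring should be checked against the BullMQ version in use.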

---

## Response Codes and Actions

| Code | Category | Action | Retry |
|------|----------|--------|-------|
| 200-299 | Success | Mark as delivered | No |
| 301-308 | Redirect | Follow redirect | - |
| 400 | Bad Request | Mark as failed | No |
| 401 | Unauthorized | Mark as failed | No |
| 403 | Forbidden | Mark as failed | No |
| 404 | Not Found | Mark as failed | No |
| 429 | Rate Limited | Retry with delay | Yes |
| 500-599 | Server Error | Retry | Yes |
| Timeout | Network | Retry | Yes |
| ECONNREFUSED | Network | Retry | Yes |
---

## Monitoring

### Metrics

```typescript
// Prometheus metrics
webhook_delivery_attempts_total{status="success|retry|failed"}
webhook_delivery_duration_seconds
webhook_delivery_retries_total
```

### Alerts

- `webhook_delivery_failure_rate > 0.1` - More than 10% of deliveries failing
- `webhook_queue_length > 100` - Queue backlog growing
- `webhook_delivery_duration_seconds_p99 > 25` - High latency, with p99 approaching the 30-second request timeout

---

## References

- [Exponential Backoff](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
- [BullMQ Retries](https://docs.bullmq.io/guide/retrying-failing-jobs)
- [Stripe Webhooks](https://stripe.com/docs/webhooks/best-practices)
- [INT-014: Webhooks Outbound](../02-integraciones/INT-014-webhooks-outbound.md)

---

**Decision date:** 2026-01-10
**Authors:** Architecture Team