# Postmortem: the retry storm of June 3rd

Severity: SEV-2 · Duration: 47 minutes · User impact: delayed syncs, no data loss · Author: on-call

## Summary

A malformed pagination cursor from one connector caused its extraction job to fail *after* side effects but *before* checkpointing. The retry policy did exactly what it was configured to do, which was the problem: 8 retries × 200 affected objects × one enqueued webhook each ≈ a self-inflicted thundering herd that saturated the worker pool for everyone.

## Timeline (UTC)

- **14:02** — Vendor API deploys a change: cursors now expire after 60s instead of 10min.
- **14:07** — First `InvalidCursor` failures. Jobs retry from the *start* of pagination, re-emitting events for pages already processed.
- **14:11** — Queue depth alert. On-call assumes vendor outage, waits.
- **14:26** — Worker pool at 100% on one connector's jobs; other sources' freshness SLOs breached. SEV-2 declared.
- **14:38** — Mitigation: connector paused, queue drained of its events (12k), pool recovers.
- **14:49** — Patched cursor refresh shipped; connector resumed; backlog cleared by 15:20.

## What went wrong, in order of importance

1. **Retries restarted the whole pagination loop instead of resuming.** The checkpoint lived at job level, not page level. One bad page at position 190 meant re-fetching 189 good pages — and re-emitting their events.
2. **Emitted events weren't tied to checkpoint success.** We produced side effects for work we then declared failed. Replay-safety assumed consumers dedup; the dedup window (30s) was tuned for webhook storms, not multi-minute retry loops.
3. **No per-source circuit breaker.** A connector failing 100% of jobs for 20 minutes should isolate itself. Fairness buckets cap *throughput* per source; they don't cap *failure spend*.

## What went right

- Pausing one connector was a one-command mitigation and affected only that source.
- No data loss or duplication reached users — downstream idempotency held even though we leaned on it harder than designed.

## Action items

- [x] Page-level checkpoints; retries resume, never restart. *(shipped June 5)*
- [x] Events emit only after checkpoint commit — outbox pattern. *(shipped June 9)*
- [ ] Circuit breaker: 10 consecutive failures on a source → auto-pause + alert, owner: platform.
- [ ] Dedup window becomes per-source and adaptive to retry depth.
- [ ] Synthetic canary that exercises cursor expiry against each vendor weekly.

## The lesson we keep relearning

Retry policies encode an assumption about *why* things fail. Ours assumed transient network blips. When the failure is deterministic — same input, same error — retries are just amplification. Classify before retrying.

---

*Discussion welcome. Select any line to comment.*
