# Technical plan: event-driven ingestion pipeline

Status: **approved** · Owner: platform team · Reviewers: 3 · Target: end of Q3

## Context

Ingestion today is a cron sweep: every 10 minutes a worker lists changed objects per source, diffs against our index, and processes the delta synchronously. At ~40 connected sources per workspace this held up fine. At 400 it doesn't: sweeps overlap, hot sources starve cold ones, and a single slow API (looking at you, SharePoint) pushes p95 freshness past 30 minutes.

The goal is sub-2-minute freshness for webhook-capable sources and bounded staleness for everything else, without rewriting the extractors.

## Design

Three changes, in dependency order:

### 1. Event bus in front of the extractors

Every change signal — webhook, poll diff, manual refresh — becomes a `source.changed` event on the queue with `(source_id, object_ref, change_kind, observed_at)`. Extractors consume events instead of being driven by the sweep.

- Dedup window of 30s per `(source_id, object_ref)` so webhook storms collapse into one extraction.
- Events are idempotent claims, not work items: the consumer re-checks the object's version before extracting, so replays are safe.

### 2. Per-source token buckets

The sweep's implicit fairness (every source gets a turn) disappears with a shared queue, so fairness becomes explicit: each source gets a token bucket sized by its API quota. The consumer takes the next event whose source has tokens. Hot sources can't starve cold ones because their bucket drains.

### 3. Polling demoted to a fallback

Sources without webhooks keep the sweep, but it only *emits events* — it shares the dedup, fairness, and retry machinery with the webhook path. One code path to maintain after the cutover, not two.

## What we are explicitly not doing

- **Not** moving extraction itself to streaming. Extractors stay batch; only dispatch changes.
- **Not** adding a new queue technology. This rides the existing Celery + Redis setup until quota pressure proves otherwise.
- **Not** preserving the old sweep behind a flag. When the bus path passes the soak test, the sweep dispatch is deleted in the same PR.

## Rollout

1. Bus + consumers shadow-run against the sweep for one week; we compare extraction counts per source (they should match within dedup tolerance).
2. Webhook sources cut over first — they're the ones with user-visible freshness wins.
3. Poll sources cut over per connector as each passes the soak comparison.
4. Delete the sweep dispatcher.

## Open questions

- Whether manual "refresh now" should jump the token bucket (leaning yes, capped at 1/min per source).
- Where freshness SLOs surface to users — per-source badge vs. workspace-level indicator.

---

*Comments welcome — select any line to leave one inline.*
