# Implementation notes: hybrid retrieval (BM25 + vectors + recency)

Working notes from shipping three-way hybrid retrieval. Written for the next person who has to tune this — most of what's below we learned the hard way.

## The shape of the system

Every query fans out to three scorers and the results merge with reciprocal rank fusion:

1. **BM25** over titles, headings, and body text — wins on exact identifiers, error strings, people's names.
2. **Dense vectors** (one embedding per chunk, ~400 tokens, 15% overlap) — wins on paraphrase and "that doc about the thing" queries.
3. **Recency prior** — a multiplicative boost that decays with document age, half-life 14 days for activity-like content, 90 days for reference-like content.

RRF beat learned weight blending in our evals and has no parameters to drift. We use `k = 60`, which everyone uses, because it genuinely doesn't matter much.

## Things that mattered more than expected

**Chunking beats embedding choice.** Swapping embedding models moved recall@10 by ~2 points. Fixing chunk boundaries to respect headings and never split tables moved it by 9. If a chunk starts mid-sentence, no embedding model saves you.

**Titles need their own index.** A third of real queries are degenerate title lookups ("q3 roadmap"). Title-field BM25 with a 3× field boost handles these; before that, vector search would happily return four *other* roadmaps.

**The recency prior must be class-conditional.** A flat half-life buried evergreen reference docs under last week's meeting notes. Classifying content as activity vs. reference (cheap heuristic: edit cadence + structure) and giving each its own half-life fixed the single biggest complaint from dogfooding.

## Things that mattered less than expected

- Reranking with a cross-encoder: +1.5 recall@10, +180ms p95. Shipped behind a flag, off by default.
- Query expansion: helped synthetic evals, was a wash on real query logs. Real queries are shorter and weirder than synthetic ones.

## Failure modes we now test for

| Query type | Failure before | Guard now |
|---|---|---|
| Exact error string | Vectors return "similar" errors | BM25 exact-match short-circuit |
| Person name | Embeddings collapse name variants | Field-boosted BM25 on author/mentions |
| "Latest X" | Stale doc with perfect text match | Recency prior + freshness tiebreak |
| Long pasted snippet | Query embedding saturates | Truncate to head+tail before embedding |

## Eval setup

180 real queries harvested from logs, labeled by the team (about 4 hours of grading). Synthetic benchmarks were directionally misleading twice; we don't use them for decisions anymore. The harness re-runs nightly against the live index and posts deltas — retrieval regressions show up as a diff in the morning, not as a vibe two weeks later.

---

*Questions or war stories? Select a line and leave a comment.*