Reading group · Week 24

What we're reading from ICLR 2026

Five papers from this year's crop that changed how we think about agent memory, retrieval, and multi-agent orchestration. Notes from Tuesday's reading group.

1. Decaying Traces: Memory Consolidation for Long-Horizon Agents

Oral · Representation learning track

The authors treat agent memory like a spaced-repetition system: episodic traces decay unless they're retrieved, and retrieval itself strengthens consolidation into a semantic store. The eval on 30-day agent deployments is the interesting part: most memory papers test on synthetic recall, while this one measures whether the agent changes its behavior because of what it remembers.

Our take: the decay-on-non-retrieval mechanism maps almost exactly onto how we age out session transcripts. Worth prototyping against our recall benchmarks.
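
To make the prototype idea concrete, here is a minimal sketch of decay-on-non-retrieval with consolidation. It is not the paper's algorithm: the exponential half-life, the additive retrieval boost, the three-retrieval consolidation rule, and every name (`Trace`, `DecayingMemory`, `sweep`) are our own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    text: str
    strength: float = 1.0      # assumed starting strength
    last_access: float = 0.0   # time in days
    retrievals: int = 0


class DecayingMemory:
    """Toy episodic store: traces decay unless retrieved (all parameters hypothetical)."""

    def __init__(self, half_life_days=7.0, boost=1.0,
                 consolidate_after=3, forget_below=0.05):
        self.half_life_days = half_life_days
        self.boost = boost
        self.consolidate_after = consolidate_after
        self.forget_below = forget_below
        self.episodic = {}   # key -> Trace, subject to decay
        self.semantic = {}   # key -> text, no decay once consolidated

    def _current_strength(self, trace, now):
        # Exponential decay since the last access (an assumption, not the paper's curve).
        elapsed = now - trace.last_access
        return trace.strength * 0.5 ** (elapsed / self.half_life_days)

    def write(self, key, text, now):
        self.episodic[key] = Trace(text, last_access=now)

    def retrieve(self, key, now):
        if key in self.semantic:
            return self.semantic[key]
        trace = self.episodic.get(key)
        if trace is None:
            return None
        # Retrieval resets the decay clock and strengthens the trace.
        trace.strength = self._current_strength(trace, now) + self.boost
        trace.last_access = now
        trace.retrievals += 1
        if trace.retrievals >= self.consolidate_after:
            # Enough retrievals: promote to the semantic store.
            self.semantic[key] = trace.text
            del self.episodic[key]
        return trace.text

    def sweep(self, now):
        """Drop traces whose decayed strength fell below the threshold."""
        stale = [k for k, t in self.episodic.items()
                 if self._current_strength(t, now) < self.forget_below]
        for k in stale:
            del self.episodic[k]
        return stale
```

In this sketch, calling `sweep` on a schedule plays the role of our transcript aging job; the difference from what we do today is that a retrieval resets the clock instead of age alone deciding what gets dropped.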

2. Retrieval Is a Policy, Not a Lookup

Spotlight

Frames the retrieve-or-not decision as a learned policy with a token-budget penalty, instead of always-retrieve RAG. The headline number: 41% fewer retrieval calls at equal downstream accuracy on multi-hop QA. The ablation that matters is Table 4 — most of the win comes from learning when not to retrieve.

Our take: we do this with heuristics today (query length, entity novelty). A learned gate is overkill for us at current scale, but the eval framework is stealable.
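
For reference, a rough sketch of the two pieces discussed above: a reward with a token-budget penalty, and a heuristic gate of the kind we run today. The penalty weight `lam`, the word-count threshold, the capitalization-based entity check, and all function names are assumptions of ours, not the paper's formulation or our production code.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    correct: bool          # did the downstream answer match the reference?
    retrieved_tokens: int  # tokens pulled in by retrieval (0 if the gate said no)


def penalized_reward(ep, lam=0.001):
    """Task reward minus a per-token retrieval penalty (lam is a hypothetical weight)."""
    return float(ep.correct) - lam * ep.retrieved_tokens


def heuristic_gate(query, known_entities, min_words=8):
    """Retrieve if the query is long or mentions a capitalized term we have not seen.

    The first word is skipped so sentence-initial capitals do not count as entities.
    """
    words = query.split()
    candidates = [w.strip("?,.!") for w in words[1:] if w[:1].isupper()]
    novel = [w for w in candidates if w.lower() not in known_entities]
    return len(words) >= min_words or bool(novel)


def avoided_call_rate(decisions):
    """Fraction of retrieval calls saved relative to an always-retrieve baseline."""
    return 1 - sum(decisions) / len(decisions)
```

The part worth borrowing is the comparison itself: run the gate and the always-retrieve baseline over the same queries, then report `avoided_call_rate` alongside accuracy so that fewer calls are only counted as a win at equal accuracy.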

3. Schema Drift in Tool-Using Agents

Poster

A careful empirical study of what happens when an API's schema changes under a deployed agent. The finding that stuck: agents fail silently 3× more often when a field is renamed than when it's removed, because validation catches absence but not aliasing.

Our take: argues for contract tests on tool schemas, not just response validation. We filed two tickets out of this one.
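
A contract test in this sense can be as small as diffing a pinned copy of the tool schema against the live one. The sketch below assumes schemas are flattened to a `{field_name: type_name}` dict, which is our simplification; the rename heuristic (a removed field paired with an added field of the same type) and the function names are likewise hypothetical.

```python
def diff_schema(pinned, live):
    """Compare a pinned tool schema with the live one.

    Both arguments are {field_name: type_name} dicts (an assumed, simplified shape).
    """
    removed = sorted(set(pinned) - set(live))
    added = sorted(set(live) - set(pinned))
    retyped = sorted(k for k in set(pinned) & set(live) if pinned[k] != live[k])
    # A removed field plus an added field of the same type is a likely rename,
    # the case that response validation alone tends to miss.
    suspected_renames = [(old, new) for old in removed for new in added
                         if pinned[old] == live[new]]
    return {
        "removed": removed,
        "added": added,
        "retyped": retyped,
        "suspected_renames": suspected_renames,
    }


def assert_contract(pinned, live):
    """Fail loudly on any drift instead of letting the agent fail silently."""
    drift = diff_schema(pinned, live)
    if any(drift.values()):
        raise AssertionError(f"tool schema drift: {drift}")
```

For example, pinning `{"user_id": "string", "limit": "integer"}` and then seeing `{"uid": "string", "limit": "integer"}` live reports `("user_id", "uid")` as a suspected rename, so the test fails before the agent ever sends the old field name.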

4. Many Hands: Credit Assignment in Multi-Agent Pipelines

Poster

When a five-agent pipeline produces a wrong answer, which agent do you retrain? Their counterfactual replay method re-runs the pipeline with each agent swapped for a reference policy and attributes blame by outcome delta. Expensive, but it turns a debugging art into a measurement.
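
As we understood it, the replay loop looks roughly like the sketch below. It assumes a strictly sequential pipeline of deterministic callables and a scalar `score` function; those assumptions, and the names `run_pipeline` and `attribute_blame`, are ours and not taken from the paper.

```python
def run_pipeline(agents, task_input):
    """Feed each agent's output into the next (assumes a linear pipeline)."""
    x = task_input
    for agent in agents:
        x = agent(x)
    return x


def attribute_blame(agents, references, task_input, score):
    """Swap one agent at a time for its reference policy and record the outcome delta.

    A large positive delta means the outcome improves when that agent is replaced,
    so that agent is the likeliest culprit.
    """
    baseline = score(run_pipeline(agents, task_input))
    deltas = []
    for i in range(len(agents)):
        swapped = list(agents)
        swapped[i] = references[i]
        deltas.append(score(run_pipeline(swapped, task_input)) - baseline)
    return deltas
```

The cost is visible in the loop: one full pipeline run per agent per task, and with stochastic agents each swap would need several replays to average out noise.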

5. The Context Window Is Not the Working Set

Oral

The provocation: long-context models still behave as if they have a small working set, with attention concentrated on a power-law subset of tokens. Performance on their "needle-swarm" benchmark saturates well before the context limit. External memory with explicit retrieval beats raw context-stuffing on both cost and accuracy past ~60k tokens.

Our take: validates the architecture bet. Stuff less, retrieve better.