Reading group · Week 24
What we're reading from ICLR 2026
1. Decaying Traces: Memory Consolidation for Long-Horizon Agents
Oral · Representation learning track
The authors treat agent memory like a spaced-repetition system: episodic traces decay unless they're retrieved, and retrieval itself strengthens consolidation into a semantic store. The eval on 30-day agent deployments is the interesting part: most memory papers test on synthetic recall, whereas this one measures whether the agent changes behavior because of what it remembers.
Our take: the decay-on-non-retrieval mechanism maps almost exactly onto how we age out session transcripts. Worth prototyping against our recall benchmarks.
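For anyone picking up the prototype, here is a rough sketch of decay-on-non-retrieval. It is not the paper's method: the exponential form, the `half_life_days` default, the `boost` multiplier, and the `Trace` fields are all our own assumptions, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    text: str
    strength: float = 1.0     # hypothetical consolidation strength
    last_access: float = 0.0  # day of the most recent retrieval


def retention(trace, now, half_life_days=7.0):
    """Assumed exponential decay since last retrieval; stronger traces decay slower."""
    elapsed = now - trace.last_access
    return 0.5 ** (elapsed / (half_life_days * trace.strength))


def retrieve(trace, now, boost=1.5):
    """Retrieval resets the decay clock and strengthens the trace."""
    trace.strength *= boost
    trace.last_access = now
    return trace.text


def sweep(traces, now, threshold=0.1):
    """Keep only traces whose retention is still at or above the threshold."""
    return [t for t in traces if retention(t, now) >= threshold]


# Usage: one trace is never retrieved, the other is retrieved once on day 5.
stale = Trace("user prefers tabs")
used = Trace("deploy target is staging")
retrieve(used, now=5.0)
survivors = sweep([stale, used], now=30.0)
```

With these made-up defaults, the untouched trace falls below the threshold by day 30 while the one retrieved on day 5 survives. Swapping `sweep` in for our transcript aging rule would be the first thing to test against the recall benchmarks.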
2. Retrieval Is a Policy, Not a Lookup
Spotlight
Frames the retrieve-or-not decision as a learned policy with a token-budget penalty, instead of always-retrieve RAG. The headline number: 41% fewer retrieval calls at equal downstream accuracy on multi-hop QA. The ablation that matters is Table 4 — most of the win comes from learning when not to retrieve.
Our take: we do this with heuristics today (query length, entity novelty). A learned gate is overkill for us at current scale, but the eval framework is stealable.
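To make the comparison concrete, a heuristic gate of the kind we mean might look like the sketch below. The function names, thresholds, and the assumption that entities are extracted upstream are illustrative, not a description of our production code or of the paper's learned policy. `call_reduction` is the metric behind a figure like "41% fewer calls".

```python
def should_retrieve(query_tokens, query_entities, known_entities,
                    min_tokens=8, novelty_threshold=0.5):
    """Heuristic gate: retrieve for long queries or mostly unseen entities.

    query_entities and known_entities are sets of strings produced by some
    upstream extractor (assumed; not shown here).
    """
    if query_entities:
        unseen = query_entities - known_entities
        novelty = len(unseen) / len(query_entities)
    else:
        novelty = 0.0
    return len(query_tokens) >= min_tokens or novelty >= novelty_threshold


def call_reduction(always_retrieve_calls, gated_calls):
    """Fraction of retrieval calls saved relative to always-retrieve."""
    return 1.0 - gated_calls / always_retrieve_calls


# Usage: a short query about a known entity skips retrieval,
# a short query about an unseen entity does not.
known = {"acme", "billing"}
skip = should_retrieve(["status", "of", "acme"], {"acme"}, known)
fetch = should_retrieve(["who", "is", "zorblat"], {"zorblat"}, known)
```

A learned gate would replace the two thresholds with a policy trained against a token-budget penalty. The stealable part is evaluating the gate by calls saved at fixed downstream accuracy.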
3. Schema Drift in Tool-Using Agents
Poster
A careful empirical study of what happens when an API's schema changes under a deployed agent. The finding that stuck: agents fail silently 3× more often when a field is renamed than when it's removed, because validation catches absence but not aliasing.
Our take: argues for contract tests on tool schemas, not just response validation. We filed two tickets out of this one.
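A minimal sketch of what such a contract test could check, assuming we pin each tool's schema as a flat mapping of field name to type (a simplification; real schemas are nested). The `diff_schema` helper and the rename heuristic are ours, not from the paper.

```python
def diff_schema(pinned, live):
    """Compare a pinned field->type mapping against the live one."""
    removed = sorted(set(pinned) - set(live))
    added = sorted(set(live) - set(pinned))
    retyped = sorted(k for k in set(pinned) & set(live) if pinned[k] != live[k])
    # Heuristic: a removed field plus an added field of the same type
    # is flagged as a possible rename for a human to confirm.
    possible_renames = [(r, a) for r in removed for a in added
                        if pinned[r] == live[a]]
    return {
        "removed": removed,
        "added": added,
        "retyped": retyped,
        "possible_renames": possible_renames,
    }


# Usage: the API renamed user_id to uid between deployments.
pinned = {"user_id": "string", "amount": "number"}
live = {"uid": "string", "amount": "number"}
report = diff_schema(pinned, live)
contract_ok = not (report["removed"] or report["retyped"])
```

Running this in CI against each tool's published schema would surface a rename before the agent sees it, instead of relying on response validation after the fact.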
4. Many Hands: Credit Assignment in Multi-Agent Pipelines
Poster
When a five-agent pipeline produces a wrong answer, which agent do you retrain? Their counterfactual replay method re-runs the pipeline with each agent swapped for a reference policy and attributes blame by outcome delta. Expensive, but it turns a debugging art into a measurement.
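The replay loop as we understood it, in sketch form. We assume a strictly sequential pipeline of callables and a scalar `score` function. The paper's actual setup may differ, and every name here is hypothetical.

```python
def run_pipeline(agents, x):
    """Feed the input through each agent in order."""
    for agent in agents:
        x = agent(x)
    return x


def attribute_blame(agents, references, x, score):
    """Swap each agent for its reference policy and record the score delta.

    A large positive delta means replacing that agent improved the outcome,
    so that agent carries more of the blame.
    """
    baseline = score(run_pipeline(agents, x))
    deltas = {}
    for i, ref in enumerate(references):
        swapped = agents[:i] + [ref] + agents[i + 1:]
        deltas[i] = score(run_pipeline(swapped, x)) - baseline
    return deltas


# Usage: a toy three-stage pipeline where the middle stage is buggy.
agents = [lambda v: v + 1, lambda v: v * 3, lambda v: v - 1]
references = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 1]
target = 5  # (2 + 1) * 2 - 1


def score(out):
    return -abs(out - target)


blame = attribute_blame(agents, references, 2, score)
```

In the toy case only the middle stage gets a nonzero delta. The cost is one full pipeline run per agent per example, which is where the "expensive" comes from.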
5. The Context Window Is Not the Working Set
Oral
The provocation: long-context models still behave like they have a small working set, attending to a power-law subset of tokens. Performance on their "needle-swarm" benchmark saturates well before the context limit. External memory with explicit retrieval beats raw context-stuffing on cost and accuracy past ~60k tokens.
Our take: validates the architecture bet. Stuff less, retrieve better.