# Trend 4 — Test-time training, finally working at scale

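Several papers turn fixed-context retrospection into actual weight updates at test time; others adapt only the input or the decoding policy and leave parameters untouched. The reported gains are large, for example a 4B model going from 40% to 70% on HMMT 2025.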

## Papers

- `spotlight` · `iEmRSwdzyw` — [Test-Time Self-Distillation](https://openreview.net/forum?id=iEmRSwdzyw) — SDPO — the same model plays both self-teacher (conditioned on textual feedback) and student. **The self-teacher solves <1% of LiveCodeBench-very-hard problems, but its credit assignment is still informative enough to bootstrap improvement.**
- `spotlight` · `GjoUJTfXiW` — [Adaptive Meta-Curriculum for Test-Time Self-Improvement](https://openreview.net/forum?id=GjoUJTfXiW) — AMC-TSI — meta-learns a difficulty estimator and an operator selector, allocates 6× more compute to hard problems, and reaches 95% of oracle accuracy on MATH with 2.3× less total compute.
- `poster` · `DROMQyqM52` — [Reasoning Cache: Learning to Extrapolate to Long Lengths via Short-Length RL](https://openreview.net/forum?id=DROMQyqM52) — Reasoning Cache — short-budget RL + summarize-regenerate at inference: 4B model goes 40% → 70% on HMMT 2025 with 512k test tokens.
- `short_paper` · `G0GE1xbR0w` — [Test-Time Meta-Adaptation with Self-Synthesis](https://openreview.net/forum?id=G0GE1xbR0w) — MASS — meta-learn an LLM that synthesizes per-problem training data on the fly.
- `poster` · `7fIqKZ92Fy` — [Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation](https://openreview.net/forum?id=7fIqKZ92Fy) — Adaptive Decoding — RL policy adjusts temperature/top-p as a function of hidden state; up to +88% relative on BookSum.
- `poster` · `sSwXGq1RRI` — [Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping](https://openreview.net/forum?id=sSwXGq1RRI) — Constructive Distortion — re-warp the input image based on the model's own attention; gains across 10 VL benchmarks with no parameter updates.
- `poster` · `UcIFJPXqrB` — [AlphaApollo: A System for Deep Agentic Reasoning](https://openreview.net/forum?id=UcIFJPXqrB) — AlphaApollo — Qwen2.5-3B Avg@32 jumps 1.07% → 9.64% via tool-RL + iterative evolution.
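
For readers less familiar with the two knobs the Adaptive Decoding entry refers to, here is a minimal sketch of what temperature and top-p do to a next-token distribution. It is not the paper's method: the learned policy that picks these values from the hidden state is not shown, and the function name `apply_temperature_top_p` and the toy logits are assumptions made for illustration.

```python
import math


def apply_temperature_top_p(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature: divide logits before the softmax. Values > 1 flatten the
    # distribution, values < 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy logits over a 3-token vocabulary (hypothetical values).
logits = [2.0, 1.0, 0.0]

# A "focused" setting keeps only the top token; a "broad" one keeps all three.
focused = apply_temperature_top_p(logits, temperature=1.0, top_p=0.5)
broad = apply_temperature_top_p(logits, temperature=2.0, top_p=0.95)
```

A per-step policy, as the entry describes, would choose `temperature` and `top_p` anew at each decoding step instead of fixing them for the whole generation.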

## Synthesis

The line between inference and training is dissolving. The cleanest result is SDPO's: a near-zero-skill self-teacher can still drive learning if its retrospection is converted into a gradient.
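
To make "converted into a gradient" concrete, the sketch below shows generic distillation on a single next-token distribution: the teacher's probabilities (here imagined as the same model's output when its prompt includes feedback) become the target, and the student's logits move toward them. This is a toy under stated assumptions, not SDPO's actual objective; the logits, the learning rate, and the helper names are all hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient-descent step on KL(teacher || softmax(student_logits)).

    The gradient with respect to the student logits is
    (student_probs - teacher_probs), so tokens the teacher prefers are
    pushed up and tokens it disprefers are pushed down.
    """
    student_probs = softmax(student_logits)
    return [
        z - lr * (s - t)
        for z, s, t in zip(student_logits, student_probs, teacher_probs)
    ]


# Hypothetical next-token logits over a 3-token vocabulary.
student_logits = [2.0, 0.5, 0.1]   # model without feedback in context
teacher_logits = [0.5, 2.0, 0.1]   # same model, feedback in context (assumed)

teacher_probs = softmax(teacher_logits)
before = kl_divergence(teacher_probs, softmax(student_logits))
updated_logits = distill_step(student_logits, teacher_probs)
after = kl_divergence(teacher_probs, softmax(updated_logits))
```

Note that the teacher does not need to be right about the final answer for this signal to exist; it only needs to redistribute probability differently from the student once it has seen the feedback.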

## Related

- **Trend 2 — The Critic Bottleneck** (the self-teacher is itself a critic)
- **Trend 10 — Self-Curriculum from Learnability** (AMC-TSI is also a curriculum paper)