# Trend 2 — The bottleneck is the critic, not the generator

Across many of these papers, the policy model can solve the task once it has the right reward signal, and a generic prompted LLM judge does not provide that signal. Several workshop papers replace prompted judges with **co-evolved or learned rubrics**, and some report large gains from doing so (for example, 3.2 → 8.0+ on slide aesthetics with the PresAesth critic).

## Papers

- `spotlight` · `aA2PXFH2Cp` — [Self-Evolving Rubrics: Interpretable Instance-Level Criteria for Scalable RL](https://openreview.net/forum?id=aA2PXFH2Cp) — rubric generator trained with a discriminative reward. **Counterintuitive headline: a 0.6B frozen judge produces better downstream policies (70.0 avg) than a 14B judge (66.7).** Hypothesis: small judges force solution-embedded rubrics.
- `spotlight` · `d40v7Qcpi4` — [Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations](https://openreview.net/forum?id=d40v7Qcpi4) — for academic-slide aesthetics, *initial generation quality does not predict self-correction ability*. Replacing the prompted critic with the multi-task-GRPO-trained PresAesth critic lifts the weaker base model from 3.2 to 8.0+ in 3 iterations.
- `poster` · `WA6q2pNQhj` — [Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision](https://openreview.net/forum?id=WA6q2pNQhj) — synthesize a pseudo-reference from G parallel rollouts and train against it. Up to +30% over initial policy on HealthBench at 9× less test-time compute.
- `poster` · `s06wgoO65a` — [MAPPA: Scaling Multiagent Systems with Process Rewards](https://openreview.net/forum?id=s06wgoO65a) — per-action process rewards from an LLM coach lift AIME by +5–17.5pp.
- `poster` · `8hYSvUpJBA` — [Self-Improving VLM Judges Without Human Annotations](https://openreview.net/forum?id=8hYSvUpJBA) — Llama-3.2-11B judge improves 0.38 → 0.51 on VL-RewardBench (beats Llama-3.2-90B and Claude 3.5 Sonnet) with zero human annotations.
- `poster` · `Hv5hDfbhuB` — [Beyond Solving: A Closer Look at LLMs as Solution Verifiers](https://openreview.net/forum?id=Hv5hDfbhuB) — across 37 verifiers: **cross-family verification beats self- and intra-family**. Post-training tends to weaken self-verification but strengthen cross-family verification.
- `poster` · `uXTXcZ615k` — [Differentiable Evolutionary Reinforcement Learning](https://openreview.net/forum?id=uXTXcZ615k) — DERL evolves the reward function itself with a meta-optimizer; SOTA on ALFWorld/ScienceWorld.
- `poster` · `38t39AFZPE` — [Duel-Evolve: Pairwise Preference Black-Box Optimization of LLM Responses](https://openreview.net/forum?id=38t39AFZPE) — uses only pairwise preferences, modeled with a Bayesian Bradley-Terry model; 94% on MathBench (+20pp), 37% on LiveCodeBench (+11pp).
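
For readers unfamiliar with the Bradley-Terry model mentioned in the Duel-Evolve entry, here is a minimal sketch of how pairwise preferences can be turned into per-candidate strengths. It is a plain maximum-likelihood fit using the standard minorization-maximization update with simple additive smoothing, not the Bayesian procedure from the paper. The function names (`fit_bradley_terry`, `win_prob`) and the `prior` parameter are assumptions made for illustration.

```python
def fit_bradley_terry(n_items, duels, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    `prior` adds that many virtual wins in each direction between every
    pair of items. This is an illustrative smoothing choice (an assumption),
    not the Bayesian treatment used by Duel-Evolve.
    """
    # wins[i][j] = number of times item i beat item j (plus smoothing)
    wins = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser in duels:
        wins[winner][loser] += 1.0
    for i in range(n_items):
        for j in range(n_items):
            if i != j:
                wins[i][j] += prior

    # Start from uniform strengths and apply the MM update repeatedly:
    #   p_i <- W_i / sum_j n_ij / (p_i + p_j)
    # where W_i is i's total wins and n_ij the number of i-vs-j duels.
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_items)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # keep strengths summing to 1
    return p


def win_prob(p, i, j):
    """Model probability that item i is preferred over item j."""
    return p[i] / (p[i] + p[j])


if __name__ == "__main__":
    # Hypothetical judge outcomes over three candidate responses.
    duels = (
        [(0, 1)] * 8 + [(1, 0)] * 2
        + [(1, 2)] * 7 + [(2, 1)] * 3
        + [(0, 2)] * 9 + [(2, 0)] * 1
    )
    strengths = fit_bradley_terry(3, duels)
    print(strengths, win_prob(strengths, 0, 2))
```

The fitted strengths give a ranking of candidates from preference comparisons alone, which is the property that lets a pairwise judge stand in for an absolute scoring function.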

## Synthesis

Taken together, these papers suggest that how well RSI loops scale depends more on reward-signal quality than on generator capability. The "small-judge advantage" finding is one of the workshop's cleanest counterintuitive results. It implies that the right move for new RSI loops is to **co-train a small specialized critic rather than prompt a frontier judge**.

## Related

- **Trend 7 — Failure Modes Catalogued** (reward hacking, verifier corruption)
- **Trend 6 — RSI for ML Research** (search benefits from better verifiers)
- **Home**