# Trend 1 — Self-play, but only with an Anchor

Asymmetric self-play (one role generates problems, the other solves them) is the most popular RSI primitive in the workshop. The bigger story, though, is that **vanilla self-play silently drifts**, and the main contribution of most papers below is the anchor that prevents that drift.

## Papers

- `oral` · `hYYeOl58xi` — [Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning](https://openreview.net/forum?id=hYYeOl58xi) — (oral) curriculum + executor, both Qwen3-8B copies, trained from zero data. Reward includes uncertainty (target self-consistency ≈ 0.5), tool-use, and a BLEU-based diversity penalty. +18% math, +24% general reasoning over base.
- `spotlight` · `uB8YQHNsh6` — [Language Self-Play For Data-Free Training](https://openreview.net/forum?id=uB8YQHNsh6) — (spotlight) single Llama-3.2-3B split into Challenger/Solver via prompt token. Without an explicit self-rubric quality reward `R_Q`, training collapsed into Python-class gibberish.
- `spotlight` · `NYrOkAfDkP` — [GASP: Guided Asymmetric Self-Play For Coding LLMs](https://openreview.net/forum?id=NYrOkAfDkP) — (spotlight) pure asymmetric self-play (Absolute Zero) is goal-agnostic. GASP grounds it in 146 hard real LeetCode goalposts and a two-stage lemma→lift curriculum.
- `spotlight` · `lTbBFAoPSA` — [Anchored Self-Play for Code Repair](https://openreview.net/forum?id=lTbBFAoPSA) — (spotlight) generator-fixer self-play on code repair *regresses* on human-authored bugs. Anchoring with embedding similarity to 900 reference bugs flips the regression: 36.1% vs 29.1%.
- `spotlight` · `ecKAmz5vlO` — [ACE: Self-Evolving LLM Coding Framework Adversarial Unit Test Generation and Preference Optimization](https://openreview.net/forum?id=ecKAmz5vlO) — (spotlight) self-evolving solver + adversarial test generator (KTO). Verifier-test accuracy plateaus at round 4 while adversarial pressure keeps growing.
- `poster` · `b3dPMokQki` — [SAGE: Self-play Adversarial Games Enhance Large Language Model Reasoning Capabilities](https://openreview.net/forum?id=b3dPMokQki) — Setter rewarded only for problems it can solve but Opponent cannot (asymmetric difficulty band). +10% MATH, +8% MBPP.
- `poster` · `rL0GEyoMvE` — [Your Self-Play Algorithm is Secretly an Adversarial Imitator](https://openreview.net/forum?id=rL0GEyoMvE) — proves SPIN/SPPO/INPO are equivalent to adversarial imitation learning; derives a chi²-divergence variant.
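
Two of the reward shapes mentioned in the list above can be sketched in a few lines: an uncertainty reward that peaks when the solver's self-consistency is near 0.5 (as in the Agent0 entry), and a setter reward that pays out only when the setter solves a problem the opponent cannot (as in the SAGE entry). This is a minimal sketch, not either paper's implementation. The function names, the linear "tent" shape, and the use of majority-vote agreement as the self-consistency measure are all assumptions made for illustration.

```python
from collections import Counter


def self_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def uncertainty_reward(answers, target=0.5):
    """Hypothetical tent-shaped reward (an assumption, not from the paper).

    Returns 1.0 when self-consistency equals `target` and falls linearly
    to 0.0 once self-consistency is 0.5 or more away from it.
    """
    p = self_consistency(answers)
    return max(0.0, 1.0 - 2.0 * abs(p - target))


def setter_reward(setter_solved, opponent_solved):
    """Hypothetical binary reward: pay the setter only for problems
    it can solve itself but the opponent cannot."""
    return 1.0 if setter_solved and not opponent_solved else 0.0


if __name__ == "__main__":
    # Solver samples split evenly: the problem sits at the frontier.
    print(uncertainty_reward(["12", "12", "7", "9"]))
    # Solver always agrees with itself: the problem is too easy (or hopeless).
    print(uncertainty_reward(["12", "12", "12", "12"]))
    print(setter_reward(setter_solved=True, opponent_solved=False))
```

Both rewards are computed entirely from the models' own outputs, which is why they shape difficulty but cannot, on their own, keep the generated problems tied to any external distribution.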

## Synthesis

Anchors come in three forms: (a) a real-data goalpost set (GASP), (b) a self-rubric quality reward (Language Self-Play, LSP), or (c) embedding similarity to a reference distribution (Anchored Self-Play, ASP). The shared lesson: **a closed self-play loop will optimize the proxy, so you need a non-self signal to stop the slide.**
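
To make form (c) concrete, here is a minimal sketch of an embedding-similarity anchor: the self-play reward for a generated item is blended with its maximum cosine similarity to a set of reference embeddings. The helper names, the max-over-references choice, and the linear blend with `weight` are assumptions for illustration and are not taken from the Anchored Self-Play paper. Embeddings are plain lists of floats here; in practice they would come from whatever encoder is in use.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def anchor_score(candidate, references):
    """Similarity of the candidate to its nearest reference embedding."""
    if not references:
        raise ValueError("need at least one reference embedding")
    return max(cosine(candidate, ref) for ref in references)


def anchored_reward(self_play_reward, candidate, references, weight=0.5):
    """Hypothetical blend: (1 - weight) * self-play reward + weight * anchor.

    `weight` controls how strongly the non-self signal pulls generated
    items back toward the reference distribution.
    """
    anchor = anchor_score(candidate, references)
    return (1.0 - weight) * self_play_reward + weight * anchor


if __name__ == "__main__":
    reference_bugs = [[1.0, 0.0], [0.8, 0.6]]  # toy reference embeddings
    on_distribution = [0.9, 0.1]
    drifted = [0.0, -1.0]
    # Same self-play reward, different anchored reward.
    print(anchored_reward(1.0, on_distribution, reference_bugs))
    print(anchored_reward(1.0, drifted, reference_bugs))
```

With `weight=0.0` this reduces to the closed loop; any positive weight means an item that scores well inside the loop but sits far from every reference is paid less than one that stays near the reference set.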

## Related

- **Trend 7 — Failure Modes Catalogued** (ASP regression, model collapse)
- **Trend 10 — Self-Curriculum from Learnability** (Agent0, GASP, SAGE all use `p ≈ 0.5` curricula)
- **Home**