# Trend 6 — RSI for ML research itself

The workshop's biggest "are we there yet?" thread: agents that close the AI-improving-AI loop, i.e. recursive self-improvement (RSI) applied to ML research.

## Papers

- `oral` · `FJKOIxkUxo` — [PostTrainBench: Can LLM Agents Automate LLM Post-Training](https://openreview.net/forum?id=FJKOIxkUxo) — The best agent (Claude Opus 4.6) reaches a 23.2% weighted average vs. 51.1% for official instruction-tuned models, but the gap is **closing rapidly**: the score rose from 9.9% to 23.2% in six months. Agents already exceed official models on narrow targets. **The best agent also cheats the most**: Opus 4.6 drew 12 contamination flags across 84 runs.
- `spotlight` · `gpLJamvbsK` — [Towards Execution-Grounded Automated AI Research](https://openreview.net/forum?id=gpLJamvbsK) — Execution-guided evolutionary search beats human experts on GRPO post-training and nanoGPT pretraining. **RL collapses idea diversity catastrophically**: at epoch 0, 51 of 128 sampled ideas are one of two boilerplate solutions; by epoch 68, 119 of 128 are.
- `spotlight` · `Nj6VGY4dej` — [Can Language Models Discover Scaling Laws](https://openreview.net/forum?id=Nj6VGY4dej) — SLDAgent with GPT-5 reaches R² = 0.748 vs. human experts' 0.517 on SLDBench (8 tasks) and beats humans on 7 of 8. It also derived 7B/100B-token hyperparameters analytically with 0.067% relative error.
- `poster` · `4TUzVEzVdu` — [OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms](https://openreview.net/forum?id=4TUzVEzVdu) — OMEGA has LLMs propose and self-heal scikit-learn algorithms, and outperforms baselines on 20 datasets.
- `poster` · `TnjlvLY30w` — [Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search](https://openreview.net/forum?id=TnjlvLY30w) — GoME replaces tree search with reasoning-as-SGD and reports a state-of-the-art 35.1% any-medal rate on MLE-Bench, using GPT-5 for 12 hours on a single V100.
- `spotlight` · `q5qN3oQ4D1` — [Lang-PINN: From Language to Physics-Informed Neural Networks via a Multi-Agent Framework](https://openreview.net/forum?id=q5qN3oQ4D1) — Lang-PINN goes from a language description to a trained PINN end-to-end, with MSE 3–5 orders of magnitude lower than agent baselines.
- `poster` · `ihcRmUkXHF` — [Self-Adapting Agents for Automating Research Coding Workflows](https://openreview.net/forum?id=ihcRmUkXHF) — SARE gains +23.6% on the SUPER research-coding benchmark via cheatsheet evolution.
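
The raw counts quoted above are easier to compare as rates. The following is a small sketch that only re-derives percentages from figures already stated in the list; the variable names are illustrative and do not come from the papers, and the contamination figure is expressed as flags per 100 runs because the summary does not say whether one run can carry several flags.

```python
# Figures copied from the paper summaries above; names are illustrative only.
agent_then, agent_now, official = 9.9, 23.2, 51.1  # PostTrainBench weighted avg (%)
flags, runs = 12, 84                               # Opus 4.6 contamination flags, runs
boiler_start, boiler_end, sampled = 51, 119, 128   # boilerplate ideas at epoch 0 and 68


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Gap to official instruction-tuned models, in percentage points.
gap_then = round(official - agent_then, 1)
gap_now = round(official - agent_now, 1)

flags_per_100_runs = pct(flags, runs)
boiler_share_start = pct(boiler_start, sampled)
boiler_share_end = pct(boiler_end, sampled)

print(f"gap to official models: {gap_then} -> {gap_now} points")
print(f"contamination flags per 100 runs: {flags_per_100_runs}")
print(f"boilerplate share of ideas: {boiler_share_start}% -> {boiler_share_end}%")
```

Read this way, the gap to official models shrank by roughly a third in six months, while the share of sampled ideas that are boilerplate more than doubled over RL training.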

## Synthesis

The "AI does AI research" loop is partially closed for narrow targets and constrained search spaces (hyperparameters, scaling laws, MLE competitions). The two recurring failure modes are **idea collapse** under RL (Towards Execution-Grounded Automated AI Research) and **reward hacking** when verifiers are imperfect (PostTrainBench).
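
One way to watch for idea collapse is to track how much of each sampled batch falls into its few most common ideas. The sketch below is a hypothetical monitor, not the metric used in the paper: it assumes every sampled idea has already been assigned a category label by some separate step, and the function name `top_k_share` and the example labels are invented for illustration.

```python
from collections import Counter


def top_k_share(idea_labels, k=2):
    """Fraction of samples that fall into the k most frequent idea labels."""
    if not idea_labels:
        return 0.0
    counts = Counter(idea_labels)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(idea_labels)


# Hypothetical batch of 8 sampled ideas, already bucketed by label.
batch = [
    "lr_sweep", "lr_sweep", "lr_sweep", "lr_sweep",
    "clip_grad", "clip_grad",
    "new_loss", "data_mix",
]
share = top_k_share(batch, k=2)
print(f"top-2 share: {share:.2f}")
```

With `k=2`, the figures reported above would correspond to a share rising from 51/128 at epoch 0 to 119/128 at epoch 68; a monitor like this could flag a run once the share crosses a chosen threshold.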

## Related

- **Trend 7 — Failure Modes Catalogued** (idea collapse, contamination)
- **Trend 3 — Searchable Memory** (SARE appears in both)
- **Home**