Bootstrapping Task Spaces for Self-Improvement
Minqi Jiang, Andrei Lupu, Yoram Bachrach
TL;DR
This work trains large language models to self-improve over extended inference-time horizons without training on full multi-step episodes. It introduces Exploratory Iteration (ExIt), a turn-level autocurriculum that constructs new self-improvement task instances by selecting informative partial histories and expanding them into longer iteration chains, guided by Group-Relative Policy Optimization (GRPO) and augmented with self-divergence and a diversity bonus. ExIt yields emergent autocurricula and greater task diversity, which improve inference-time self-improvement across competition math, multi-turn tool-use, and ML engineering tasks. Because the explored task space tracks the model’s evolving capabilities, the resulting policies keep improving beyond the iteration depth seen in training, which is relevant to scaffolds that rely on iterative reasoning.
Abstract
Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
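The task-space growth described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TaskInstance`, `TaskBuffer`, and `train_step` are hypothetical, "informative" is approximated here by the variance of returns within a rollout group, and the attempt used to extend a history is chosen at random. All three are assumptions made for illustration, and the policy-gradient update itself is omitted.

```python
import random
from dataclasses import dataclass, field
from statistics import pvariance


@dataclass
class TaskInstance:
    """A self-improvement starting point: a base task plus a partial history."""
    task_id: str
    history: tuple = ()   # earlier solution attempts, oldest first
    score: float = 0.0    # priority (assumed: variance of the last group's returns)


@dataclass
class TaskBuffer:
    """Bounded pool of starting points (hypothetical helper)."""
    capacity: int = 8
    items: list = field(default_factory=list)

    def add(self, inst):
        self.items.append(inst)
        # Keep only the highest-priority instances once over capacity.
        self.items.sort(key=lambda t: t.score, reverse=True)
        del self.items[self.capacity:]

    def sample(self, rng):
        # Sample proportionally to priority; the small floor keeps
        # zero-score instances reachable.
        weights = [t.score + 1e-6 for t in self.items]
        return rng.choices(self.items, weights=weights, k=1)[0]


def train_step(buffer, policy, reward_fn, rng, group_size=4):
    """One single-step iteration: sample a start, roll out a group, grow the buffer."""
    start = buffer.sample(rng)
    attempts = [policy(start, rng) for _ in range(group_size)]
    returns = [reward_fn(a) for a in attempts]
    # Assumed informativeness signal: spread of returns within the group.
    start.score = pvariance(returns)
    # Extend the history by one attempt (chosen at random here, an assumption)
    # and store it as a new, deeper task instance.
    chosen = rng.choice(attempts)
    buffer.add(TaskInstance(start.task_id, start.history + (chosen,), start.score))
    return returns
```

In this sketch each training step is a single revision, yet the buffer gradually accumulates starting points with longer histories, so deeper iteration chains are reached without ever rolling out a full multi-step episode.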
