Bootstrapping Task Spaces for Self-Improvement

Minqi Jiang, Andrei Lupu, Yoram Bachrach

TL;DR

This work trains large language models to self-improve over extended inference horizons without training on full multi-step episodes. It introduces Exploratory Iteration (ExIt), a turn-level autocurriculum that builds new self-improvement task instances by selecting informative partial histories from earlier episodes and continuing to iterate from them. Partial histories are prioritized by the variance of their Group-Relative Policy Optimization (GRPO) group returns, and the method can be augmented with self-divergence steps and a diversity bonus. ExIt yields emergent autocurricula and greater task diversity, and improves inference-time self-improvement on competition math, multi-turn tool-use, and ML engineering tasks. Because task-space exploration tracks the model’s evolving capabilities, the resulting policies keep improving beyond the iteration depth seen during training, which is relevant to scaffolds that rely on iterative reasoning.

Abstract

Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
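The sampling loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `TaskInstance`, `TaskBuffer`, `exit_step`, the `rollout_group` callback, the buffer capacity, and the probability `p_new` of drawing a fresh task are all hypothetical. The only detail taken from the source is that partial histories are prioritized by the variance of their GRPO group returns.

```python
import random
from dataclasses import dataclass, field
from statistics import pvariance


@dataclass
class TaskInstance:
    """A starting point for one self-iteration step (hypothetical structure)."""
    base_task: str
    history: tuple = ()    # earlier solution attempts; empty at turn 0
    priority: float = 0.0  # group return variance observed for this lineage

    @property
    def depth(self) -> int:
        return len(self.history)


@dataclass
class TaskBuffer:
    """Bounded buffer of partial histories, sampled in proportion to priority."""
    capacity: int = 128
    items: list = field(default_factory=list)

    def add(self, instance: TaskInstance) -> None:
        self.items.append(instance)
        if len(self.items) > self.capacity:
            # Evict one lowest-priority instance to stay within capacity.
            self.items.remove(min(self.items, key=lambda t: t.priority))

    def sample(self, rng: random.Random) -> TaskInstance:
        weights = [t.priority for t in self.items]
        if sum(weights) <= 0:
            return rng.choice(self.items)  # uniform fallback when all priorities are zero
        return rng.choices(self.items, weights=weights, k=1)[0]


def group_return_variance(returns) -> float:
    """Population variance of the returns within one rollout group."""
    return pvariance(returns)


def exit_step(buffer, base_tasks, rollout_group, rng, p_new=0.5):
    """Run one episode: choose a starting point, roll out a group, grow the buffer."""
    if not buffer.items or rng.random() < p_new:
        start = TaskInstance(base_task=rng.choice(base_tasks))  # new task at turn 0
    else:
        start = buffer.sample(rng)  # continue from a stored partial history
    # rollout_group is assumed to return one single-step attempt and one
    # scalar return per group member.
    attempts, returns = rollout_group(start)
    start.priority = group_return_variance(returns)
    # Extending the history by one attempt yields a new task instance.
    child = TaskInstance(
        base_task=start.base_task,
        history=start.history + (rng.choice(attempts),),
        priority=start.priority,
    )
    buffer.add(child)
    return start, returns
```

In this sketch a child inherits its parent's variance as an initial priority and is rescored the first time it is itself selected; that rule, like the lowest-priority eviction, is a design choice made here for brevity rather than something the source specifies. The policy update on `returns` (for example a GRPO step) is deliberately left out.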

Paper Structure

This paper contains 26 sections, 8 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Simultaneously training with other tasks makes random-acts-of-pizza learnable.
  • Figure 2: Overview of ExIt strategies. Each episode samples a new task (at turn 0) or selects a partial turn history from a previous episode as a starting point for self-iteration (either self-improvement or divergence). Partial histories are sampled by prioritizing those that led to higher GRPO group return variance.
  • Figure 3: Left: Mean accuracy on all math test splits. Center: Mean first-turn accuracy on the multi-turn tool-use test split. Right: Mean total task return on the multi-turn tool-use test split. Results are avg@8 values across 3 training runs per method.
  • Figure 4: Net corrections across held-out math splits, computed over 8 samples per problem per checkpoint and averaged over 3 training runs per method (with an equivalent number of samples per problem for Llama-3.2-3B-Instruct).
  • Figure 5: Left: Emergent curriculum over the sampled history's recency and starting depth during MLE-bench training. Right: Normalized MLE-bench scores achieved by each method over all train and test tasks via increasing greedy-search budgets (mean over 3 training runs).
  • ...and 6 more figures