e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar

TL;DR

This work tackles extrapolating test-time compute in LLMs by enabling in-context exploration through a three-part recipe (e3): leveraging base-model asymmetries to enable chaining of reasoning steps, using negative gradients in RL to promote longer, more diverse traces, and employing a coupled data-budget curriculum to structure exploration. The approach is demonstrated with a sub-$2$B model (Qwen3-1.7B) achieving state-of-the-art results on AIME'25 and HMMT'25 and capable of extrapolating to $\approx 2\times$ the training budget. Through theoretical (p^k) and empirical analyses, the authors show that negative gradients drive structured exploration and that coupling curriculum design with difficulty and budget is key to effective extrapolation. The results suggest that proper exploration dynamics, rather than mere scaling or prompt-based forcing, are crucial for unlocking extrapolation in reasoning tasks at test time.

Abstract

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, and refinement), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with the training token budget via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
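The abstract reports pass@1 and pass@k but does not say how they are computed. As a minimal sketch, the snippet below uses the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ responses are sampled per problem and $c$ of them are correct. The function name `pass_at_k` and the sample counts are illustrative assumptions, and the authors' evaluation may differ in its details.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Assumption: this is the standard combinatorial estimator, not
    necessarily the exact evaluation code used for the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples per problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With $k = 1$ the estimator reduces to the fraction of correct samples, $c / n$. A per-benchmark score would then be the mean of this quantity over problems.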

Paper Structure

This paper contains 27 sections, 4 theorems, 34 equations, 17 figures, 5 tables.

Key Result

Theorem 5.1

At state $\boldsymbol{s}$, if the most likely action under $\pi$ is $a_1 := \operatorname*{arg\,max}_{a'} \pi(a' \mid \boldsymbol{s}) \neq a^\star$, then, for any $\pi$, a negative stochastic gradient step increases the entropy of $\pi(\cdot \mid \boldsymbol{s})$ with prob. $\geq \pi \ldots$
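The effect described in Theorem 5.1 can be checked numerically in a toy setting. The sketch below assumes a tabular softmax policy over three actions at a single state, a reward of $-1$ for an incorrect action, and a plain REINFORCE update on the logits. The logits, learning rate, and function names are all hypothetical, and this illustrates the statement rather than reproducing the paper's proof. It samples the most likely (incorrect) action $a_1$, applies one negative gradient step, and compares the entropy before and after.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_step(logits, action, reward, lr):
    """One stochastic policy-gradient step on softmax logits.

    Uses d/d(logit_i) log pi(action) = 1[i == action] - pi_i.
    A negative reward pushes probability away from `action`.
    """
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical setup: action 0 is the most likely action a_1,
# and the correct action a* is the unlikely action 2.
logits = [2.0, 1.0, 0.0]
sampled_action = 0  # the sampled action is a_1, which is incorrect

updated = reinforce_step(logits, sampled_action, reward=-1.0, lr=0.1)

h_before = entropy(softmax(logits))
h_after = entropy(softmax(updated))
print(f"entropy before: {h_before:.4f}, after: {h_after:.4f}")
```

In this example the step moves probability mass from $a_1$ to the other actions, so the distribution becomes flatter and its entropy rises. Whether the same holds for other sampled actions or large step sizes depends on the conditions of the formal statement.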

Figures (17)

  • Figure 1: In-context exploration enables extrapolation of test-time compute (e3). (a) The three ingredients: (i) chaining asymmetric capabilities of the base model, e.g., reliably self-verifying responses after generating them; (ii) lengthening model responses by chaining more asymmetries until the correct answer is discovered, using the "negative" part of the RL policy gradient generated from incorrect responses; and (iii) coupling data and budget curricula for RL training, which structures exploration by sequentially training models on different datasets and training compute budgets. (b) Qwen3-1.7B fine-tuned with e3 outperforms <2B models on AIME'25 and HMMT'25, and even some larger 7B/32B models (see full results in Tab. \ref{tab:passk} and Fig. \ref{fig:curriculum-panel-1}).
  • Figure 2: Accuracy of various open-source models at different budgets on AIME 2025. Performance gains diminish as the test-time budget increases, with virtually no gains from 16k to 32k.
  • Figure 3: Measuring asymmetry (Def. \ref{def:asymmetry}) & pass@k on Cdown. Pass@k improves more for all $k$ as the number of chained asymmetries increases in a trace from Llama3.2-3B.
  • Figure 4: RL training with and without asymmetries in the base model. When asymmetries such as the VG gap are present (e.g., in Cdown), RL training amplifies response length by chaining more asymmetries to explore in-context, where the probability of success improves with higher length in both the $B_\mathrm{tr}$ and extrapolation regimes. On the other hand, when the VG gap is absent in $\pi_b$ (e.g., in Mult), increases in length and extrapolation performance are subdued. When we explicitly train on a base model fine-tuned to verify Mult (a setting we refer to as Mult-V), we again observe upward length and extrapolation trends, consistent with Cdown.
  • Figure 5: RL training with and without negative gradients. When the base model admits asymmetries, negative gradients promote in-context exploration by: (i) increasing length (c) and chaining asymmetries, which shows up as more verification attempts (b); and (ii) increasing token entropy and thus response diversity (d). This leads to better performance on the training budget and upon extrapolation. In (b, c), $\text{✓}$ denotes the statistic computed on correct responses and $\text{✗}$ on incorrect responses.
  • ...and 12 more figures

Theorems & Definitions (8)

  • Definition 3.1: Chaining asymmetric capabilities $p, q$ in model $\pi$.
  • Theorem 5.1: Negative gradient increases entropy when $a^\star$ is unlikely; formal version in Thm. \ref{thm:neg-gradient-entropy-formal}
  • Lemma E.1: Entropy gradient for the softmax bi-gram conditional
  • proof
  • Lemma E.2: Policy gradient for the conditional distribution
  • proof
  • Theorem E.3: Negative gradient increases $H(M; \boldsymbol{s})$ when $p(a^\star | \boldsymbol{s})$ is low
  • proof