Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin

TL;DR

This work addresses the challenge of test-time scaling for LLMs by identifying the Shallow Exploration Trap: broad in-context state coverage requires long reasoning trajectories, but longer sequences are exponentially unlikely to be sampled. It introduces Length-Incentivized Exploration (LIE), a two-part RL recipe that first raises the upper bound on exploration via a length-based reward and then curbs wasteful repetition with a redundancy penalty, promoting meaningful state coverage within a trajectory. Grounded in count-based exploration theory and an in-context state abstraction (e.g., last-$n$-gram patterns), LIE yields longer, more diverse reasoning paths and translates into measurable gains, averaging $+4.4\%$ in-domain and $+2.7\%$ out-of-domain on several models. Across eight benchmarks and multiple model families, LIE demonstrates robust improvements, scalable effects with continual curriculum training, and favorable shifts in reasoning behaviors such as verification and backtracking, highlighting practical potential for test-time scaling in real-world settings.

Abstract

Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration (LIE). This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that LIE effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
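
The recipe described above can be illustrated with a small sketch. It is not the paper's implementation: the function names, the choice of $n = 3$, the length cap `max_len`, and the coefficients `alpha` and `beta` are all assumptions made for illustration. It treats each last-$n$-gram as an in-context "state", counts the distinct states in a trajectory, and combines a length bonus with a redundancy penalty.

```python
from typing import List, Sequence, Tuple


def ngram_states(tokens: Sequence[str], n: int = 3) -> List[Tuple[str, ...]]:
    """Map each position to its last-n-gram 'state' (hypothetical abstraction)."""
    return [tuple(tokens[i - n + 1 : i + 1]) for i in range(n - 1, len(tokens))]


def coverage_stats(tokens: Sequence[str], n: int = 3) -> Tuple[int, float]:
    """Return (number of distinct states, distinct / total ratio)."""
    states = ngram_states(tokens, n)
    if not states:
        return 0, 0.0
    distinct = len(set(states))
    return distinct, distinct / len(states)


def shaped_reward(
    tokens: Sequence[str],
    correct: bool,
    n: int = 3,
    max_len: int = 64,    # assumed cap on the length bonus
    alpha: float = 0.5,   # assumed weight of the length bonus
    beta: float = 0.5,    # assumed weight of the redundancy penalty
) -> float:
    """Toy reward: task reward + length bonus - redundancy penalty."""
    _, ratio = coverage_stats(tokens, n)
    length_bonus = alpha * min(len(tokens), max_len) / max_len
    redundancy_penalty = beta * (1.0 - ratio)
    return float(correct) + length_bonus - redundancy_penalty


if __name__ == "__main__":
    looping = "a b c a b c a b c".split()
    diverse = "a b c d e f g h i".split()
    print(coverage_stats(looping), shaped_reward(looping, True))
    print(coverage_stats(diverse), shaped_reward(diverse, True))
```

Under these assumptions, two trajectories of equal length earn the same length bonus, but the one that keeps revisiting the same $n$-grams is penalized, so the trajectory that covers more distinct states scores higher.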

Paper Structure (63 sections, 5 theorems, 21 equations, 17 figures, 7 tables)

This paper contains 63 sections, 5 theorems, 21 equations, 17 figures, 7 tables.

Key Result

Theorem 3.1

In a multi-armed bandit (MAB) setting, let $L(T)=\mathbb{E}\left[\sum_{t=1}^T\left(R^*-R(a_t)\right)\right]$ denote the total regret over $T$ steps, where $R(a)=\mathbb{E}_{\mathcal{R}^a}[R]$ is the expected reward for any action $a$, $\mathcal{R}^a$ is the reward distribution for action $a$, $R^*$ is the expected reward of the optimal action, and $\Delta_a = R^* - R(a)$ is the reward gap between action $a$ and the optimal action.
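
To make the regret quantity concrete, the following sketch runs UCB1, a familiar count-based exploration rule, on a Bernoulli bandit and accumulates $\sum_{t=1}^T (R^* - R(a_t))$. The choice of UCB1, the Bernoulli rewards, and the function name `ucb1_regret` are assumptions for illustration; the source excerpt does not say which algorithm the theorem covers.

```python
import math
import random
from typing import List


def ucb1_regret(means: List[float], horizon: int, seed: int = 0) -> float:
    """Run UCB1 on a Bernoulli bandit; return the pseudo-regret sum_t (R* - R(a_t))."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # N(a): number of pulls of each arm
    sums = [0.0] * k      # cumulative observed reward of each arm
    best = max(means)     # R*: expected reward of the optimal arm
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once to initialize the counts
        else:
            # Empirical mean plus a bonus that shrinks as the arm's count grows.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # adds the gap Delta_a of the chosen arm
    return regret


if __name__ == "__main__":
    print(ucb1_regret([0.2, 0.8], horizon=2000))
```

Each step adds the gap $\Delta_a$ of the chosen arm, so the total regret is the sum of the gaps weighted by how often each suboptimal arm is pulled.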

Figures (17)

  • Figure 1: The difference between In-Context Exploration and Training Exploration. Our framework distinguishes between the exploration of the training process and in-context inference. In the training phase, reinforcement learning incentivizes the model to explore and learn from diverse state distributions. In contrast, during test-time inference, in-context exploration empowers the model to actively traverse and navigate states.
  • Figure 2: The Length Bottleneck of In-Context Exploration. (a) Capacity: Trajectory length dictates the maximum possible state coverage (Proposition 4.1). (b) "Shallow Exploration Trap": The probability of reaching deep states decays exponentially (Lemma 4.2), preventing the model from utilizing this capacity.
  • Figure 3: The training dynamics of $C_{\text{context}}$ and $R_{\text{context}}$ in GRPO and GSPO on Qwen3-4B-Base. Two limitations are observed: (1) Shallow Exploration Trap: GRPO faces bottlenecks in extending trajectory length and performance, while GSPO shows slow length expansion. (2) Degrading Information Density: Both methods display degradation in ratio over time.
  • Figure 4: $C_{\text{context}}$, $R_{\text{context}}$, response length, and performance on the validation dataset for the GSPO baseline and our recipe.
  • Figure 5: Test-time extrapolation performance. While standard baselines saturate or degrade when forced beyond their learned policy length, the Length-Incentivized Exploration recipe exhibits a superior scaling curve.
  • ...and 12 more figures

Theorems & Definitions (8)

  • Theorem 3.1: Optimality of Count-based Exploration (Auer, 2002)
  • Remark 3.2
  • Proposition 4.1: Length as the Capacity for Exploration
  • Lemma 4.2: Exponential Decay of Long Sequences
  • Remark 4.3: The Exploration-Length Conflict
  • Theorem 1.1
  • Lemma 1.2: Exponential Decay of Long Sequences
  • proof
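
The exponential decay named in Lemma 4.2 can be illustrated numerically. The sketch below assumes a constant per-token stopping probability `eos_prob`, a simplification that the source excerpt does not state; under it, the chance that a sampled sequence reaches length $L$ is $(1 - \text{eos\_prob})^L$. The function names are hypothetical.

```python
import random


def survival_probability(eos_prob: float, length: int) -> float:
    """P(sequence reaches `length` tokens) under a constant stop probability."""
    return (1.0 - eos_prob) ** length


def sample_length(eos_prob: float, rng: random.Random, max_len: int = 10_000) -> int:
    """Sample a sequence length by drawing a stop/continue decision per token."""
    length = 0
    while length < max_len and rng.random() >= eos_prob:
        length += 1
    return length


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = [sample_length(0.01, rng) for _ in range(5000)]
    for target in (100, 500, 1000):
        empirical = sum(n >= target for n in lengths) / len(lengths)
        print(target, survival_probability(0.01, target), empirical)
```

Even with a stopping probability of only 1% per token, sequences of 1000 tokens are sampled far less often than sequences of 100 tokens, which is the gap a length-based reward is meant to counteract.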