Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Futing Wang; Jianhao Yan; Yun Luo; Ganqu Cui; Zhi Wang; Xiaoye Qu; Yue Zhang; Yu Cheng; Tao Lin

TL;DR

This work addresses the challenge of test-time scaling for LLMs by identifying the Shallow Exploration Trap: broad in-context state coverage requires long reasoning trajectories, but the probability of sampling a sequence decays exponentially with its length. It introduces Length-Incentivized Exploration (LIE), a two-part RL recipe that first raises the upper bound on exploration via a length-based reward and then curbs wasteful repetition with a redundancy penalty, promoting meaningful state coverage within a trajectory. Grounded in count-based exploration theory and an in-context state abstraction (e.g., last-$n$-gram patterns), LIE yields longer, more diverse reasoning paths and translates into measurable gains, averaging $+4.4\%$ in-domain and $+2.7\%$ out-of-domain across several models. Across eight benchmarks and multiple model families, LIE demonstrates robust improvements, scalable effects with continual curriculum training, and favorable shifts in reasoning behaviors such as verification and backtracking, highlighting practical potential for test-time scaling in real-world settings.

Abstract

Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration (LIE). This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that LIE effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
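
To make the "length-based reward coupled with a redundancy penalty" idea concrete, here is a minimal Python sketch. It is not the paper's formulation: the additive reward shape, the coefficients `alpha` and `beta`, the `target_len` cap, and the function names are all hypothetical. It assumes the last-$n$-gram state abstraction mentioned in the TL;DR, uses the number of distinct $n$-grams as a stand-in for state coverage, and uses the distinct-to-total ratio as a stand-in for redundancy.

```python
from collections import Counter


def ngram_states(tokens, n=3):
    """Map each position to the n-gram ending there (assumed state abstraction)."""
    return [tuple(tokens[i - n + 1:i + 1]) for i in range(n - 1, len(tokens))]


def coverage_stats(tokens, n=3):
    """Return (distinct state count, distinct/total ratio) for one trajectory."""
    states = ngram_states(tokens, n)
    counts = Counter(states)
    distinct = len(counts)
    total = len(states)
    # With no states at all there is nothing redundant, so the ratio is set to 1.0.
    ratio = distinct / total if total else 1.0
    return distinct, ratio


def shaped_reward(task_reward, tokens, n=3, target_len=8, alpha=0.1, beta=0.1):
    """Hypothetical shaping: task reward + capped length bonus - redundancy penalty."""
    _, ratio = coverage_stats(tokens, n)
    length_bonus = alpha * min(len(tokens) / target_len, 1.0)
    redundancy_penalty = beta * (1.0 - ratio)
    return task_reward + length_bonus - redundancy_penalty


if __name__ == "__main__":
    diverse = list("abcdefghi")     # every 3-gram is new
    repetitive = list("abcabcabc")  # same length, but only 3 distinct 3-grams
    print(coverage_stats(diverse), coverage_stats(repetitive))
    print(shaped_reward(1.0, diverse, target_len=9),
          shaped_reward(1.0, repetitive, target_len=9))
```

Under these assumptions, two trajectories of equal length earn the same length bonus, and the one that loops over the same patterns is scored lower. This is the two-step intent the abstract describes: first reward length, then penalize length that adds no new states.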

Paper Structure (63 sections, 5 theorems, 21 equations, 17 figures, 7 tables)

Introduction
Related Work
Scaling Test-Time Compute via Long CoT.
Length-Aware Reasoning.
Background
MDP Formulation of LLM Reasoning
LLM Reasoning as an MDP.
Reinforcement Learning for LLM Reasoning.
Theoretical Foundation: Incentivizing State Coverage via Count-Based Exploration
Theoretical Guarantees for Exploration.
In-Context Exploration
In-Context State Space
Defining in-context states.
State abstraction.
Quantifying in-context exploration.
...and 48 more sections

Key Result

Theorem 3.1

In a multi-armed bandit (MAB) setting, let $L(T)=\mathbb{E}\left[\sum_{t=1}^T\left(R^*-R(a_t)\right)\right]$ denote the total regret over $T$ steps, where $\mathcal{R}^a$ is the reward distribution of action $a$, $R(a)=\mathbb{E}_{\mathcal{R}^a}[R]$ is its expected reward, $R^*$ is the expected reward of the optimal action, and $\Delta_a$ is the reward gap between action $a$ and the optimal action.
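
For intuition about the count-based exploration this theorem refers to, the sketch below runs UCB1, a standard count-based bandit rule, on Bernoulli arms and accumulates the regret summand $R^* - R(a_t)$ from the definition above. The arm means, horizon, seed, and the function name `ucb1` are illustrative assumptions rather than values from the paper, and the paper's exact bound is not reproduced here.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return (pull counts, accumulated regret)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # N(a): how often each arm has been pulled
    sums = [0.0] * k   # total observed reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1  # pull every arm once so all counts are positive
        else:
            # Empirical mean plus a bonus that shrinks as the visit count grows.
            a = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        regret += best - arm_means[a]  # the summand R* - R(a_t)
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.8], horizon=2000)
    print("pulls per arm:", counts)
    print("accumulated regret:", regret)
```

The bonus term depends only on visit counts, so rarely tried arms keep receiving some exploration. The same principle motivates rewarding coverage of rarely visited in-context states.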

Figures (17)

Figure 1: The difference between In-Context Exploration and Training Exploration. Our framework distinguishes between the exploration of the training process and in-context inference. In the training phase, reinforcement learning incentivizes the model to explore and learn from diverse state distributions. In contrast, during test-time inference, in-context exploration empowers the model to actively traverse and navigate states.
Figure 2: The Length Bottleneck of In-Context Exploration. (a) Capacity: Trajectory length dictates the maximum possible state coverage (Proposition 4.1). (b) "Shallow Exploration Trap": The probability of reaching deep states decays exponentially (Lemma 4.2), preventing the model from utilizing this capacity.
Figure 3: The training dynamics of $C_{\text{context}}$ and $R_{\text{context}}$ in GRPO and GSPO on Qwen3-4B-Base. Two limitations are observed: (1) Shallow Exploration Trap: GRPO faces bottlenecks in extending trajectory length and performance, while GSPO shows slow length expansion. (2) Degrading Information Density: Both methods display degradation in the ratio over time.
Figure 4: $C_{\text{context}}$, $R_{\text{context}}$, response length, and performance on the validation dataset, comparing the GSPO baseline and our recipe.
Figure 5: Test-time extrapolation performance. While standard baselines saturate or degrade when forced beyond their learned policy length, the Length-Incentivized Exploration recipe exhibits a superior scaling curve.
...and 12 more figures
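
The exponential decay shown in Figure 2(b) can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the policy ends generation with a constant probability `stop_prob` at every step; the paper's lemma may be stated under different conditions, and the function name is hypothetical.

```python
def prob_reach_depth(stop_prob, depth):
    """Chance that a trajectory is still running after `depth` steps,
    assuming an independent stop probability at every step (toy model)."""
    return (1.0 - stop_prob) ** depth


if __name__ == "__main__":
    for depth in (100, 1000, 10000):
        print(depth, prob_reach_depth(0.01, depth))
```

Even with a 1% stop probability per step, reaching depth 100 has probability of roughly 0.37 and reaching depth 1000 has probability of roughly 4e-5. Under this toy model, deep states are almost never sampled without an explicit incentive for length.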

Theorems & Definitions (8)

Theorem 3.1: Optimality of Count-based Exploration (Auer, 2002)
Remark 3.2
Proposition 4.1: Length as the Capacity for Exploration
Lemma 4.2: Exponential Decay of Long Sequences
Remark 4.3: The Exploration-Length Conflict
Theorem 1.1
Lemma 1.2: Exponential Decay of Long Sequences
proof
