Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin
TL;DR
This work addresses the challenge of test-time scaling for LLMs by identifying the Shallow Exploration Trap: broad in-context state coverage requires long reasoning trajectories, but longer sequences are exponentially less likely to be sampled. It introduces Length-Incentivized Exploration (LIE), a two-part RL recipe that first raises the upper bound on exploration via a length-based reward and then curbs wasteful repetition with a redundancy penalty, promoting meaningful state coverage within a trajectory. Grounded in count-based exploration theory and an in-context state abstraction (e.g., last-$n$-gram patterns), LIE yields longer, more diverse reasoning paths and translates into measurable gains, averaging $+4.4\%$ in-domain and $+2.7\%$ out-of-domain. Across eight benchmarks and multiple model families, LIE shows robust improvements, effects that scale with continual curriculum training, and favorable shifts in reasoning behaviors such as verification and backtracking, highlighting practical potential for test-time scaling in real-world settings.
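To make the two-part recipe concrete, the following is a minimal sketch of a reward that combines a length bonus with a redundancy penalty over last-$n$-gram states. It is an illustration under stated assumptions, not the paper's implementation: the function names, the linear length bonus capped at `target_len`, the definition of redundancy as the fraction of repeated $n$-gram states, and the two weights are all hypothetical choices made here for clarity.

```python
from typing import List, Sequence, Tuple


def ngram_states(tokens: Sequence[str], n: int = 3) -> List[Tuple[str, ...]]:
    """Map each position to its trailing n-gram, used here as a stand-in
    for an in-context 'state' (an assumed abstraction for illustration)."""
    return [tuple(tokens[i - n + 1 : i + 1]) for i in range(n - 1, len(tokens))]


def redundancy_ratio(tokens: Sequence[str], n: int = 3) -> float:
    """Fraction of n-gram states that repeat an earlier state.
    Returns 0.0 when the sequence is too short to contain any n-gram."""
    states = ngram_states(tokens, n)
    if not states:
        return 0.0
    return 1.0 - len(set(states)) / len(states)


def shaped_reward(
    tokens: Sequence[str],
    correct: bool,
    n: int = 3,
    target_len: int = 64,            # hypothetical length at which the bonus saturates
    length_weight: float = 0.5,      # hypothetical weight on the length bonus
    redundancy_weight: float = 0.5,  # hypothetical weight on the redundancy penalty
) -> float:
    """Task reward plus a capped length bonus minus a repetition penalty.
    The additive form is an assumption, not taken from the paper."""
    length_bonus = min(len(tokens) / target_len, 1.0)
    penalty = redundancy_ratio(tokens, n)
    return float(correct) + length_weight * length_bonus - redundancy_weight * penalty


if __name__ == "__main__":
    varied = "a b c d e f g h".split()
    repeated = ["x"] * 8
    # Same length and correctness, but the repetitive trajectory scores lower.
    print(shaped_reward(varied, True, target_len=8))
    print(shaped_reward(repeated, True, target_len=8))
```

In this sketch, two trajectories of equal length and correctness receive the same length bonus, so only the redundancy term separates them: padding a response with repeated text earns less than covering new $n$-gram states.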
Abstract
Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration (LIE). This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that LIE effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
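One way to see where the exponential decay comes from is the following short derivation. It is a sketch under an assumption introduced here, not the paper's own analysis: suppose that at every step $t$ the model emits the end-of-sequence token with probability $p_t \ge \varepsilon$ for some fixed $\varepsilon > 0$, given that it has not yet stopped. A sampled trajectory $y$ reaches length at least $T$ only if it avoids stopping at each of the first $T-1$ steps, so

$$
P\big(|y| \ge T\big) \;=\; \prod_{t=1}^{T-1} \big(1 - p_t\big) \;\le\; (1-\varepsilon)^{T-1} \;\le\; e^{-\varepsilon (T-1)} .
$$

Under this assumption, the probability of sampling a long trajectory shrinks geometrically in $T$, so trajectories long enough to cover many states are rarely sampled unless training shifts probability mass toward them.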
