Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
TL;DR
This paper identifies entropy collapse in RLVR as a bottleneck that suppresses valuable low-probability tokens, or reasoning sparks, which are essential for sustained exploration. It introduces Low-probability Regularization (Lp-Reg), which uses a proxy distribution crafted by filtering out noise tokens and renormalizing the remainder, and applies a forward KL penalty to preserve these sparks without amplifying irrelevant noise. Empirical results on five math benchmarks across Qwen models show that Lp-Reg enables stable on-policy training for around $3{,}000$ steps ($81{,}204$ GPU-hours) and achieves a state-of-the-art average accuracy of $60.17\%$, outperforming prior methods by $2.66$ percentage points. The approach demonstrates that targeting the low-probability tail of next-token distributions, rather than boosting overall entropy, yields robust exploration and practical gains in RLVR for reasoning tasks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term \textbf{\textit{reasoning sparks}}. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of \textit{reasoning sparks} is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy RL, sustaining continuous scaling across $3{,}000$ training steps and $81{,}204$ GPU-hours, where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a $60.17\%$ average accuracy on five math benchmarks, an improvement of $2.66$ percentage points over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
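The proxy construction described above (filter presumed noise tokens, renormalize, then regularize with a KL term) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names `build_proxy` and `kl_divergence`, the relative threshold `kappa * max(probs)`, the value `kappa=0.02`, the toy distributions, and the KL direction (proxy as the target) are all assumptions made here for illustration.

```python
import math


def build_proxy(probs, kappa=0.02):
    """Filter presumed-noise tokens and renormalize the remainder.

    ASSUMPTION: a token counts as noise when its probability falls below
    kappa * max(probs). Both the rule and the value of kappa are
    illustrative, not taken from the paper.
    """
    threshold = kappa * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def kl_divergence(target, policy, eps=1e-12):
    """KL(target || policy), summed over tokens with nonzero target mass."""
    return sum(
        t * math.log(t / max(q, eps))
        for t, q in zip(target, policy)
        if t > 0.0
    )


# Toy next-token distribution over five tokens (hypothetical numbers).
# Index 1 plays the role of a low-probability "reasoning spark";
# indices 3 and 4 play the role of noise.
reference = [0.90, 0.06, 0.03, 0.005, 0.005]
proxy = build_proxy(reference, kappa=0.02)

# A policy that keeps the spark versus one that has nearly eliminated it.
preserved = reference
collapsed = [0.985, 0.001, 0.004, 0.005, 0.005]

penalty_preserved = kl_divergence(proxy, preserved)
penalty_collapsed = kl_divergence(proxy, collapsed)
```

In this toy setting the renormalized proxy assigns the spark token slightly more mass than the original distribution does, and the penalty is larger for the policy that has driven the spark toward zero, which is the shielding effect the abstract describes. In an actual training loop this term would be added to the RL objective with some weight, a detail the sketch leaves out.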
