Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou

TL;DR

This paper identifies entropy collapse in RLVR as a bottleneck that suppresses valuable low-probability tokens, or reasoning sparks, which are essential for sustained exploration. It introduces Low-probability Regularization (Lp-Reg), which uses a proxy distribution crafted by filtering out noise tokens and renormalizing the remainder, and applies a forward KL penalty to preserve these sparks without amplifying irrelevant noise. Empirical results on five math benchmarks across Qwen models show that Lp-Reg enables stable on-policy training for around $3{,}000$ steps ($81{,}204$ GPU-hours) and achieves a state-of-the-art average accuracy of $60.17\%$, outperforming prior methods by $2.66$ percentage points. The approach demonstrates that targeting the low-probability tail of next-token distributions, rather than boosting overall entropy, yields robust exploration and practical gains in RLVR for reasoning tasks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy RL, sustaining continuous scaling across $3{,}000$ training steps and $81{,}204$ GPU-hours, where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a $60.17\%$ average accuracy on five math benchmarks, an improvement of $2.66$ percentage points over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
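
The filter-and-renormalize construction described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names `proxy_distribution` and `forward_kl`, the fixed threshold `tau`, and the toy probabilities are assumptions made for illustration, and the KL direction (proxy as the target) is one reading of "forward KL". The official code linked above may define the threshold and the penalty differently.

```python
import numpy as np


def proxy_distribution(probs, tau):
    """Zero out tokens with probability <= tau, then renormalize the rest.

    `tau` is a hypothetical fixed threshold used only for this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    keep = probs > tau
    if not keep.any():
        raise ValueError("threshold removes every token")
    proxy = np.where(keep, probs, 0.0)
    return proxy / proxy.sum()


def forward_kl(proxy, probs):
    """KL(proxy || probs), summed over tokens where the proxy is non-zero."""
    proxy = np.asarray(proxy, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mask = proxy > 0
    return float(np.sum(proxy[mask] * np.log(proxy[mask] / probs[mask])))


# Toy next-token distribution: the last two entries stand in for presumed noise.
probs = np.array([0.70, 0.20, 0.08, 0.015, 0.005])
proxy = proxy_distribution(probs, tau=0.02)
kl = forward_kl(proxy, probs)

print(proxy)  # filtered tokens get zero mass; survivors are scaled up
print(kl)     # equals -log(kept mass) when both come from the same distribution
```

In this toy case the surviving low-probability token (0.08) ends up with a slightly higher probability in the proxy than in the original distribution, which mirrors the abstract's statement that the probability of retained tokens is amplified after renormalization.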

Paper Structure

This paper contains 34 sections, 5 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: Selectively preserving low-probability tokens is key to overcoming performance plateaus in reasoning RL. (a) An illustration of a reasoning spark. (b) Standard GRPO training reaches a performance plateau and collapses, accompanied by decaying entropy. An indiscriminate entropy bonus (GRPO + Entropy Loss) leads to an even faster collapse. (c) We reveal the cause: GRPO systematically suppresses the low-probability sampling of important exploratory tokens (like "wait"), and forces these tokens' sampling distributions to collapse towards high probabilities. Entropy Loss fails to fix this. In contrast, our method, Lp-Reg, successfully preserves a healthy, wide distribution, sustaining exploration. (d) The failure of entropy bonuses is explained by amplifying the low-probability sampling of irrelevant tokens, creating noise, and thereby degrading exploration quality. The aggregated statistics in (c) and (d) demonstrate a systemic effect beyond single-token instances. Detailed plots for individual tokens are available in the appendix section "Details of Sampling Probability Density".
  • Figure 2: An example of probability renormalization. $\pi_{\text{proxy}}$ assigns zero probability to tokens with $\pi_{\boldsymbol{\theta}} \leq \tau$ and renormalizes the probability mass to tokens with $\pi_{\boldsymbol{\theta}} > \tau$.
  • Figure 3: Continuous scaling over $3,000$ training steps, totaling $81,204$ GPU-hours, for Lp-Reg (on-policy) on the Qwen2.5-32B-Base model.
  • Figure 4: Training dynamics on the Qwen3-14B-Base model. On-policy training exhibits better training stability and testing performance compared to off-policy training.
  • Figure 5: Training dynamics on the Qwen3-14B-Base model. To best illustrate the performance differences, we compare the top-performing methods. Lp-Reg demonstrates more stable performance throughout training.
  • ...and 17 more figures
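
The renormalization described in the Figure 2 caption can be written out as a single formula. The following is a sketch reconstructed from that caption only: the conditioning context $c$, the token symbol $o$, and the vocabulary symbol $\mathcal{V}$ are notation assumed here, and the paper's own equation may differ in how $\tau$ is chosen.

```latex
\pi_{\text{proxy}}(o \mid c) =
\begin{cases}
\dfrac{\pi_{\boldsymbol{\theta}}(o \mid c)}
      {\sum_{o' \in \mathcal{V}:\ \pi_{\boldsymbol{\theta}}(o' \mid c) > \tau} \pi_{\boldsymbol{\theta}}(o' \mid c)}
  & \text{if } \pi_{\boldsymbol{\theta}}(o \mid c) > \tau, \\[2ex]
0 & \text{if } \pi_{\boldsymbol{\theta}}(o \mid c) \leq \tau.
\end{cases}
```

Because the denominator is at most $1$, every token that survives the filter has a proxy probability at least as large as its original probability.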