Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, Guorui Zhou

TL;DR

This work tackles entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) caused by coarse, uniform credit assignment across tokens. It introduces UnCertainty-aware Advantage Shaping (UCAS), a model-free method that leverages internal uncertainty signals at both the trajectory (response) and token levels to reshape the learning signal. UCAS uses a two-stage process: Stage 1 modulates trajectory-level advantage with response-level self-confidence, and Stage 2 applies a token-level certainty penalty derived from raw logits, yielding a final advantage $\hat{A}^{\text{UCAS}}_{i,t}$. Across five mathematical reasoning benchmarks and two model scales (1.5B and 7B), UCAS delivers substantial gains, promotes reasoning diversity, and mitigates entropy collapse, demonstrating robust improvements over strong RLVR baselines and enabling deeper, more exploratory reasoning without costly reward models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model's internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model's overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.
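
The two stages described above can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract gives no equations, so the self-confidence proxy (mean token log-probability), the `tanh`-based modulation weight, the per-token `token_certainty` scores, and the coefficients `alpha` and `beta` are all assumptions chosen only to reproduce the qualitative behavior stated in the text. The group-normalized baseline is the standard GRPO advantage.

```python
import math
from statistics import mean, pstdev


def standardize(values, eps=1e-6):
    """Zero-mean, unit-variance scaling within one sampled group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def shaped_advantages(rewards, token_logprobs, token_certainty,
                      alpha=0.5, beta=0.05):
    """Return one shaped advantage per token for each response in a group.

    rewards[i]            verifiable reward of response i
    token_logprobs[i][t]  log-probability of the sampled token t (assumed input)
    token_certainty[i][t] hypothetical certainty score in [0, 1] derived
                          from the raw logits of token t (assumed input)
    """
    # GRPO baseline: a single group-normalized advantage per response.
    base = standardize(rewards)
    # Assumed self-confidence proxy: mean log-probability of each response,
    # standardized within the group.
    confidence = standardize([mean(lp) for lp in token_logprobs])

    shaped = []
    for adv, conf, certainty in zip(base, confidence, token_certainty):
        sign = (adv > 0) - (adv < 0)
        # Stage 1 (assumed form): amplify correct low-confidence responses
        # and incorrect high-confidence ones; with alpha < 1 the weight
        # stays positive, so the sign of the advantage is preserved.
        weight = 1.0 - alpha * sign * math.tanh(conf)
        # Stage 2 (assumed form): subtract a penalty that grows with the
        # certainty of each individual token.
        shaped.append([adv * weight - beta * c for c in certainty])
    return shaped
```

In this sketch, a correct response generated with low confidence receives a larger positive advantage than an equally correct but confident one, and a confident wrong response is penalized more than a hesitant wrong one. Within a response, tokens with higher certainty scores receive a slightly lower advantage.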

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Left: Benchmark results across five math reasoning datasets, where UCAS consistently outperforms RLVR baselines trained on models of the same parameter scale. Right: Training trajectories of UCAS and GRPO on Qwen2.5-Math-7B. Under UCAS, response length and generation entropy decline initially and then rise as training progresses. Under GRPO, entropy trends continually downward, reflecting entropy collapse.
  • Figure 2: Overview of the UCAS Advantage Shaping Mechanism. UCAS refines the uniform GRPO advantage through a two-stage process. Stage 1 (Macro-level): It applies Response-Level Modulation using the trajectory's overall self-confidence to determine its strategic value for exploration vs. exploitation. Stage 2 (Micro-level): It introduces a Token-Level Certainty Penalty using raw logits to discourage local overconfidence. The final shaped advantage $\hat{A}^{\text{UCAS}}_{i,t}$ guides a more nuanced policy update.
  • Figure 3: Training dynamics of UCAS compared with GRPO across both 7B and 1.5B models. Left: Reward; Middle: Response Length; Right: Generation Entropy.
  • Figure 4: Comparison of pass@k results on the AIME24 Benchmark.
  • Figure 5: Confidence dynamics before and after UCAS training on the MATH and Olympiad datasets.
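
For readers unfamiliar with the pass@k metric in Figure 4, the sketch below shows the widely used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ responses are sampled per problem and $c$ of them are correct. The captions do not say which estimator or sample count the paper uses, so treat this as an illustration of the metric rather than a description of the paper's evaluation; the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k responses drawn without
    replacement from n samples (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every draw of k contains a correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` evaluates to 0.5, since 3 of the 6 possible pairs contain the correct sample.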