No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang

TL;DR

The paper addresses the problem that zero-variance prompts yield vanishing advantages in reinforcement learning for LLM reasoning. It introduces RL-ZVP, an entropy-guided advantage shaping method that preserves and scales learning signals from prompts where all responses share the same reward, while keeping GRPO behavior on non-zero-variance prompts. By assigning direction based on correctness and magnitude via token entropy, RL-ZVP achieves substantial gains across six math benchmarks on two model scales, surpassing GRPO and prompt-filtering baselines even with fewer rollouts. This work demonstrates that zero-variance prompts can be a valuable source of learning signals for RLVR, offering improved data efficiency and more robust, long-form reasoning in LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward -- so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
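
To make the vanishing-advantage problem concrete, the following is a minimal sketch, not the paper's implementation. It assumes the common GRPO group normalization (reward minus group mean, divided by group standard deviation) and shows that every advantage is zero when all responses to a prompt share the same reward. The function `shaped_token_advantages` and its scale `alpha` are hypothetical: they illustrate only the idea stated above (direction from correctness, magnitude from token entropy), using a simple proportional rule that is an assumption and may differ from the exact RL-ZVP formula.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_token_advantages(rewards, token_entropies, alpha=0.1):
    """Hypothetical entropy-guided shaping (illustrative assumption only).

    rewards: one scalar reward per sampled response.
    token_entropies: one list of per-token entropies per response.
    """
    if max(rewards) != min(rewards):
        # Non-zero-variance prompt: fall back to the group-normalized
        # advantage, broadcast to every token of each response.
        adv = grpo_advantages(rewards)
        return [[a] * len(ents) for a, ents in zip(adv, token_entropies)]
    # Zero-variance prompt: the sign comes from correctness, the
    # magnitude from token entropy (assumed proportional rule).
    sign = 1.0 if rewards[0] > 0 else -1.0
    return [[sign * alpha * h for h in ents] for ents in token_entropies]


# All four responses are correct, so the normalized advantages vanish.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# The shaped variant still assigns a non-zero, entropy-scaled signal.
entropies = [[0.5, 1.0], [0.2]]
print(shaped_token_advantages([1.0, 1.0], entropies))
print(shaped_token_advantages([0.0, 0.0], entropies))
```

In this sketch, all-correct groups receive positive token-level advantages and all-incorrect groups receive negative ones, while mixed groups are handled exactly as in the normalized baseline.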

Paper Structure

This paper contains 23 sections, 55 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Left: RL-ZVP employs an entropy-guided advantage shaping formula to extract learning signals from zero-variance prompts. For non-zero-variance prompts, it reverts to the standard GRPO formulation. Right: RL-ZVP demonstrates significantly higher average accuracy than GRPO on both Qwen3-1.7B-Base and Qwen3-8B-Base across six math reasoning benchmarks.
  • Figure 2: Rollout generation overhead as a percentage of total time of the training step.
  • Figure 3: The ratio of zero-variance prompts across two experimental scales.
  • Figure 4: Average accuracy (a) and pass rate (b) on six math reasoning benchmarks. RL-ZVP consistently delivers the strongest performance among all baselines.
  • Figure 5: Validation accuracy and training dynamics at different experiment scales. Each row shows Acc@8, entropy, and response length during training for Qwen3-1.7B-Base (top) and Qwen3-8B-Base (bottom). RL-ZVP exhibits more consistent and stable trends than GRPO.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Remark 1: Advantage Vanishing
  • Remark 2: The Role of Advantage
  • Remark 3: Relationship to GRPO