Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning

Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

TL;DR

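Replacing the mean baseline with a group-wise K-quantile baseline (QAE) stabilizes RLVR training: under first-order softmax updates, it gives lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Baseline design, rather than token-level heuristics, is identified as the primary mechanism for scaling RLVR.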

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
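The two-regime gate described in the abstract can be illustrated with a minimal sketch for a single query with binary rewards. The function name `qae_advantages`, the default `k=0.4`, the `eps` value, and the choice to scale by the group standard deviation are assumptions made for illustration; the paper's exact estimator may differ.

```python
from statistics import pstdev


def qae_advantages(rewards, k=0.4, eps=1e-6):
    """Quantile-baseline advantages for one query's group of binary rewards.

    Hypothetical helper: name, defaults, and scaling are illustrative only.
    """
    if len(rewards) < 2:
        raise ValueError("need a group of at least two responses")
    p = sum(rewards) / len(rewards)  # empirical success rate of the group
    # Two-regime gate from the abstract: hard queries (p <= 1 - K) use
    # baseline 0, easy queries (p > 1 - K) use baseline 1.
    baseline = 0.0 if p <= 1.0 - k else 1.0
    # Assumption: divide by the group standard deviation plus eps.
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Hard query (p = 0.25): only the two rare successes receive a signal.
hard = qae_advantages([1, 0, 0, 0, 0, 0, 0, 1])
# Easy query (p = 0.875): only the single remaining failure receives a signal.
easy = qae_advantages([1, 1, 1, 0, 1, 1, 1, 1])
```

In both groups every other response gets exactly zero advantage, which is the source of the sparsity the abstract reports. Groups whose rewards are all 0 or all 1 yield zero advantage everywhere under this sketch.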

Paper Structure

This paper contains 56 sections, 4 theorems, 29 equations, 11 figures, 2 tables.

Key Result

Proposition 4.0

Assume binary rewards, group size $G\!\ge\!2$, and the right-continuous empirical quantile. Using the standardized advantage in Eqs. eq:quantile-adv–eq:bern-quantile, the learning objective is (up to a constant factor depending on $\varepsilon$) equivalent to

Figures (11)

  • Figure 1: Entropy–performance dynamics on Qwen3-8B-Base. Left: DAPO with Clip-Higher prevents early collapse but triggers an early entropy spike (steps 10–80) and a later performance plateau. Right: our quantile baseline (QAE) stabilizes policy entropy and sustains pass@1 gains by steering training into a balanced exploration regime.
  • Figure 2: DAPO training dynamics on Qwen3-8B. Left: without Clip-Higher; Right: with Clip-Higher. Both settings show two phases: an early correlated growth between anthropomorphic token frequency and pass@1, followed by a decoupling and then a plateau. While Clip-Higher averts collapse, it does not prevent the later performance stall.
  • Figure 3: Evolution of high-entropy token usage under DAPO (steps 20/80/200). Early training exhibits diverse anthropomorphic tokens (e.g., wait, perhaps); by steps 80–200 the distribution homogenizes around rigid reasoning templates (e.g., so, let), indicating reduced exploratory diversity consistent with entropy explosion.
  • Figure 4: Quantile baseline reshapes weighting and entropy dynamics. Left: policy entropy over training, split by advantage sign; negative-advantage samples drive the surge. Middle/Right: query-level weights vs. success rate $p$; GRPO and DAPO use symmetric $\sqrt{p(1-p)}$ weighting, whereas our method applies a thresholded scheme ($K\!=\!0.4$).
  • Figure 5: Training dynamics and sparsity. (a) AIME'24 (Qwen3-8B): QAE boosts pass@1 while keeping pass@16 comparable, showing higher sample efficiency. (b) Entropy by sign: DAPO's explosion stems from negative-advantage samples; QAE suppresses it. (c) Response sparsity: 80% of responses have zero advantage, focusing updates on informative subsets.
  • ...and 6 more figures
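
The query-level weighting contrasted in the Figure 4 caption can be explored with a rough sketch. It assumes binary rewards, standardization by $\sqrt{p(1-p)}$, and that a query's weight is the total advantage mass placed on the responses that get updated; under those assumptions the mean baseline reproduces the symmetric $\sqrt{p(1-p)}$ weight named in the caption. The function names and the form of the thresholded weight are hypothetical and may not match the paper's definition.

```python
import math


def mean_baseline_weight(p):
    """Advantage mass on correct responses under a mean baseline (assumed form)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # degenerate group: every advantage is zero
    std = math.sqrt(p * (1.0 - p))
    # A fraction p of responses is correct, each with advantage (1 - p) / std.
    return p * (1.0 - p) / std


def quantile_baseline_weight(p, k=0.4):
    """Advantage mass on updated responses under a K-quantile baseline (assumed form)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    std = math.sqrt(p * (1.0 - p))
    if p <= 1.0 - k:
        # Hard query: baseline 0, a fraction p of successes with advantage 1 / std.
        return p / std
    # Easy query: baseline 1, a fraction 1 - p of failures with magnitude 1 / std.
    return (1.0 - p) / std


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(mean_baseline_weight(p), 3), round(quantile_baseline_weight(p), 3))
```

Under these assumptions the quantile weight switches form at $p = 1 - K$, which is one way to read the "thresholded scheme" in the caption.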

Theorems & Definitions (6)

  • Proposition 4.0: Quantile-regulated objective
  • Proposition 4.0: Two-regime entropy safety of $K$-quantile
  • Proposition A.0: Quantile-regulated objective
  • proof
  • Proposition A.0: Two-regime entropy safety of $K$-quantile
  • proof