Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin

TL;DR

This work probes the paradoxical exploration–exploitation dynamics in reinforcement learning with verifiable rewards (RLVR) for LLM reasoning. It develops a theoretical framework around Group Relative Policy Optimization (GRPO), deriving explicit bounds on clipping bias and introducing a one-step policy-entropy shift that links clipping to entropy changes. The authors show that clipping under random rewards primarily regularizes entropy rather than delivering a learning signal, and that entropy alone does not causally determine performance; they further present a reward-misalignment model to explain when random rewards can improve outcomes, with empirical validation across multiple model families. The findings suggest RLVR benefits depend on model strength and data difficulty, highlighting regime-specific dynamics and offering principled guidance for designing RLVR and entropy-related regularization strategies.

Abstract

This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: discouraging exploitation and discouraging exploration both improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
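The quantities named in the abstract (clipping, policy entropy, and group-relative advantages) can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this summary: the clipped term is the standard PPO-style surrogate with a symmetric threshold `eps`, advantages are rewards standardized within one group of sampled responses, and entropy is Shannon entropy over a toy categorical policy. All function names are hypothetical, and this is not the authors' implementation.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Assumption: symmetric clipping threshold eps.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def group_relative_advantages(rewards):
    """Center and scale rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:  # all rewards equal: no relative signal in this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

In this sketch, a positive advantage stops contributing extra objective once the ratio exceeds `1 + eps`, and a more peaked distribution yields a lower `policy_entropy` than a uniform one over the same support.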

Paper Structure

This paper contains 47 sections, 13 theorems, 177 equations, 9 figures, 1 algorithm.

Key Result

Lemma 2.2

Suppose that $\eta\ge 0$. Then, we have [...], where $\mu(\mathbf h)={\mathbb{E}}_{a \sim \pi_{\textnormal{old}}(\cdot \mid \mathbf h)}[\tilde{A}(\mathbf h,a)]$, $\sigma^2(\mathbf h)=\mathrm{Var}_{a \sim \pi_{\textnormal{old}}(\cdot \mid \mathbf h)}[\tilde{A}(\mathbf h,a)]$, and $C=\frac{1}{36\sqrt{3}(\pi_{\min})^3}$ does not depend on $\eta$.
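The quantities $\mu(\mathbf h)$, $\sigma^2(\mathbf h)$, and $C$ defined in the lemma can be computed directly for a toy discrete action set. The following Python sketch assumes that $\pi_{\textnormal{old}}(\cdot \mid \mathbf h)$ is given as a list of probabilities and that $\pi_{\min}$ is a positive lower bound on those probabilities (this summary does not define it). The function names are hypothetical.

```python
import math

def advantage_moments(pi_old, adv):
    """Mean and variance of the advantage under the categorical pi_old."""
    mu = sum(p * a for p, a in zip(pi_old, adv))
    var = sum(p * (a - mu) ** 2 for p, a in zip(pi_old, adv))
    return mu, var

def lemma_constant(pi_min):
    """C = 1 / (36 * sqrt(3) * pi_min**3); independent of the step size eta."""
    return 1.0 / (36.0 * math.sqrt(3.0) * pi_min ** 3)
```

Because $C$ scales as $(\pi_{\min})^{-3}$, halving `pi_min` in this sketch multiplies the constant by eight.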

Figures (9)

  • Figure 1: Independent trials over Qwen2.5-Math-7B on the MATH500 validation set. For performance validation subpanels (Left & Middle), each color represents a different run; the bold line shows the smoothed trajectory, and the faint line of the same color shows the corresponding raw individual run. All later figures follow the same plotting convention. Unclipped training (Left); clipped training (Middle); and clipping activation ratio during training (Right).
  • Figure 2: Policy entropy evolution of Qwen2.5-Math-7B under random-reward training, with results for unclipped training (Left) and clipped training (Middle); unclipped training with R1-Distill-Llama-8B, an example that leads to gradient explosion (Right).
  • Figure 3: Results on the AIME training set for QwQ-32B (Left), R1-Distill-Llama-8B (Middle-L), and Qwen2.5-Math-7B (Middle-R), together with one specific example showing that entropy minimization would lead to a sub-optimal policy in a noisier and more difficult training environment (Right).
  • Figure 4: Results of Qwen2.5-Math-1.5B under clipped training (Left); results of R1-Distill-Llama-8B under clipped training (Middle); percentage improvement (averaged over six independent runs) for different models under the same training and validation setup (Right).
  • Figure 5: All experiments follow the same setup as Figure 1, varying the threshold $\varepsilon$ with six independent runs for each setting: trials with clipping ratio $\varepsilon = 0.1$ (Left); trials with clipping ratio $\varepsilon = 0.15$ (Middle); and the ratio of clipping activations across $\varepsilon \in \{0.2, 0.15, 0.1\}$ (Right).
  • ...and 4 more figures
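Several captions above report a clipping activation ratio and compare it across $\varepsilon \in \{0.2, 0.15, 0.1\}$. A minimal Python sketch of one plausible definition follows. It assumes that an activation means an importance ratio falling outside $[1-\varepsilon, 1+\varepsilon]$; the paper's exact definition is not given in this summary. The ratios are hypothetical values chosen for illustration, not experimental data.

```python
def clip_activation_ratio(ratios, eps):
    """Fraction of importance ratios falling outside [1 - eps, 1 + eps]."""
    if not ratios:
        return 0.0
    outside = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return outside / len(ratios)

# Hypothetical importance ratios, for illustration only.
ratios = [0.78, 0.87, 0.95, 1.00, 1.04, 1.12, 1.17, 1.26]
for eps in (0.2, 0.15, 0.1):
    print(eps, clip_activation_ratio(ratios, eps))
```

Under this definition, a tighter threshold can only keep or increase the activation ratio for a fixed set of ratios, since the unclipped interval shrinks.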

Theorems & Definitions (32)

  • Remark 2.1
  • Lemma 2.2
  • Definition 2.3: Random reward
  • Lemma 2.4
  • Remark 2.5: Upper-clipping bias
  • Definition 2.6: Policy entropy
  • Remark 2.7
  • Definition 3.1
  • Theorem 3.2
  • Remark 3.3
  • ...and 22 more