Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin
TL;DR
This work probes the paradoxical exploration–exploitation dynamics in reinforcement learning with verifiable rewards (RLVR) for LLM reasoning. It develops a theoretical framework around Group Relative Policy Optimization (GRPO), deriving explicit bounds on clipping bias and introducing a one-step policy-entropy shift that links clipping to entropy changes. The authors show that clipping under random rewards primarily regularizes entropy rather than delivering a learning signal, and that entropy alone does not causally determine performance; they further present a reward-misalignment model to explain when random rewards can improve outcomes, with empirical validation across multiple model families. The findings suggest RLVR benefits depend on model strength and data difficulty, highlighting regime-specific dynamics and offering principled guidance for designing RLVR and entropy-related regularization strategies.
Abstract
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: discouraging exploitation and discouraging exploration both improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
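For readers unfamiliar with the quantities named above, the following is a minimal sketch of the three ingredients the abstract refers to: group-relative advantages, the PPO-style clipped surrogate commonly used in GRPO, and policy entropy. It is an illustration under stated assumptions, not the paper's derivation or implementation: the function names are hypothetical, the clipping range of 0.2 is an illustrative default, and the per-token and KL terms of a full GRPO objective are omitted.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Hypothetical helper; uses the population standard deviation of the group.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    clip_eps = 0.2 is an assumed illustrative value, not taken from the paper.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def policy_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution over tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # Rewards here are arbitrary labels, standing in for a spurious reward
    # that is unrelated to correctness.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print("advantages:", advantages)
    # With a positive advantage, a ratio above 1 + clip_eps is capped.
    print("clipped term:", clipped_surrogate(1.5, advantages[0]))
    # A sharper distribution has lower entropy than a uniform one.
    print("entropy (uniform):", policy_entropy([0.25, 0.25, 0.25, 0.25]))
    print("entropy (peaked): ", policy_entropy([0.85, 0.05, 0.05, 0.05]))
```

The clipping step is asymmetric in effect: it caps the gain from raising the probability of positively rewarded samples and from lowering that of negatively rewarded ones, which is the mechanism the abstract connects to changes in policy entropy.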
