The Invisible Leash: Why RLVR May or May Not Escape Its Origin
Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
TL;DR
The paper investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) truly expands a base model's reasoning or merely reinforces existing high-reward patterns. It introduces an empirical support framework and a set of metrics to quantify how RLVR changes access to correct solutions across diverse domains and model scales. The findings show RLVR predominantly preserves the base model's solution coverage, with limited expansion and consistent shrinkage, while improving single-shot precision and reducing answer-level diversity. The authors argue that breaking RLVR's invisible leash likely requires explicit exploration or hybrid strategies that seed probability mass into underrepresented solution regions, with practical implications for designing more capable reasoning systems.
Abstract
Recent advances in LLMs highlight RLVR as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows, improving precision. This study presents an empirical investigation that provides fresh insights into the potential limits of the common practice of RLVR. We examine how, under current training conditions, RLVR can operate as a support-constrained optimization mechanism that remains constrained by the base model's initial distribution and may therefore restrict the discovery of entirely original solutions. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, so that RLVR-trained models fail to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy (resulting in greater uncertainty at each generation step), answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
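To make the quantities in the abstract concrete, the following is a minimal sketch of how pass@1, empirical-support shrinkage and expansion, and answer-level entropy could be estimated from sampled final answers. It is an illustration under stated assumptions, not the authors' implementation: it assumes a problem counts as "covered" when at least one of k samples matches the reference answer, and the function names, the toy data, and the exact-match check are all hypothetical. The paper's own definitions may differ.

```python
from collections import Counter
from math import log


def covered_problems(samples_by_problem, gold):
    """Problems where at least one sampled answer matches the reference.

    Assumption: this stands in for 'empirical support' at a fixed sampling budget.
    """
    return {pid for pid, answers in samples_by_problem.items() if gold[pid] in answers}


def support_change(base_samples, rlvr_samples, gold):
    """Split problems into preserved, lost (shrinkage), and gained (expansion)."""
    base = covered_problems(base_samples, gold)
    rlvr = covered_problems(rlvr_samples, gold)
    return {
        "preserved": base & rlvr,
        "shrinkage": base - rlvr,   # solvable by the base model only
        "expansion": rlvr - base,   # solvable by the RLVR model only
    }


def mean_pass_at_1(samples_by_problem, gold):
    """Average fraction of correct samples per problem (a simple pass@1 estimate)."""
    rates = [
        sum(a == gold[pid] for a in answers) / len(answers)
        for pid, answers in samples_by_problem.items()
    ]
    return sum(rates) / len(rates)


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over final answers."""
    n = len(answers)
    return -sum((c / n) * log(c / n) for c in Counter(answers).values())


# Hypothetical toy data: four sampled final answers per problem.
gold = {"p1": "4", "p2": "9", "p3": "1"}
base = {"p1": ["4", "5", "4", "7"], "p2": ["9", "3", "1", "2"], "p3": ["8", "8", "6", "0"]}
rlvr = {"p1": ["4", "4", "4", "4"], "p2": ["3", "3", "3", "3"], "p3": ["6", "6", "6", "6"]}

change = support_change(base, rlvr, gold)
print("pass@1:", mean_pass_at_1(base, gold), "->", mean_pass_at_1(rlvr, gold))
print("shrinkage:", change["shrinkage"], "expansion:", change["expansion"])
print("answer entropy on p1:", answer_entropy(base["p1"]), "->", answer_entropy(rlvr["p1"]))
```

In this toy setup the fine-tuned model answers p1 correctly every time, so the pass@1 estimate rises, yet it no longer reaches the rare correct answer on p2 and its answers collapse to a single value per problem. That mirrors, in miniature, the pattern the abstract describes: higher precision alongside shrinking support and lower answer-level diversity.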
