Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang

TL;DR

The method, Velocity-Exploiting Rank-Learning (VERL), is the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function, leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure.

Abstract

A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named ER Velocity (ERV) and ER Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that in the semantic space, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.

Paper Structure

This paper contains 46 sections, 7 theorems, 68 equations, 23 figures, 7 tables, 1 algorithm.

Key Result

Theorem 3.1

Suppose we have a matrix of embeddings $\mathbf{Z}\in\mathbb{R}^{T \times D}$. Then the ER of $\mathbf{Z}$ is a lower bound of the conventional rank of $\mathbf{Z}$: $\mathrm{ER}(\mathbf{Z}) \le \mathrm{rank}(\mathbf{Z})$.
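The bound in Theorem 3.1 can be checked numerically. The sketch below assumes the common definition of effective rank as the exponential of the Shannon entropy of the normalized singular values; the paper's exact definition is not reproduced on this page, so treat `effective_rank` and its `eps` parameter as illustrative names rather than the authors' implementation. It also assumes a non-zero input matrix.

```python
import numpy as np

def effective_rank(Z, eps=1e-12):
    """Effective rank: exp of the entropy of the normalized singular values.

    Assumed definition (not taken from this page); Z must be non-zero.
    """
    s = np.linalg.svd(Z, compute_uv=False)   # singular values, descending
    p = s / (s.sum() + eps)                  # normalize to a distribution
    p = p[p > eps]                           # drop numerically zero entries
    entropy = -(p * np.log(p)).sum()         # Shannon entropy in nats
    return float(np.exp(entropy))

# Build a T x D embedding matrix whose conventional rank is 3 by construction.
rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 8))

er = effective_rank(Z)
rank = int(np.linalg.matrix_rank(Z))
print(f"effective rank = {er:.3f}, conventional rank = {rank}")
```

Under this definition the entropy of a distribution over $k$ non-zero singular values is at most $\log k$, which is why the effective rank cannot exceed the conventional rank and equals it only when all non-zero singular values are equal.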

Figures (23)

  • Figure 1: Comparative analysis of the responses of DeepSeek-R1-Distill-Qwen-7B on the simpleRL test dataset (Zeng et al., 2025). (a) Traditional metrics for exploitation and exploration are constrained by negative coupling, leading to meandering progress for both capabilities. (b) Our metrics are mutually independent. (c) Training regularization with our metrics demonstrates stronger performance in both exploitation (small K) and exploration (large K).
  • Figure 2: Response-level metrics during GRPO post-training, smoothed with a 10-step rolling window. Metrics are shown for the Overall batch, as well as for subsets of Correct and Incorrect samples. The rightmost column displays the average Critic Score (reward) and Response Length per batch.
  • Figure 3: Visualization of dataset-level metrics during GRPO post-training. The figure compares Traditional metrics with our proposed metrics. Also shown are the Validation Score and sample Correctness, both averaged over the validation dataset.
  • Figure 4: Overview of VERL. Exploration is quantified by computing the ER of the rolling-done hidden states via SVD, while exploitation is captured through EMA-smoothed first-order difference (ERV) on per-step rolling hidden state and extended to second-order difference (ERA). Finally, exploration and exploitation are adaptively integrated to derive the auxiliary advantage.
  • Figure 5: Comparison of various hyperparameters with Llama-3.2-3B-Instruct. It shows that the model performs best with a stride of 40 in (a) and with $\kappa = 2$ in (b). We adopt these settings for all subsequent experiments. Moreover, (c) indicates that using only one signal, either exploration or exploitation, leads to suboptimal performance, demonstrating the effectiveness of our method.
  • ...and 18 more figures
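The pipeline described in the Figure 4 caption (ER via SVD, an EMA-smoothed first-order difference for ERV, and a second-order difference for ERA) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `ema_decay` parameter, and the choice to compute ER on growing prefixes of the hidden states are all hypothetical. Only the stride of 40 is taken from the Figure 5 caption, and the entropy-based ER definition is the common one, assumed here.

```python
import numpy as np

def effective_rank(Z, eps=1e-12):
    """Assumed ER definition: exp of the entropy of normalized singular values."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

def er_dynamics(H, stride=40, ema_decay=0.9):
    """Return (ER, ERV, ERA) for hidden states H of shape (T, D).

    ER is evaluated on prefixes H[:t] every `stride` tokens (an assumption).
    ERV is the EMA-smoothed first difference of ER; ERA is the first
    difference of ERV, i.e. a second-order difference of ER.
    `ema_decay` is a hypothetical smoothing parameter.
    """
    ends = range(stride, H.shape[0] + 1, stride)
    er = np.array([effective_rank(H[:t]) for t in ends], dtype=float)

    raw_velocity = np.diff(er)               # first-order difference
    erv = np.empty_like(raw_velocity)
    running = 0.0
    for i, dv in enumerate(raw_velocity):
        # Seed the EMA with the first observation, then smooth.
        running = dv if i == 0 else ema_decay * running + (1 - ema_decay) * dv
        erv[i] = running

    era = np.diff(erv)                       # second-order difference
    return er, erv, era

# Toy input: 200 random "hidden states" of width 32.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 32))
er, erv, era = er_dynamics(H)
print(er.shape, erv.shape, era.shape)
```

How these three signals are combined into the auxiliary advantage is specific to the paper and is not sketched here.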

Theorems & Definitions (17)

  • Theorem 3.1
  • Remark 3.2
  • Definition 3.3
  • Definition 3.4
  • Proposition 3.5
  • Remark 3.6
  • Proposition F.1: Token-level exploitation and exploration are tightly coupled
  • proof
  • Proposition F.2: Hidden-state metrics are structurally decoupled
  • proof
  • ...and 7 more