Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

TL;DR

This work tackles the challenge of steering the exploration-exploitation trade-off in RL-tuned LLMs by introducing Token Hidden Reward (THR), a token-level metric that measures each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). THR reveals that a small subset of tokens disproportionately drives training dynamics, and that a THR-guided reweighting of token advantages, controlled by a parameter $p$, can bias learning toward exploitation ($p>0$, amplifying positive-THR tokens) or exploration ($p<0$, amplifying negative-THR tokens). Empirical results on math benchmarks show that THR-based adjustments improve greedy-decoding accuracy in the exploitation setting and Pass@K in the exploration setting, and that the approach carries over to GSPO and to other model families such as Llama. The work also connects THR to entropy-based exploration and highlights cross-token interactions as a key factor. Together, these findings offer a fine-grained mechanism for controlling exploration and exploitation in reinforcement learning with verifiable rewards (RLVR).
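To make the role of $p$ concrete, here is a minimal sketch of sign-based advantage reweighting. The summary does not give the paper's exact weighting formula, so everything specific here is an assumption for illustration: the function name `reweight_advantages`, the `top_frac` cutoff used to pick high-influence tokens, and the weight `1 + p * sign(THR)`. It only reproduces the qualitative behavior described above (positive-THR tokens amplified for $p>0$, negative-THR tokens amplified for $p<0$).

```python
import numpy as np

def reweight_advantages(advantages, thr, p, top_frac=0.2):
    """Scale per-token advantages by THR sign (illustrative, not the paper's formula).

    advantages, thr: 1-D arrays with one entry per token.
    p: steering knob; p > 0 favors positive-THR tokens, p < 0 negative-THR tokens.
    top_frac: hypothetical fraction of tokens treated as "high influence".
    """
    advantages = np.asarray(advantages, dtype=float)
    thr = np.asarray(thr, dtype=float)
    if advantages.shape != thr.shape:
        raise ValueError("advantages and thr must have the same shape")

    # Select the tokens with the largest |THR|; ties at the cutoff are all kept.
    k = max(1, int(np.ceil(top_frac * thr.size)))
    cutoff = np.sort(np.abs(thr))[-k]
    high = np.abs(thr) >= cutoff

    # Assumed weighting: 1 + p for positive THR, 1 - p for negative THR.
    weights = np.ones_like(thr)
    weights[high] = 1.0 + p * np.sign(thr[high])
    return advantages * weights

adv = [1.0, 1.0, 1.0, 1.0]
thr = [0.9, -0.8, 0.01, -0.02]
exploit = reweight_advantages(adv, thr, p=0.5, top_frac=0.5)
explore = reweight_advantages(adv, thr, p=-0.5, top_frac=0.5)
```

With these toy inputs, only the first two tokens pass the cutoff. Their weights swap between the two settings while the low-|THR| tokens are left unchanged.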

Abstract

Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
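The abstract reports Pass@K accuracy without saying how it is estimated. A common choice, assumed here rather than taken from the paper, is the unbiased estimator computed from `n` sampled responses per question of which `c` are correct: $1 - \binom{n-c}{k} / \binom{n}{k}$. The names `pass_at_k`, `n`, `c`, and `k` are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n samples with c correct (assumed estimator)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample, complemented.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_question, k):
    """Average Pass@K over a list of (n, c) pairs, one pair per question."""
    return sum(pass_at_k(n, c, k) for n, c in per_question) / len(per_question)

single = pass_at_k(10, 3, 1)
averaged = mean_pass_at_k([(4, 2), (4, 0)], k=2)
```

Under this estimator, Pass@1 with many samples approximates the expected accuracy of sampled decoding, while larger K rewards a model that keeps probability mass on diverse correct solutions. That is why Pass@K is a natural measure of exploration.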

Paper Structure

This paper contains 26 sections, 2 theorems, 25 equations, 10 figures, 10 tables.

Key Result

Theorem 3.1

For any question $\bm{x}$, at any time $t\geq0$ of training, and any correct response $\bm{y}_i^+, i\in[N^+]$, in addition to its dependence on the token unembeddings, the likelihood change $\frac{d}{dt} \ln \pi_{\theta(t)} (\bm{y}_i^+ | \bm{x})$ decreases as the following quantity increases: Here, the weights $\alpha^\pm_{k,k'}$ quantify the similarity of token-level prediction errors across re

Figures (10)

  • Figure 1: Our THR algorithm identifies high-influence tokens and reweights their learning signals based on sign: when $p > 0$, positive THR tokens are amplified (exploitation); when $p < 0$, negative THR tokens are amplified (exploration). The figure demonstrates control of the exploration-exploitation trade-off.
  • Figure 2: Density of THR scores for Qwen2.5-Math-1.5B. For both correct responses (a) and incorrect responses (b), we observe that only a small subset of tokens exhibits significantly high THR values. Notably, both types of responses contain tokens with both positive and negative THR scores.
  • Figure 3: Overlap between high THR and high entropy tokens. For each sample, we quantify the overlap between tokens with high THR and high entropy, and plot the resulting density. The distribution shows a pronounced peak near 90%, highlighting a strong token-level association between these two metrics.
  • Figure 4: Pass@K performance of THR applied to GSPO with Qwen2.5-Math-1.5B across different K, averaged over the AIME 2024, AIME 2025, and AMC23 datasets.
  • Figure 5: Pass@K performance of different methods on Llama3.2-3B-Instruct across different K, averaged over the AIME 2024, AIME 2025, and AMC23 datasets.
  • ...and 5 more figures
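
The overlap statistic in Figure 3 can be sketched as follows. The captions do not define "high" THR, "high" entropy, or the exact overlap measure, so this sketch assumes one plausible reading: take the top `frac` of tokens by |THR| and by predictive entropy, then report the share of high-THR tokens that are also high-entropy. The function names and the `frac` parameter are hypothetical.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (nats) of each token's next-token distribution.

    probs: array of shape (num_tokens, vocab_size) whose rows sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    safe = np.clip(probs, 1e-12, 1.0)  # avoid log(0); zero-probability terms contribute 0
    return -(probs * np.log(safe)).sum(axis=-1)

def top_indices(scores, frac):
    """Indices of the top `frac` fraction of entries in `scores` (at least one)."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(np.ceil(frac * scores.size)))
    return set(np.argsort(scores)[-k:].tolist())

def overlap_ratio(thr, entropy, frac=0.2):
    """Share of high-|THR| tokens that are also high-entropy (assumed definition)."""
    high_thr = top_indices(np.abs(np.asarray(thr, dtype=float)), frac)
    high_ent = top_indices(entropy, frac)
    return len(high_thr & high_ent) / len(high_thr)

thr = [0.9, -0.8, 0.01, 0.02]
full = overlap_ratio(thr, [2.0, 1.5, 0.1, 0.2], frac=0.5)
half = overlap_ratio(thr, [2.0, 0.1, 1.5, 0.2], frac=0.5)
```

Computing this ratio per sample and plotting its density over a dataset would give a distribution comparable in spirit to the one the caption describes.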

Theorems & Definitions (3)

  • Theorem 3.1
  • Definition 4.1
  • Corollary 4.2