Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
TL;DR
This work addresses how to steer the exploration-exploitation trade-off in RL-tuned LLMs by introducing Token Hidden Reward (THR), a token-level metric that measures each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). THR reveals that a small subset of tokens disproportionately drives training dynamics, and that a THR-guided reweighting of token advantages can bias learning toward exploitation ($p>0$) or exploration ($p<0$). Empirical results on math benchmarks show that THR-based adjustments improve greedy-decoding accuracy when biased toward exploitation and Pass@K accuracy when biased toward exploration, and that the approach carries over to GSPO and to other model families such as Llama. The work also connects THR to entropy-based exploration, identifies cross-token interactions as a key factor, and positions THR as a versatile tool for targeted fine-tuning in reasoning-intensive tasks. Together, these findings offer a fine-grained mechanism for controlling exploration and exploitation in reinforcement learning with verifiable rewards (RLVR), enabling more deliberate and effective reasoning improvements in LLMs.
Abstract
Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR values and weakening those with negative values, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
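The reweighting idea described above can be illustrated with a minimal sketch. The abstract does not give the functional form of the reweighting, so the rule below (scaling each token advantage by `1 + p * sign(THR)`) is an assumption chosen only to match the stated behavior: `p > 0` amplifies positive-THR tokens and weakens negative-THR ones, and `p < 0` does the reverse. The function name `reweight_advantages` and the requirement `-1 < p < 1` are likewise hypothetical, and the THR scores are taken as given inputs rather than computed.

```python
from typing import List, Sequence


def sign(x: float) -> int:
    """Return +1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def reweight_advantages(
    advantages: Sequence[float],
    thr: Sequence[float],
    p: float,
) -> List[float]:
    """Scale per-token advantages by 1 + p * sign(THR).

    ASSUMPTION: this functional form is illustrative only; it is not
    taken from the paper. It reproduces the qualitative behavior in the
    abstract: p > 0 boosts positive-THR tokens (exploitation), p < 0
    boosts negative-THR tokens (exploration).
    """
    if len(advantages) != len(thr):
        raise ValueError("advantages and thr must have the same length")
    if not -1.0 < p < 1.0:
        # Keeps every weight strictly positive, so no advantage flips sign.
        raise ValueError("p must lie strictly between -1 and 1")
    return [a * (1.0 + p * sign(t)) for a, t in zip(advantages, thr)]


# Placeholder inputs: three tokens with equal advantage and
# positive, negative, and zero THR respectively.
exploit = reweight_advantages([1.0, 1.0, 1.0], [0.5, -0.2, 0.0], p=0.5)
explore = reweight_advantages([1.0, 1.0, 1.0], [0.5, -0.2, 0.0], p=-0.5)
```

In this toy setting the exploitation-biased call scales the three advantages to 1.5, 0.5, and 1.0, while the exploration-biased call gives 0.5, 1.5, and 1.0. Tokens with zero THR are left unchanged under this assumed rule.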
