GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Hongze Tan, Jianfei Pan, Jinghao Lin, Tao Chen, Zhihang Zheng, Zhihao Tang, Haihua Yang
TL;DR
Dynamic Entropy Weighting addresses coarse-grained credit assignment in RL-based LLM reasoning. It introduces GTPO for token-level and GRPO-S for sequence-level entropy-weighted rewards, leveraging policy entropy as a cue for cognitive effort. Empirical results on AIME benchmarks show significant improvements over GRPO and DAPO baselines, especially for smaller models, due to better credit assignment and exploration. The work suggests entropy-driven reward shaping as a principled path to more robust long-chain reasoning in LLMs.
Abstract
Reinforcement learning (RL) is a pivotal technique for enhancing Large Language Model (LLM) reasoning. Conventional algorithms, however, typically adhere to a coarse-grained credit assignment paradigm, applying a uniform reward to all tokens in a sequence, which is a critical flaw in long-chain reasoning tasks. In this paper, we address this challenge and propose Dynamic Entropy Weighting, a novel mechanism that facilitates fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token, and its sequence-level analogue, Sequence-Level GRPO (GRPO-S), which assigns an entropy-weighted reward to each sequence. Our approach is founded on the hypothesis that high policy entropy within a reasoning path is a powerful heuristic for cognitive effort at pivotal junctures, and that this signal can be repurposed for reward shaping to achieve true per-token credit assignment. Experimental results across challenging reasoning benchmarks validate our approach, showing that our methods significantly outperform a strong DAPO baseline and confirming our entropy-weighting mechanism as the key driver of this performance boost.
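As a rough illustration of the idea of entropy-weighted token rewards, the Python sketch below computes the policy entropy at each token position from logits and redistributes a single sequence-level reward across tokens according to their entropy. It is not the paper's formulation, which the abstract does not specify: the function names, the mixing coefficient `alpha`, and the normalisation by mean entropy are all assumptions made for illustration.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


def entropy_weighted_token_rewards(seq_reward, entropies, alpha=0.5):
    """Hypothetical shaping rule (an assumption, not the paper's formula).

    Each token gets the sequence reward scaled by a weight that mixes a
    uniform share (1 - alpha) with an entropy-proportional share (alpha).
    Weights are normalised by the mean entropy, so the mean token reward
    equals seq_reward.
    """
    entropies = np.asarray(entropies, dtype=float)
    mean_h = entropies.mean()
    if mean_h <= 0.0:
        # Fully deterministic policy: fall back to a uniform reward.
        weights = np.ones_like(entropies)
    else:
        weights = entropies / mean_h
    return seq_reward * ((1.0 - alpha) + alpha * weights)


# Two token positions: one maximally uncertain, one nearly deterministic.
logits = np.array([[0.0, 0.0, 0.0, 0.0],
                   [10.0, 0.0, 0.0, 0.0]])
entropies = token_entropy(logits)
rewards = entropy_weighted_token_rewards(1.0, entropies)
```

In this sketch the high-entropy position receives the larger share of the reward, and setting `alpha=0` recovers the uniform per-token reward of the coarse-grained baseline. A sequence-level variant could plausibly apply the same weighting to each sequence's mean entropy within a sampled group, though that too is an assumption here.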
