GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Hongze Tan, Jianfei Pan, Jinghao Lin, Tao Chen, Zhihang Zheng, Zhihao Tang, Haihua Yang

TL;DR

Dynamic Entropy Weighting addresses coarse-grained credit assignment in RL-based LLM reasoning. It introduces GTPO for token-level and GRPO-S for sequence-level entropy-weighted rewards, leveraging policy entropy as a cue for cognitive effort. Empirical results on AIME benchmarks show significant improvements over GRPO and DAPO baselines, especially for smaller models, due to better credit assignment and exploration. The work suggests entropy-driven reward shaping as a principled path to more robust long-chain reasoning in LLMs.

Abstract

Reinforcement learning (RL) is a pivotal technique for enhancing Large Language Model (LLM) reasoning. Conventional algorithms, however, typically adhere to a coarse-grained credit assignment paradigm, applying a uniform reward to all tokens in a sequence, a critical flaw in long-chain reasoning tasks. In this paper, we address this challenge and propose Dynamic Entropy Weighting, a novel mechanism that facilitates fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token, and the analogous algorithm Sequence-Level GRPO (GRPO-S). Our approach is founded on the hypothesis that high policy entropy within a reasoning path is a powerful heuristic for cognitive effort at pivotal junctures, which can be repurposed into a learning signal. By repurposing policy entropy for reward shaping, we achieve true per-token credit assignment. Experimental results across challenging reasoning benchmarks show that our methods significantly outperform a strong DAPO baseline and confirm our entropy-weighting mechanism as the key driver of this performance boost.
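To make the token-level idea concrete, the following is a minimal sketch of entropy-weighted reward shaping. It is not the paper's formulation: the function names, the `alpha` parameter, and the specific weighting rule (each token's share of the total entropy) are illustrative assumptions. The sketch only shows how a single sequence-level reward could be redistributed so that high-entropy tokens receive a larger share.

```python
import numpy as np


def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits has shape (T, V): T generated tokens, V vocabulary entries.
    """
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


def entropy_weighted_token_rewards(entropies, sequence_reward, alpha=1.0, eps=1e-8):
    """Hypothetical shaping rule (an assumption, not taken from the paper).

    Each token gets the sequence reward scaled by a weight that grows with
    the token's share of the total entropy. The weights average to about 1,
    so the mean per-token reward stays close to the sequence reward.
    """
    h = np.asarray(entropies, dtype=float)
    share = h / (h.sum() + eps)                       # fraction of total entropy
    weights = 1.0 + alpha * (len(h) * share - 1.0)    # mean weight is ~1
    return sequence_reward * weights


if __name__ == "__main__":
    # Three tokens: one confident step and two uncertain "decision" steps.
    rewards = entropy_weighted_token_rewards([0.1, 0.9, 1.0], sequence_reward=1.0)
    print(rewards)
```

With `alpha=0` the rule reduces to the uniform per-token reward that the abstract describes as the conventional paradigm. With a negative sequence reward, the same weights penalize high-entropy tokens more strongly.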

Paper Structure

This paper contains 38 sections, 38 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Conceptual illustration of reward assignment. (a) Traditional methods assign a uniform reward based on the final outcome. In contrast, our methods use Dynamic Entropy Weighting to refine credit assignment, yielding superior performance (b): (c) GTPO rewards high-entropy tokens in correct sequences while suppressing them in incorrect ones, and (d) GRPO-S rewards correct sequences with higher average entropy while penalizing incorrect paths.
  • Figure 2: A high-level comparison of the reward signaling process. Conventional methods like GRPO/DAPO use a static reward model to assign a uniform reward to an entire sequence. Our framework, encompassing GTPO and GRPO-S, introduces a Dynamic Entropy Weighting module that reshapes this signal into fine-grained rewards at either the token or sequence level before it is used by the policy model.
  • Figure 3: Hyperparameter comparison.
  • Figure 4: Mean reward trajectories on the test sets. All curves are smoothed for visual clarity. Each row corresponds to a different experimental setting: (Top) AIME 2024 with Qwen2.5-32B, (Middle) AIME 2025 with Qwen2.5-32B, and (Bottom) AIME 2025 with Qwen2.5-7B. Columns show different metrics from left to right: Mean Reward, Mean Reward of Pass@2, Pass@8, and Pass@32. For brevity and to maintain visual clarity, the corresponding results for Pass@4 and Pass@16, which exhibit similar trends, are deferred to the appendix (see the Pass@4 and Pass@16 curve figures there).
  • Figure 5: The Entropy Rebound Phenomenon and its Effect on Response Length. Top Row: The policy entropy trajectories for experiments on (left to right) AIME 2024 with Qwen2.5-32B, AIME 2025 with Qwen2.5-32B, and AIME 2025 with Qwen2.5-7B. Our methods (GTPO, GRPO-S) exhibit a distinct entropy rebound after an initial dip, successfully counteracting the policy collapse observed in the DAPO baseline. Bottom Row: The corresponding average response length trajectories. The sustained exploration enabled by the entropy rebound directly manifests as an increase in the average response length, indicating more thorough and diverse reasoning.
  • ...and 4 more figures
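
The sequence-level behavior described in the Figure 1 caption (correct sequences with higher average entropy are rewarded more, incorrect paths are penalized) can be illustrated with a small sketch. The ±1 outcome encoding, the `beta` parameter, and the relative-entropy scaling are assumptions made for illustration and are not taken from the paper. The group normalization step follows the standard GRPO practice of standardizing rewards within a group of sampled responses.

```python
import numpy as np


def sequence_entropy_weighted_rewards(outcomes, mean_entropies, beta=0.5, eps=1e-8):
    """Hypothetical sequence-level shaping (an assumption, not the paper's rule).

    outcomes: +1 for a correct response, -1 for an incorrect one (assumed encoding).
    mean_entropies: average per-token policy entropy of each response.
    """
    r = np.asarray(outcomes, dtype=float)
    h = np.asarray(mean_entropies, dtype=float)
    rel = h / (h.mean() + eps)            # entropy relative to the group average
    return r * (1.0 + beta * (rel - 1.0))


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    # Four responses to one prompt: two correct, two incorrect.
    shaped = sequence_entropy_weighted_rewards(
        outcomes=[1, 1, -1, -1],
        mean_entropies=[0.2, 0.6, 0.2, 0.6],
    )
    print(shaped)
    print(group_normalized_advantages(shaped))
```

Under these assumptions, the higher-entropy correct response receives the largest shaped reward, and the higher-entropy incorrect response receives the largest penalty.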