Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv

TL;DR

This paper addresses the instability and reward sparsity of GRPO-based LLM fine-tuning for mathematical reasoning by introducing TEPO, which uses Markov Likelihood to connect group-level rewards to token-level updates. By passing sequence-level credit down to individual tokens through a geometric-mean importance ratio, TEPO reduces gradient variance and avoids the entropy-induced collapse common in critic-free GRPO settings. Theoretical analysis and experiments on seven math benchmarks show TEPO achieving state-of-the-art performance (notably on MATH-500) and improved training stability, with ablations indicating that token-level Markov-Likelihood guidance outperforms entropy-based regularization. The approach offers a practical path to more reliable, scalable reasoning in LLMs, with broader implications for policy optimization under sparse, long-horizon rewards.

Abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that uses Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
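
As a rough illustration of the ideas named above, the sketch below computes GRPO-style group-relative advantages and a geometric-mean (length-normalized) importance ratio from per-token log-probabilities. It is a minimal sketch under stated assumptions, not the paper's objective: the function names, the standardization with `eps`, and the PPO-style clipping with `clip_eps=0.2` are all illustrative choices that this summary does not specify.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def geometric_mean_ratio(logp_new, logp_old):
    """Length-normalized sequence importance ratio.

    The sequence likelihood is the product of per-token probabilities, so the
    sequence-level ratio is exp(sum of token log-ratios). Taking the n-th root
    (the geometric mean over n tokens) gives exp(mean of token log-ratios).
    """
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need two non-empty, equal-length log-prob lists")
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term (hypothetical choice; shared by all tokens of a response)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Toy group of four sampled responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

# Toy two-token response: each token's probability rises by a factor of 1.2,
# so the geometric-mean ratio is 1.2 up to floating-point rounding.
logp_old = [math.log(0.5), math.log(0.25)]
logp_new = [math.log(0.6), math.log(0.3)]
ratio = geometric_mean_ratio(logp_new, logp_old)
loss_term = -clipped_surrogate(ratio, adv[0])
```

Averaging the token log-ratios keeps the ratio on the same scale regardless of response length, whereas the raw product of token ratios grows or shrinks exponentially with length. That is one plausible reading of why a geometric mean would reduce gradient variance.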

Paper Structure

This paper contains 35 sections, 2 theorems, 30 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

For a softmax policy $\pi_\theta(a \mid s) \propto \exp(\phi_\theta(s,a))$:
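
The lemma's conclusion is not reproduced in this excerpt, and the sketch below does not attempt to reconstruct it. It only illustrates the softmax parameterization in the premise, using a standard identity: the gradient of $\log \pi_\theta(a \mid s)$ with respect to the logits $\phi_\theta(s,\cdot)$ is the one-hot vector for $a$ minus $\pi_\theta(\cdot \mid s)$. The helper names are hypothetical, and the finite-difference check is included only as a sanity test.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits phi(s, .)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_prob_grad(logits, action):
    """Analytic gradient of log pi(action | s) w.r.t. the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def numeric_grad(logits, action, h=1e-6):
    """Central finite-difference estimate of the same gradient, for checking."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[i] += h
        dn[i] -= h
        diff = math.log(softmax(up)[action]) - math.log(softmax(dn)[action])
        grads.append(diff / (2 * h))
    return grads

logits = [0.5, -1.0, 2.0]
analytic = log_prob_grad(logits, action=2)
numeric = numeric_grad(logits, action=2)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

The components of the analytic gradient sum to zero, since adding a constant to every logit leaves the softmax unchanged.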

Figures (5)

  • Figure 1: Performance on math reasoning benchmarks. Our method, shown as a dashed red line, stands out and remains best or near-best on most tasks, especially on MATH-500, and it also achieves a strong overall performance on average.
  • Figure 2: Clip Ratio over Steps
  • Figure 3: Comparative analysis of training dynamics for TEPO versus DAPO/GRPO with Clip-Higher. The left panel shows reward progression, while the right panel displays the gradient norm throughout training.
  • Figure 4: Training dynamics with different entropy regularization strategies. The complete TEPO demonstrates superior stability and performance compared to variants with maximum entropy or KL-divergence regularization.
  • Figure 5: Comparison of importance sampling strategies in sparse reward environments. Our approach demonstrates more effective utilization of sparse learning signals compared to REINFORCE and prefix-based importance sampling.

Theorems & Definitions (4)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof