Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv
TL;DR
This paper addresses the instability and reward-sparsity challenges of GRPO-based LLM fine-tuning for mathematical reasoning by introducing TEPO, which uses Markov Likelihood to connect group-level rewards to token-level updates. TEPO optimizes at the token level, assigning sequence-level credit to individual tokens through a geometric-mean importance ratio, which reduces gradient variance and avoids the entropy-induced collapse common in critic-free GRPO settings. Theoretical analysis and extensive experiments on seven math benchmarks show TEPO achieving state-of-the-art performance (notably on MATH-500) and improved training stability, with ablations confirming that token-level Markov-Likelihood guidance outperforms entropy-based regularization. The approach offers a practical path to more reliable, scalable reasoning in LLMs, with broad implications for policy optimization under sparse, long-horizon rewards.
Abstract
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT) reasoning. Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with individual tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
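To make the geometric-mean importance ratio mentioned in the TL;DR concrete, the sketch below computes it for one sampled response as exp of the mean per-token log-probability difference between the new and old policies, then applies the same group-normalized advantage to every token. This is a minimal illustration and not the paper's exact objective, which this section does not state: the function names, the mean/std advantage normalization, and the absence of clipping are all assumptions.

```python
import math


def geometric_mean_ratio(new_logprobs, old_logprobs):
    """Geometric mean of per-token importance ratios for one response.

    Equivalent to exp(mean_t(log pi_new(y_t) - log pi_old(y_t))).
    Hypothetical helper, not taken from the paper's code.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and equal in length")
    log_ratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group.

    Assumption: population mean/std normalization over the group.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def token_level_terms(new_logprobs, old_logprobs, advantage):
    """Give every token the same sequence-level ratio times the advantage.

    Assumption: unclipped surrogate terms, one per token in the response.
    """
    ratio = geometric_mean_ratio(new_logprobs, old_logprobs)
    return [ratio * advantage for _ in new_logprobs]


if __name__ == "__main__":
    # Two responses in a group: the first is rewarded, the second is not.
    advantages = group_relative_advantages([1.0, 0.0])
    terms = token_level_terms([-1.0, -2.0], [-2.0, -1.0], advantages[0])
    print(advantages, terms)
```

Because the ratio is averaged in log space over the response length, a single token with an extreme probability shift moves it far less than it would move a product of per-token ratios, which is one plausible reading of the variance reduction the TL;DR describes.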
