Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

Mufan Xu, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Muyun Yang, Tiejun Zhao, Min Zhang

TL;DR

Multi-token Policy Gradient Optimization (MPO) treats blocks of K consecutive tokens as single actions. Experiments show that MPO outperforms standard token-level policy gradient baselines and highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.

Abstract

Existing policy-gradient methods for auto-regressive language models typically treat each next token as a single action in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens, for example when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines and highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
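
To make the idea of treating K consecutive tokens as one action concrete, here is a minimal sketch of one plausible reading. It assumes the importance ratio of a block is the product of its per-token ratios (a sum in log space), and that this ratio feeds a PPO-style clipped surrogate. The function names `block_log_ratios` and `clipped_surrogate`, the zero-padding of the last block, and the clip range `eps=0.2` are illustrative assumptions, not details taken from the paper, whose actual formulation may differ.

```python
import numpy as np

def block_log_ratios(logp_new, logp_old, k):
    """Sum per-token log-ratios over consecutive blocks of k tokens.

    Hypothetical helper: assumes a block's importance ratio is the
    product of its token ratios, i.e. a sum of log-ratios.
    """
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    pad = (-len(diff)) % k  # pad the tail with zeros (log-ratio 0 means ratio 1)
    diff = np.concatenate([diff, np.zeros(pad)])
    return diff.reshape(-1, k).sum(axis=1)

def clipped_surrogate(log_ratios, advantages, eps=0.2):
    """PPO-style clipped objective, applied to per-block ratios (assumption)."""
    ratios = np.exp(log_ratios)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    adv = np.asarray(advantages, dtype=float)
    return float(np.mean(np.minimum(ratios * adv, clipped * adv)))

# Toy log-probabilities for a 5-token response under the new and old policy.
logp_new = [-1.0, -2.0, -0.5, -0.5, -1.0]
logp_old = [-1.1, -2.1, -0.5, -0.6, -1.0]

token_level = block_log_ratios(logp_new, logp_old, k=1)  # one action per token
block_level = block_log_ratios(logp_new, logp_old, k=2)  # one action per 2 tokens
objective = clipped_surrogate(block_level, advantages=np.ones(len(block_level)))
```

With `k=1` the sketch reduces to the usual token-level ratio, so the only change between the two settings is the granularity at which ratios are aggregated and clipped.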

Paper Structure (37 sections, 26 equations, 9 figures, 7 tables)

This paper contains 37 sections, 26 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Left: in reasoning tasks such as mathematical problem-solving or code generation, the model's decision process often spans blocks of tokens, such as equations or functions, rather than being determined by each token independently. Right: illustration of token-level (NTP/PPO) vs. block-level (MTP/MPO) optimization. MPO aggregates K tokens as a semantically meaningful block for prediction and optimization, thereby better capturing sequence structure and long-range dependencies.
  • Figure 2: (a) Implementation of the MPO warm-up and training process; (b) illustration of the united importance sampling ratio proposed in the MPO method; (c) comparison between single-token and multi-token optimization, where multi-token optimization jointly models contiguous tokens as a structural reasoning action.
  • Figure 3: Performance of the proposed MPO and baseline methods. MPO outperforms the baselines in most scenarios, demonstrating the effectiveness of aggregating block-wise information.
  • Figure 4: Comparison of the variance of importance sampling ratios and clip fraction during training.
  • Figure 5: Effect of MTP block size $K$ and decay rate $\lambda$ on training stability and performance. Extending the block size to $K=5$ and applying a moderate decay $\lambda=0.8$ produces the most stable and effective result.
  • ...and 4 more figures