Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu

TL;DR

SPO introduces segment-level credit assignment as a middle ground between token-level and trajectory-level RL for LLMs. It removes the need for a hard-to-train critic by estimating segment advantages $A_k^{\mathrm{seg}}$ with Monte Carlo (MC) rollouts. The framework has three components (flexible segment partition, MC-based segment advantage estimation, and policy optimization with segment advantages, including a probability-mask variant) and two instantiations: SPO-chain for short CoT and SPO-tree for long CoT. On GSM8K, SPO-chain improves accuracy by 6 to 12 percentage points over PPO and GRPO; on MATH500, SPO-tree improves accuracy by 7 to 11 percentage points over GRPO under 2K and 4K context evaluation, with tree-based estimation reducing the cost of MC estimation. The paper also compares against VinePPO, including on episode generation time. Overall, the approach offers a practical, critic-free route to segment-level credit assignment when training reasoning LLMs with RL.

Abstract

Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities. Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation because an accurate critic model is difficult to train. At the other extreme, trajectory-level methods (e.g., GRPO) rely solely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity. SPO achieves a better balance: it offers more precise credit assignment than trajectory-level methods and requires fewer estimation points than token-level methods, which enables accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, which achieves improvements of 6 to 12 percentage points in accuracy over PPO and GRPO on GSM8K; and (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation and achieves improvements of 7 to 11 percentage points over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
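
The critic-free, chain-based estimate mentioned in the abstract can be illustrated with a minimal Python sketch. It follows the rule stated in the Figure 2(a) caption: the value at each segment boundary is estimated from $N$ sampled completions, and the advantage of segment $k$ is $\hat{V}(s_{t_{k+1}})-\hat{V}(s_{t_k})$. The function names, the `rollout` callback, and the toy reward are hypothetical and are not taken from the SPO codebase.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical callback: completes a prefix with the current policy and
# returns the final reward of the finished trajectory.
Rollout = Callable[[Sequence[int], random.Random], float]


def mc_value(prefix: Sequence[int], rollout: Rollout,
             n_samples: int, rng: random.Random) -> float:
    """Monte Carlo value estimate: mean reward over n_samples completions."""
    return sum(rollout(prefix, rng) for _ in range(n_samples)) / n_samples


def chain_segment_advantages(tokens: Sequence[int],
                             boundaries: Sequence[int],
                             rollout: Rollout,
                             n_samples: int = 4,
                             seed: int = 0) -> List[float]:
    """Advantage of segment k = V_hat(end of segment) - V_hat(start of segment).

    `boundaries` holds the token indices t_1 < ... < t_{K+1}; segment k
    covers tokens[t_k:t_{k+1}]. No critic model is involved.
    """
    rng = random.Random(seed)
    values = [mc_value(tokens[:t], rollout, n_samples, rng) for t in boundaries]
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]


if __name__ == "__main__":
    # Toy, deterministic stand-in for a policy rollout (assumption for the
    # demo only): the "reward" is the sum of the prefix tokens.
    toy_rollout = lambda prefix, rng: float(sum(prefix))
    adv = chain_segment_advantages([1, 2, 3, 4, 5], [0, 2, 5], toy_rollout)
    print(adv)  # [3.0, 12.0]
```

Because the per-segment advantages telescope, they sum to the difference between the value estimate at the end of the response and the value estimate at the prompt. The number of value estimates grows with the number of segments rather than the number of tokens.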

Paper Structure

This paper contains 19 sections, 36 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Overview of SPO framework. Our framework consists of three components: segment partition, segment advantage estimation, and policy optimization, each of which can be implemented in different ways. This figure illustrates the cutpoint-based partition strategy used in SPO-chain, where partitioning occurs after a predetermined number of cutpoints. It also illustrates our probability-mask policy optimization method, which assigns the corresponding segment advantages specifically to the cutpoints instead of all tokens within a segment.
  • Figure 2: (a) Chain-based advantage estimation method. For each segment, we independently sample $N$ trajectories to estimate its value $V$. The advantage for segment $k$ is estimated as $\hat{V}(s_{t_{k+1}})-\hat{V}(s_{t_k})$. (b) Tree-based advantage estimation method. Trajectories are organized in a tree structure, where nodes sharing the same parent form a group with identical prompts and token counts (except for leaf nodes, whose token lengths may vary). This hierarchical organization facilitates the calculation of advantages within each group.
  • Figure 3: (a) Test accuracy comparison of different methods on GSM8K. Baseline results are from the VinePPO paper (Kazemnejad et al., 2024). (b) Episode generation time comparison between SPO-chain (int5) and VinePPO during training. (c) Validation accuracy of SPO-chain (int5) and GRPO during training.
  • Figure 4: (a) Variations of segment partition granularity (different cutpoint intervals). (b) Variations of segment partition strategies. (c) Ablation on probability-mask policy optimization strategy.
  • Figure 5: (a) Comparison of SPO-tree (6-6-6) and GRPO on MATH500 with a context size of 2K. (b) Variations of tree structures on GSM8K. (c) SPO-tree with different advantage methods on GSM8K.
  • ...and 7 more figures
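
The cutpoint-based partition and the probability-mask strategy described in the Figure 1 caption can be sketched in a few lines of Python. Several details are assumptions rather than statements from this page: a cutpoint is taken to be a token whose sampled probability falls below a threshold, a segment boundary is placed immediately after every `interval`-th cutpoint, and non-cutpoint tokens receive zero advantage under the mask. All function names and the threshold value are hypothetical.

```python
from typing import List, Sequence


def find_cutpoints(token_probs: Sequence[float], threshold: float = 0.9) -> List[int]:
    """Assumed definition: indices of tokens sampled with probability < threshold."""
    return [i for i, p in enumerate(token_probs) if p < threshold]


def partition_by_cutpoints(num_tokens: int, cutpoints: Sequence[int],
                           interval: int) -> List[int]:
    """Return boundaries [0, ..., num_tokens], closing a segment after
    every `interval` cutpoints (assumed: right after the cutpoint token)."""
    boundaries = [0]
    for count, idx in enumerate(cutpoints, start=1):
        if count % interval == 0 and idx + 1 < num_tokens:
            boundaries.append(idx + 1)
    boundaries.append(num_tokens)
    return boundaries


def token_advantages(num_tokens: int, boundaries: Sequence[int],
                     segment_advantages: Sequence[float],
                     cutpoints: Sequence[int],
                     use_mask: bool = True) -> List[float]:
    """Spread segment advantages over tokens.

    With use_mask=True, only cutpoint tokens inside a segment receive that
    segment's advantage; all other tokens get 0. With use_mask=False,
    every token in the segment receives it.
    """
    cut = set(cutpoints)
    out = [0.0] * num_tokens
    for k, adv in enumerate(segment_advantages):
        for t in range(boundaries[k], boundaries[k + 1]):
            if not use_mask or t in cut:
                out[t] = adv
    return out


if __name__ == "__main__":
    probs = [0.99, 0.5, 0.95, 0.3, 0.99, 0.6, 0.2, 0.99]  # made-up values
    cuts = find_cutpoints(probs)                       # [1, 3, 5, 6]
    bounds = partition_by_cutpoints(len(probs), cuts, interval=2)  # [0, 4, 7, 8]
    seg_adv = [0.5, -0.25, 0.0]                        # made-up segment advantages
    print(token_advantages(len(probs), bounds, seg_adv, cuts))
```

In this sketch, a larger `interval` yields fewer, longer segments, and the mask restricts the advantage signal to the low-probability tokens within each segment.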