CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang

TL;DR

CAPO tackles the coarse credit assignment problem in RL-based LLM fine-tuning by introducing deterministic, step-level credits generated by an off-the-shelf LLM as a Generative Process Reward Model. It prompts the GenPRM to critique each reasoning step in a single pass and uses voting over multiple critiques to stabilize error localization, then applies asymmetric reward shaping to balance process and outcome signals. Token-level rewards are localized and normalized to produce reliable per-token advantages, enabling effective policy optimization with a principled MORL-inspired objective. Across Llama and Qwen backbones on four math and three general reasoning benchmarks, CAPO consistently outperforms supervised finetuning and other RLVR methods, suggesting it fosters correct reasoning pathways and more robust exploration. The approach is simple, broadly applicable, and reproducible with publicly available models and prompts, offering practical impact for improving LLM reasoning in online RL settings.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment through value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, their feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, judging each step solely on its own correctness. This provides deterministic token-level credits that refine the tokens originally assigned identical rule-based rewards. To further enhance accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones, including Llama and Qwen models, show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO helps the model learn correct reasoning pathways that lead to correct answers.
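
The credit assignment idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the majority-vote variant, and the specific reward formula (a binary outcome reward scaled by `w_whole`, minus a penalty `w_process` on tokens of flagged steps) are assumptions made for illustration. Only the use of voting over several critiques, the intersection variant, and the two weights $W_\mathrm{whole}$ and $W_\mathrm{process}$ are named in the text and figure captions.

```python
from typing import Dict, List, Set


def vote_incorrect_steps(critiques: List[Set[int]], mode: str = "intersection") -> Set[int]:
    """Combine several critiques into one set of steps treated as incorrect.

    Each critique is the set of step indices that one GenPRM pass flagged.
    "intersection" keeps steps flagged by every critique; "majority" is a
    hypothetical alternative that keeps steps flagged by more than half.
    """
    if not critiques:
        return set()
    if mode == "intersection":
        return set.intersection(*critiques)
    if mode == "majority":
        counts: Dict[int, int] = {}
        for critique in critiques:
            for step in critique:
                counts[step] = counts.get(step, 0) + 1
        return {step for step, n in counts.items() if n > len(critiques) / 2}
    raise ValueError(f"unknown voting mode: {mode}")


def token_rewards(
    step_of_token: List[int],
    outcome_correct: bool,
    flagged_steps: Set[int],
    w_whole: float = 2.0,
    w_process: float = 5.0,
) -> List[float]:
    """Assign a reward to each token (assumed shaping, not the paper's formula).

    Every token receives the scaled outcome reward; tokens that belong to a
    flagged step are additionally penalized by w_process.
    """
    outcome = w_whole * (1.0 if outcome_correct else 0.0)
    return [
        outcome - (w_process if step in flagged_steps else 0.0)
        for step in step_of_token
    ]
```

In this sketch, tokens in steps that survive the vote are suppressed while the remaining tokens keep the full outcome reward, so correct portions of a rollout are not penalized for an error elsewhere. Intersection voting is the more conservative choice, since a step is penalized only when all critiques agree.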

Paper Structure

This paper contains 38 sections, 43 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Addressing the coarse credit assignment problem with CAPO. (a) A schematic illustrating the core limitation of RLVR, where a single final reward fails to provide granular feedback for the reasoning process. (b) The training dynamics of CAPO on Qwen2.5-7B, demonstrating effective learning in which increased exploration (longer responses) correlates with higher final accuracy.
  • Figure 2: An overview of our method, Credit Assignment Policy Optimization (CAPO). We use an LLM as a GenPRM to identify all incorrect steps within a model's generated rollout in a single pass. During credit assignment, we then suppress these erroneous steps, which prevents the correct portions of the sequence from being unfairly penalized and enables the model to learn correct reasoning pathways. We denote $W_\mathrm{whole}$ and $W_\mathrm{process}$ as C and P for short.
  • Figure 3: Performance analysis of CAPO with a varying number of critiques ($N \in \{2, 4, 8\}$) generated by LLM-as-GenPRM on Qwen2.5-1.5B CAPO-Intersection. (a) Performance trend on the math dataset. (b) Detailed performance on OOD benchmarks.
  • Figure 4: Training dynamics of CAPO (C=2,P=5) on Qwen2.5-1.5B. The figure plots the accuracy, response length, and mean per-token reward over training steps.
  • Figure 5: Effect of the number of samples generated by GenORM on GRPO performance. The plot illustrates the performance of GRPO when using a varying number of samples, $N \in \{2, 4, 8\}$, from the GenORM. Increasing the sample size $N$ yields unstable gains that are relatively small compared with those from using LLM-as-GenPRM in CAPO.