CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

Guofu Xie; Yunsheng Shi; Hongtao Tian; Ting Yao; Xiao Zhang

TL;DR

CAPO tackles the coarse credit assignment problem in RL-based LLM fine-tuning by using an off-the-shelf LLM as a Generative Process Reward Model (GenPRM) that produces deterministic, step-level credits. It prompts the GenPRM to critique every reasoning step in a single pass, votes over multiple critiques to stabilize error localization, and then applies asymmetric reward shaping to balance process and outcome signals. Token-level rewards are localized and normalized to produce reliable per-token advantages, which are optimized with an objective inspired by multi-objective RL (MORL). Across Llama and Qwen backbones on four math and three general reasoning benchmarks, CAPO consistently outperforms supervised fine-tuning and other RLVR methods, suggesting that it fosters correct reasoning pathways and more robust exploration. The approach is simple, broadly applicable, and reproducible with publicly available models and prompts, which makes it practical for improving LLM reasoning in online RL settings.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment through value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, their feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, judging each step solely on its own correctness. This provides deterministic token-level credits that refine the tokens originally assigned identical rule-based rewards. To further enhance accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones, such as Llama and Qwen models, show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO helps the model learn correct reasoning pathways that lead to correct answers.
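
To make the flow described above concrete, the following is a minimal Python sketch of two of its pieces: voting over several critiques to decide which steps are wrong, and turning that decision into per-token rewards. It is an illustration under stated assumptions, not the authors' implementation. The function names (vote_wrong_steps, token_rewards), the two voting modes, the representation of a critique as a set of flagged step indices, and the process_penalty value are all hypothetical; the abstract specifies neither the exact voting rule nor the reward magnitudes, and the normalization into advantages is left out.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def vote_wrong_steps(
    critiques: Sequence[Set[int]], num_steps: int, mode: str = "majority"
) -> Set[int]:
    """Aggregate the step indices flagged as wrong across several critiques.

    Each critique is assumed to be the set of step indices that one
    GenPRM pass judged incorrect (a hypothetical representation).
    """
    n = len(critiques)
    # Count, per step, how many critiques flagged it; ignore bad indices.
    counts = Counter(i for c in critiques for i in c if 0 <= i < num_steps)
    if mode == "intersection":
        threshold = n  # every critique must agree
    elif mode == "majority":
        threshold = n // 2 + 1  # strictly more than half
    else:
        raise ValueError(f"unknown voting mode: {mode}")
    return {i for i, k in counts.items() if k >= threshold}


def token_rewards(
    step_spans: Sequence[Tuple[int, int]],
    wrong_steps: Set[int],
    outcome_reward: float,
    process_penalty: float,
) -> List[float]:
    """Start from the shared outcome reward, then penalize wrong-step tokens.

    step_spans holds (start, end) token ranges per step, end exclusive,
    and is assumed to cover the response contiguously.
    """
    total = step_spans[-1][1] if step_spans else 0
    rewards = [outcome_reward] * total
    for idx, (start, end) in enumerate(step_spans):
        if idx in wrong_steps:
            for t in range(start, end):
                rewards[t] -= process_penalty  # assumed penalty size
    return rewards


# Three critiques of a four-step response.
critiques = [{1, 2}, {2}, {1, 2}]
wrong = vote_wrong_steps(critiques, num_steps=4, mode="intersection")
spans = [(0, 2), (2, 5), (5, 6), (6, 8)]
rewards = token_rewards(spans, wrong, outcome_reward=1.0, process_penalty=0.5)
```

In this sketch, the stricter "intersection" mode flags only steps that every critique agrees on, while "majority" flags more steps at the risk of more false positives. Tokens outside the flagged steps keep the original rule-based outcome reward unchanged.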
