Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning
Junxi Yin, Haisen Luo, Zhenyu Li, Yihua Liu, Dan Liu, Zequn Li, Xiaohang Xu
TL;DR
ACPO tackles two intertwined challenges in RLVR: precise step-level credit assignment and managing the exploration–exploitation trade-off during long, multi-step reasoning. It introduces a two-stage framework. The first stage guides exploration by combining trajectory semantic segmentation, an information-theoretic step-attribution metric, and a dynamic entropy-regulation mechanism. The second stage focuses on convergence, using a reference policy and confidence-weighted rewards. The core contributions are a step-wise attribution credit mechanism with a lightweight mutual-information surrogate, a dynamic segmentation strategy, and a two-stage curriculum that transitions from broad exploration to targeted convergence. Empirical results on math reasoning benchmarks (AIME, AMC, MATH500) show substantial gains over GRPO-based baselines, validating the method's ability to verify intermediate steps and produce more robust, verifiable reasoning.
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This imbalance leads to inaccurate credit assignment for intermediate steps and premature entropy collapse, both of which limit model performance. To address these issues, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thereby mitigating entropy collapse. Concurrently, it enhances exploitation with a factorized reward system that quantifies the hierarchical contribution of each reasoning step, yielding more accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
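To make the idea of step-level credit assignment concrete, the following is a minimal illustrative sketch, not ACPO's actual formulation. It assumes a GRPO-style group-normalized advantage per rollout and a set of hypothetical non-negative per-step attribution scores (standing in for the attribution metric, whose definition is not given here). The function names `group_advantages` and `step_credits` and the mean-preserving weighting are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantage: standardize each rollout's verifiable reward
    against the other rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def step_credits(advantage, attributions):
    """Split one trajectory-level advantage across its reasoning steps.

    `attributions` are hypothetical non-negative per-step scores.
    Weights are scaled to have mean 1, so the average step credit
    equals the original trajectory-level advantage.
    """
    n = len(attributions)
    total = sum(attributions)
    if total == 0:
        # No attribution signal: fall back to uniform credit.
        return [advantage] * n
    return [advantage * n * a / total for a in attributions]


# Four rollouts for one prompt, with binary verifier rewards.
advantages = group_advantages([1, 0, 0, 1])

# Three segmented steps in the first rollout; the middle step is
# assumed (for illustration) to carry most of the attribution.
credits = step_credits(advantages[0], [1, 6, 3])
```

Under this weighting, a step with a higher attribution score receives a proportionally larger share of the rollout's advantage, while a rollout with no attribution signal degrades to uniform, trajectory-level credit.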
