PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
Jeongjae Lee, Jong Chul Ye
TL;DR
The paper identifies disproportionate credit assignment as a core source of instability in policy-gradient alignment of image-generation models. It introduces Proportionate Credit Policy Optimization (PCPO), which enforces proportional credit assignment across timesteps through a stable objective reformulation and a principled reweighting of diffusion and flow timesteps, stabilizing training and accelerating convergence. Empirically, PCPO outperforms strong baselines such as DanceGRPO across multiple models, prompts, and rewards, with improved fidelity, diversity, and generalization; human evaluations also favor PCPO outputs. The work additionally introduces Implicit Reward Guidance for flexible reward composition at inference time, and discusses avenues for further stabilization and broader applicability.
Abstract
While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO. Code is available at https://github.com/jaylee2000/pcpo/.
