PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

Jeongjae Lee, Jong Chul Ye

TL;DR

The paper identifies disproportionate credit assignment as a core source of instability in policy-gradient alignment of image-generation models. It introduces Proportionate Credit Policy Optimization (PCPO), which enforces uniform, Bayesian-inspired credit across timesteps by reweighting the steps of diffusion and flow samplers, stabilizing training and accelerating convergence. Empirically, PCPO outperforms strong baselines such as DanceGRPO across multiple models, prompts, and rewards, with improved fidelity, diversity, and generalization, and human evaluators prefer PCPO outputs. The work also introduces Implicit Reward Guidance for flexible reward composition at inference time and discusses avenues for further stabilization and broader applicability.

Abstract

While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO. Code is available at https://github.com/jaylee2000/pcpo/.
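
To make the idea of timestep reweighting concrete, here is a minimal sketch, not the authors' implementation. It assumes each sampler timestep contributes to the policy-gradient objective with some positive native weight, and it computes a multiplier per timestep so that every step ends up with the same effective weight. The function name `proportional_scales`, the `target` parameter, and the numeric weights are all hypothetical and chosen only for illustration. According to the Figure 2 caption, PCPO achieves its uniform weights for DDIM by computing a modified variance signal and then rescaling, which this sketch does not reproduce.

```python
def proportional_scales(native_weights, target=None):
    """Return one multiplier per timestep so that native_weight * multiplier
    equals the same target value at every timestep.

    Hypothetical helper for illustration; not taken from the PCPO codebase.
    """
    if not native_weights:
        raise ValueError("need at least one timestep weight")
    if any(w <= 0 for w in native_weights):
        raise ValueError("native weights must be positive")
    if target is None:
        # Assumption: default to the mean native weight as the common target.
        target = sum(native_weights) / len(native_weights)
    return [target / w for w in native_weights]


# Hypothetical, strongly non-uniform native weights for four timesteps.
native = [9.0, 4.5, 3.0, 1.5]
scales = proportional_scales(native, target=4.5)

# After rescaling, every timestep carries the same effective weight.
effective = [w * s for w, s in zip(native, scales)]
```

In a training loop, such multipliers would scale each timestep's loss term before summation, so that no single step dominates the gradient purely because of the sampler's schedule.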

Paper Structure

This paper contains 34 sections, 4 theorems, 31 equations, 23 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

For a DDIM sampling schedule, the log policy ratio $\log \rho_t$ is given by: where

Figures (23)

  • Figure 1: Qualitative comparison of baseline methods (top) and PCPO (bottom) on identical prompts and seeds. PCPO mitigates the model collapse seen in baselines across different frameworks. (a) DDPO (SD1.5, Aesthetics): At a matched reward level, PCPO preserves diversity and fidelity while DDPO collapses into a blurry, homogeneous style. (b) DanceGRPO (FLUX, HPSv2.1): After training for 200 epochs, PCPO achieves both a higher reward and superior image quality, avoiding artifacts observed in the baseline.
  • Figure 2: Weight rescaling by PCPO. DDIM Sampler: (a) Volatile native weights $w(t)$ (blue) are replaced with a uniform, rescaled weight (orange). (b) This is achieved by computing a new variance signal $\tilde{\sigma}_t$ that remains close to the original (olive, corresponding to $w^\star = 4.5$), then rescaling. Light blue corresponds to $w^\star = 4.0$, gray to $w^\star = 5.0$. SDE Sampler: Native (blue) and rescaled (orange) weights for (c) DanceGRPO SDE, (d) Flow-GRPO SDE.
  • Figure 3: Reward and clipping fraction traces for PCPO (orange) vs. baselines (blue): (a) DDPO, Aesthetics, (b) DDPO, BERTScore, (c) DanceGRPO (SD1.4), HPSv2.1, (d) DanceGRPO (FLUX), HPSv2.1.
  • Figure 3: LMM analysis for DDPO (Aesthetics).
  • Figure 4: LMM analysis for DanceGRPO (HPS).
  • ...and 18 more figures

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • Proposition 2
  • proof
  • Proposition 2
  • proof