
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Liyu Zhang, Kehan Li, Tingrui Han, Tao Zhao, Yuxuan Sheng, Shibo He, Chao Li

Abstract

Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first off-policy GRPO framework tailored to flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we show theoretically and empirically that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at those steps. Across image and video generation benchmarks, OP-GRPO matches or surpasses Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
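The abstract's first component, the adaptively maintained replay buffer, can be pictured with the short sketch below. This is a minimal illustration of one plausible realization, not the authors' implementation: the `ReplayBuffer` class, its capacity-bounded min-heap, and the `trajectory` payload (latents, prompt, and per-step log-probs recorded at rollout time) are all assumptions made for exposition.

```python
import heapq
import random

class ReplayBuffer:
    """Capacity-bounded buffer that keeps only the highest-reward rollouts.

    A min-heap keyed by reward makes "actively select high-quality
    trajectories" cheap: a new rollout displaces the worst stored one
    only if its reward is higher.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []      # entries: (reward, insertion_id, trajectory)
        self._next_id = 0    # tie-breaker so heapq never compares trajectories

    def add(self, reward: float, trajectory: dict) -> None:
        entry = (reward, self._next_id, trajectory)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif reward > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the worst rollout

    def sample(self, k: int) -> list:
        """Draw up to k stored rollouts to mix into the next training batch."""
        return [e[2] for e in random.sample(self._heap, min(k, len(self._heap)))]
```

A training batch would then mix fresh on-policy rollouts with buffered ones, which is precisely what makes the importance-sampling correction sketched next necessary.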

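The second and third components, the sequence-level importance-sampling correction and the late-step truncation, combine naturally in a single clipped objective. The sketch below is a hedged guess at what such a loss could look like, assuming standard GRPO group normalization of rewards; `op_grpo_loss`, `t_trunc`, and the clipping range `eps` are hypothetical names, not the paper's notation.

```python
import torch

def op_grpo_loss(
    logp_new: torch.Tensor,       # (B, T) per-step log-probs under the current policy
    logp_behavior: torch.Tensor,  # (B, T) per-step log-probs recorded at rollout time
    rewards: torch.Tensor,        # (B,)  one scalar reward per trajectory
    t_trunc: int,                 # keep only early denoising steps 0..t_trunc-1
    eps: float = 0.2,             # GRPO/PPO-style clipping range
) -> torch.Tensor:
    # Group-normalized advantage, as in standard GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level importance ratio: sum the per-step log-prob deltas over
    # the kept early steps, then exponentiate once per trajectory. Late steps
    # are dropped because their off-policy ratios are ill-conditioned.
    delta = (logp_new - logp_behavior)[:, :t_trunc].sum(dim=1)
    ratio = torch.exp(delta)

    # Clipped surrogate, applied once per sequence rather than per step.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()
```

Forming one ratio per trajectory before clipping is plausibly what the abstract means by preserving the integrity of GRPO's clipping mechanism: the clip bounds act once on the whole sequence instead of being applied, and compounded, at every denoising step. For fresh on-policy samples, `logp_behavior` equals `logp_new` at the first update, so the ratio starts at 1 and the same objective covers both replayed and on-policy rollouts.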

Paper Structure

This paper contains 18 sections, 14 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Overall framework of OP-GRPO, including (a) OP-GRPO rollout and (b) OP-GRPO training. Blue regions represent samples from the replay buffer; green regions represent samples from the dataset.
  • Figure 2: Log-probability values of on-policy and off-policy samples across denoising steps, where the dashed line indicates the truncation starting step.
  • Figure 3: Training curves of OP-GRPO and Flow-GRPO.
  • Figure 4: Visual results of OP-GRPO and Flow-GRPO on three image generation tasks using SD3.5-M.
  • Figure 5: Visual results of Buffer-based GRPO and Flow-GRPO on the OCR task with the video generation model Wan2.1-1.4B. Refer to the Appendix for more results.
  • ...and 1 more figure