From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, explores inter-sample relationships insufficiently, limiting both alignment efficacy and the performance ceiling. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
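
To make the contrast between single-view and multi-view advantage estimation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual GRPO group-relative advantage (rewards standardized by the group mean and standard deviation) and assumes that, in the multi-view case, the same standardization is applied separately under each condition. The reward values, the function name `group_relative_advantages`, and the `eps` stabilizer are hypothetical. How MV-GRPO combines the per-condition advantages in its training objective is not stated in the abstract and is not modeled here.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards along the last axis (one group of samples per row).

    Assumption: the common GRPO form (r - mean) / std; eps avoids division by zero.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)


# Hypothetical rewards for one group of 4 samples.
# Row 0: original prompt; rows 1-2: augmented captions (illustrative values only).
reward_matrix = np.array([
    [0.9, 0.5, 0.1, 0.5],
    [0.2, 0.8, 0.6, 0.4],
    [0.7, 0.3, 0.9, 0.1],
])

# Single-view: advantages under the original condition only, shape (4,).
single_view = group_relative_advantages(reward_matrix[0])

# Multi-view: one advantage vector per condition, shape (3, 4).
multi_view = group_relative_advantages(reward_matrix)

# The best-scoring sample can differ from one condition to the next.
best_per_condition = multi_view.argmax(axis=1)
```

In this toy matrix, row 0 of `multi_view` matches `single_view`, while the additional rows give each sample further advantage estimates under the augmented captions. This is the sense in which the reward mapping becomes denser.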

Paper Structure (26 sections, 24 equations, 23 figures, 9 tables, 1 algorithm)

Figures (23)

  • Figure 1: Gallery of MV-GRPO. Our MV-GRPO substantially elevates the generation quality of flow models (Flux.1-dev in this figure), particularly in terms of fine-grained details and photorealism. Prompts are listed in the supplementary material.
  • Figure 2: Reward Evaluation in GRPO Training. (a) Standard flow-based GRPO methods evaluate generated samples under the single original condition, resulting in a sparse reward mapping and insufficient exploration of inter-sample relationships. (b) Our MV-GRPO leverages an augmented set of conditions to build a dense multi-view mapping, fostering a comprehensive exploration of the relationships among samples.
  • Figure 3: Reward Ranking Varies with Conditions. Reward rankings of SDE samples across multiple semantically similar yet different conditions exhibit large variations, indicating that relying on a single condition for advantage estimation is inadequate.
  • Figure 4: Overview of MV-GRPO. MV-GRPO leverages a flexible Condition Enhancer module (a pretrained VLM or LLM) to generate diverse augmented conditions for dense multi-view reward signals, facilitating comprehensive advantage estimation.
  • Figure 5: Distribution of Probability Drift at Different SDE Steps. Most condition pairs exhibit a drift near zero, demonstrating that the SDE transition probability is effectively preserved when substituting the original with augmented conditions.
  • ...and 18 more figures