Fine-Grained GRPO for Precise Preference Alignment in Flow Models

Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai

TL;DR

This work tackles the challenge of aligning flow-based generative models with human preferences by addressing two weaknesses of prior flow-GRPO methods: sparse reward signals and fixed denoising granularity. It introduces Granular-GRPO (G$^2$RPO), which combines Singular Stochastic Sampling, which confines stochasticity to a single denoising step, with Multi-Granularity Advantage Integration (MGAI), which aggregates learning signals across multiple denoising granularities. Empirical results show that G$^2$RPO outperforms existing flow-based GRPO baselines on both in-domain and out-of-domain reward models, with MGAI contributing to more robust generalization across diverse prompts and metrics. The results suggest that finer-grained credit assignment is a practical route to better human-preference alignment in flow-based image generation.

Abstract

The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, these models can explore a variety of denoising trajectories, enhancing the exploratory capacity of RL. However, despite their ability to discover potentially high-reward samples, current approaches often struggle to effectively align with preferences due to the sparsity and narrowness of reward feedback. To overcome this limitation, we introduce a novel framework called Granular-GRPO (G$^2$RPO), which enables fine-grained and comprehensive evaluation of sampling directions in the RL training of flow models. Specifically, we propose a Singular Stochastic Sampling mechanism that supports step-wise stochastic exploration while ensuring strong correlation between injected noise and reward signals, enabling more accurate credit assignment to each SDE perturbation. Additionally, to mitigate the bias introduced by fixed-granularity denoising, we design a Multi-Granularity Advantage Integration module that aggregates advantages computed across multiple diffusion scales, resulting in a more robust and holistic assessment of sampling trajectories. Extensive experiments on various reward models, including both in-domain and out-of-domain settings, demonstrate that our G$^2$RPO outperforms existing flow-based GRPO baselines, highlighting its effectiveness and generalization capability.
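
As a rough illustration of the Singular Stochastic Sampling idea described above, the toy sketch below keeps every denoising step deterministic except one, where each group member receives its own Gaussian perturbation. It is a sketch under stated assumptions, not the authors' implementation: the 1-D state, the `velocity` function, the additive noise form, the `noise_scale` parameter, and the convention that time runs from 1 (noise) to 0 (sample) are all hypothetical choices made for illustration, and the paper's actual SDE formulation is not reproduced here.

```python
import math
import random


def velocity(x, t):
    """Hypothetical toy stand-in for a learned flow velocity network."""
    return x


def euler_ode(x, t_start, t_end, num_steps):
    """Deterministic Euler integration of dx/dt = velocity(x, t)."""
    if num_steps == 0:
        return x
    dt = (t_end - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        x = x + velocity(x, t) * dt
        t = t + dt
    return x


def singular_stochastic_group(x_noise, sde_step, total_steps, group_size,
                              noise_scale=0.5, seed=0):
    """Branch a group of trajectories at exactly one stochastic step.

    Assumption: time runs from t=1 (noise) to t=0 (sample) on a uniform grid,
    and the stochastic step is an Euler step plus additive Gaussian noise.
    """
    rng = random.Random(seed)
    dt = -1.0 / total_steps
    t_k = 1.0 + sde_step * dt

    # Shared deterministic prefix: every group member reaches the same x_k.
    x_k = euler_ode(x_noise, 1.0, t_k, sde_step)

    samples = []
    for _ in range(group_size):
        # The only stochastic step: one independent perturbation per member.
        eps = rng.gauss(0.0, 1.0)
        x_next = x_k + velocity(x_k, t_k) * dt + noise_scale * math.sqrt(-dt) * eps
        # Deterministic suffix: remaining steps use plain ODE integration.
        remaining = total_steps - sde_step - 1
        samples.append(euler_ode(x_next, t_k + dt, 0.0, remaining))
    return x_k, samples


if __name__ == "__main__":
    x_k, group = singular_stochastic_group(1.0, sde_step=2, total_steps=8, group_size=4)
    print("shared state before the stochastic step:", x_k)
    print("final samples:", group)
```

Because the group members differ only in the noise injected at `sde_step`, any difference in their final rewards can be attributed to that single perturbation, which is the credit-assignment property the abstract emphasizes.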

Paper Structure

This paper contains 20 sections, 14 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of G$^2$RPO. Given a text prompt and an initial noise, our Singular Stochastic Sampling strategy employs SDE sampling solely at a single step and samples a group of distinct denoising directions. Then, the Multi-Granularity Advantage Integration module executes multi-granularity ODE denoising for each direction and integrates the advantages to produce a comprehensive evaluation for each sampling direction. For simplicity, the figure shows one coarse-grained path (denoted as $c$) and one fine-grained path (denoted as $f$).
  • Figure 2: Visual Comparison of Images Denoised at Different Granularities. Images denoised at different granularities exhibit variations in fine details and textures, leading to inconsistent scoring by the Reward Model (HPS-v2.1). This observation reveals the insufficiency of a single-granularity evaluation of group advantage.
  • Figure 3: Qualitative Results. Comparison with existing flow-based GRPO methods, in which our G$^2$RPO demonstrates superior performance in human preference alignment.
  • Figure 4: Visual ablation study of the MGAI module.
  • Figure 5: Qualitative comparison with existing GRPO methods (1/3).
  • ...and 6 more figures
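
The captions of Figures 1 and 2 describe scoring each sampling direction at several denoising granularities and integrating the resulting advantages. A minimal sketch of that bookkeeping follows. The per-group normalization is the standard GRPO form (reward minus group mean, divided by group standard deviation); the plain average across granularities, the function names, and the example reward values are assumptions made for illustration, since the exact integration rule is not given in this text.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def integrate_advantages(rewards_by_granularity):
    """Combine advantages computed at several denoising granularities.

    rewards_by_granularity[g][i] is the reward of sampling direction i when
    its trajectory is completed at granularity g.
    Assumption: granularities are combined with an unweighted mean.
    """
    per_granularity = [group_advantages(r) for r in rewards_by_granularity]
    group_size = len(per_granularity[0])
    return [mean(adv[i] for adv in per_granularity) for i in range(group_size)]


if __name__ == "__main__":
    # Made-up rewards for 4 directions at a coarse and a fine granularity.
    coarse = [0.20, 0.35, 0.30, 0.15]
    fine = [0.25, 0.30, 0.40, 0.05]
    print(integrate_advantages([coarse, fine]))
```

A direction that scores well at only one granularity ends up with a smaller integrated advantage than one that scores well at every granularity, which is one way to reduce the inconsistency between granularities that Figure 2 points out.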