Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang

TL;DR

Problem: existing methods for aligning diffusion models with human preferences rely on expensive gradient computation through multistep denoising and on offline reward tuning. Approach: Direct-Align uses noise-prior interpolation to enable optimization at early timesteps, and SRPO applies text-conditioned rewards via prompt augmentation with inversion-based regularization. Contributions: a unified online-RL framework that achieves state-of-the-art realism and aesthetics on HPDv2, converging in about 10 minutes on 32 GPUs, together with robust reward-bias mitigation and training-efficiency gains (up to ~75× vs. DanceGRPO). Impact: enables fine-grained, data-efficient alignment of large-scale diffusion outputs to nuanced human preferences without heavy offline reward tuning, with broad applicability to diffusion-guided image synthesis.

Abstract

Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable rewards. However, these methods face two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive and restricts optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models to achieve the desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any timestep via interpolation. It leverages the fact that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
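
The interpolation idea behind Direct-Align can be illustrated with a minimal Python sketch. It assumes the rectified-flow convention x_t = (1 - sigma) * x0 + sigma * eps, which this page does not state explicitly, and it omits the diffusion model entirely: it only shows that when the injected noise eps is known in advance, the clean image can be recovered algebraically even from a heavily noised state. The function names `add_noise` and `recover_image` are hypothetical and do not come from the authors' code.

```python
import random

def add_noise(x0, eps, sigma):
    """Forward interpolation (assumed convention): x_t = (1 - sigma) * x0 + sigma * eps."""
    return [(1.0 - sigma) * a + sigma * e for a, e in zip(x0, eps)]

def recover_image(x_t, eps, sigma):
    """Invert the interpolation using the known, predefined noise eps."""
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1); at sigma = 1 no image signal remains")
    return [(v - sigma * e) / (1.0 - sigma) for v, e in zip(x_t, eps)]

random.seed(0)
x0 = [random.uniform(-1.0, 1.0) for _ in range(16)]  # toy stand-in for an image
eps = [random.gauss(0.0, 1.0) for _ in x0]           # predefined Gaussian noise prior
sigma = 0.95                                         # state with 95% noise

x_t = add_noise(x0, eps, sigma)
recovered = recover_image(x_t, eps, sigma)

# Largest absolute reconstruction error; only floating-point round-off remains.
max_err = max(abs(a - b) for a, b in zip(x0, recovered))
print(max_err)
```

In the method described here, part of the recovery comes from the model's prediction rather than from the known noise alone (the Figure 3 caption mentions a weight on the model prediction term), so this sketch should be read as the underlying identity, not as the training procedure.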

Paper Structure

This paper contains 16 sections, 8 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Images generated by FLUX.1-dev fine-tuned with our Semantic Relative Preference Optimization (SRPO). Our method substantially improves upon the baseline model, achieving superior photorealism and enhanced fine-grained detail while maintaining remarkable training efficiency, converging in just 10 minutes using 32 NVIDIA H20 GPUs.
  • Figure 2: Method Overview. SRPO contains two key elements: Direct-Align and a single reward model that derives both rewards and penalties from positive and negative prompts. The Direct-Align pipeline consists of four stages: (0) generate or load an image for training; (1) inject noise into the image; (2) perform one-step denoising or inversion; (3) recover the image.
  • Figure 3: Comparison of one-step prediction at early timesteps. The values 0.075 and 0.025 denote the weight of the model prediction term used by the respective methods. The earliest 5% represents a state with 95% noise at an unshifted timestep. By constructing a Gaussian prior, our one-step sampling method achieves high-quality results at early timesteps, even when the input image is highly noised.
  • Figure 4: Comparison of human evaluation results for Vanilla FLUX, ReFL, DRaFT_LV, DanceGRPO, Direct-Align, and SRPO on the criteria of Realism, Aesthetics, and Overall Preference. SRPO demonstrates significant improvements in Aesthetics and achieves a substantial reduction in AIGC artifacts.
  • Figure 5: Qualitative comparison of FLUX, DanceGRPO, and SRPO with the same seed. Our approach demonstrates superior performance in realism and detail complexity.
  • ...and 9 more figures
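
The Figure 2 caption describes a single reward model that derives both rewards and penalties from positive and negative prompts. One way to picture this is the toy Python sketch below. It assumes the reward is a cosine similarity between an image embedding and a text embedding, and that the relative signal is the positive-prompt score minus the negative-prompt score. Neither assumption is stated on this page, and the names `cosine` and `relative_reward` and the two-dimensional embeddings are hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Assumed form: score under the positive prompt minus score under the negative prompt."""
    return cosine(image_emb, pos_text_emb) - cosine(image_emb, neg_text_emb)

# Toy embeddings: the image aligns with the positive prompt
# and is orthogonal to the negative one.
image_emb = [1.0, 0.0]
pos_text_emb = [1.0, 0.0]  # e.g. the prompt augmented with a desired attribute
neg_text_emb = [0.0, 1.0]  # e.g. the prompt augmented with an undesired attribute

reward = relative_reward(image_emb, pos_text_emb, neg_text_emb)
swapped = relative_reward(image_emb, neg_text_emb, pos_text_emb)
print(reward, swapped)
```

Taking a difference means that any score component shared by both prompts cancels, which is one plausible reading of how a relative, text-conditioned signal could reduce reward bias. The exact formulation used by SRPO should be taken from the paper itself.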