D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples

Zijing Hu, Fengda Zhang, Kun Kuang

TL;DR

D-Fusion targets the misalignment between diffusion-generated images and their prompts. It constructs visually consistent samples that keep their denoising trajectories, so they can be used directly for direct preference optimization (DPO) and other RL fine-tuning. It does this in two steps: cross-attention mask extraction locates the regions relevant to alignment, and self-attention fusion steers denoising toward a well-aligned result while keeping the image consistent with the base image. The method improves prompt-image alignment across multiple RL fine-tuning approaches and generalizes to unseen prompts, which points to the importance of data consistency when aligning diffusion models with RL.

Abstract

The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
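
To make concrete why the denoising trajectories matter, the following is a minimal sketch of the generic DPO objective for one preference pair. It is not necessarily the exact loss used in the paper. The function name `dpo_loss`, its parameter names, and the default `beta` are assumptions made for illustration. For a diffusion model, each log-probability would typically be accumulated over the denoising steps of a stored trajectory, which is why an image without a trajectory (for example, a manually edited one) cannot be plugged in.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: log-probabilities of the preferred (well-aligned) and
    dispreferred (poorly-aligned) samples under the model being fine-tuned.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # How much more the fine-tuned model favors the preferred sample than the
    # reference model does, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the fine-tuned model and the reference model agree, the margin is zero and the loss equals log 2. The loss falls as the model shifts probability toward the preferred sample relative to the reference.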

Paper Structure

This paper contains 26 sections, 9 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: (Misalignment) Diffusion models (e.g., Stable Diffusion (SD); Rombach et al., 2022) often encounter the issue that the generated images do not accurately match the given prompts. Existing RL-based fine-tuning methods (e.g., DPO; Wallace et al., 2023) have limited effectiveness in improving the alignment. For each set of images above, we use the same seed for sampling.
  • Figure 2: (Visual Inconsistency) When diffusion models are trained with direct preference optimization (DPO), the visual disparity between well-aligned and poorly-aligned images is enormous. This visual inconsistency limits the success of DPO in enhancing diffusion models. Meanwhile, the visually consistent samples obtained through manual editing lack denoising trajectories and are not suitable for RL training. To this end, we introduce D-Fusion, which constructs RL-trainable visually consistent samples.
  • Figure 3: (Method Overview) We propose D-Fusion to construct RL-trainable visually consistent samples. (a) Each layer of a U-Net-based diffusion model contains several transformer attention blocks, and each block contains a self-attention module and a cross-attention module. (b) D-Fusion constructs visually consistent samples through two steps: cross-attention mask extraction and self-attention fusion. (c) Examples of visually consistent samples. Each set consists of three images: the reference image, the base image, and the target image. The target images are not only as well-aligned as the reference images but also maintain visual consistency with the base images.
  • Figure 4: (Qualitative Results) Examples of images generated by the original model and the fine-tuned models on three templates. For each set of images, we use the same random seed. For both training prompts (top) and test prompts (bottom), the models fine-tuned by DPO+D-Fusion achieve better prompt-image alignment than the original model and the models fine-tuned by naive DPO.
  • Figure 5: (Alignment) Alignment curves of the diffusion models fine-tuned with or without D-Fusion on three prompt templates. Results show that training with D-Fusion can enhance the alignment of diffusion models to a greater extent.
  • ...and 10 more figures
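
The two steps named in the Figure 3 caption, cross-attention mask extraction and self-attention fusion, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The function names, tensor shapes, the head-averaging and thresholding rule for the mask, and the blending rule for fusion are all hypothetical choices; the paper's exact formulation may differ.

```python
import numpy as np

def extract_mask(cross_attn, token_idx, threshold=0.5):
    """Binary mask of pixels that attend to one prompt token.

    cross_attn: (heads, pixels, tokens) cross-attention weights.
    Averaging over heads and thresholding at 0.5 are assumptions.
    """
    amap = cross_attn[:, :, token_idx].mean(axis=0)    # average over heads
    span = amap.max() - amap.min()
    amap = (amap - amap.min()) / (span + 1e-8)         # rescale to [0, 1]
    return (amap >= threshold).astype(np.float32)      # shape: (pixels,)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Plain scaled dot-product attention; q, k, v have shape (pixels, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fuse_self_attention(q_base, k_base, v_base, k_ref, v_ref, mask):
    """Assumed fusion rule: inside the mask, base queries attend to the
    reference image's keys and values; outside it, they attend to the base
    image's own keys and values."""
    out_base = self_attention(q_base, k_base, v_base)
    out_ref = self_attention(q_base, k_ref, v_ref)
    m = mask[:, None]                                  # broadcast over feature dim
    return m * out_ref + (1.0 - m) * out_base
```

With an all-zero mask the fused output reduces to ordinary self-attention on the base image, so only the masked region is pulled toward the well-aligned reference. That matches the stated goal of keeping the target image visually consistent with the base image.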