D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples
Zijing Hu, Fengda Zhang, Kun Kuang
TL;DR
D-Fusion addresses the misalignment between diffusion-generated images and their prompts by constructing visually consistent samples that retain their denoising trajectories and can therefore be used for RL fine-tuning such as direct preference optimization. It does this in two steps: cross-attention mask extraction locates the regions relevant to alignment, and self-attention fusion steers denoising toward alignment while keeping the result consistent with the base image. The method improves prompt-image alignment across multiple RL fine-tuning approaches and generalizes to unseen prompts, highlighting the importance of data consistency for RL-based alignment of diffusion models.
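The two steps named above can be illustrated with a toy sketch. It is not the authors' implementation: the function names `extract_mask` and `fuse`, the min-max normalization, and the 0.5 threshold are all assumptions made for illustration. Actual self-attention fusion operates inside the model's self-attention layers during denoising, whereas this sketch simply blends two small feature grids under a binary mask.

```python
def extract_mask(cross_attn, threshold=0.5):
    """Binarize a cross-attention map (2D list) after min-max normalization.

    Hypothetical helper: the threshold and normalization are assumptions.
    """
    flat = [v for row in cross_attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[1 if (v - lo) / span >= threshold else 0 for v in row]
            for row in cross_attn]


def fuse(base, reference, mask):
    """Take reference values where mask == 1 and base values elsewhere."""
    return [[r if m else b for b, r, m in zip(b_row, r_row, m_row)]
            for b_row, r_row, m_row in zip(base, reference, mask)]


# Toy 2x2 example: the bottom row receives high attention for the prompt token.
cross_attn = [[0.1, 0.2], [0.9, 0.8]]
base = [[1.0, 1.0], [1.0, 1.0]]        # stands in for the poorly aligned image
reference = [[5.0, 5.0], [5.0, 5.0]]   # stands in for the well-aligned image

mask = extract_mask(cross_attn)
fused = fuse(base, reference, mask)
print(mask)
print(fused)
```

Outside the mask the output equals the base, which is the sense in which the fused sample stays visually consistent with the poorly aligned image.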
Abstract
The practical applications of diffusion models have been limited by the misalignment between generated images and their corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to improve the alignment of these models. However, the effectiveness of DPO is constrained by visual inconsistency: the large visual disparity between well-aligned and poorly aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method for constructing DPO-trainable, visually consistent samples. First, by performing mask-guided self-attention fusion, D-Fusion produces images that are not only well aligned but also visually consistent with the given poorly aligned images. Second, D-Fusion retains the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
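To make the role of denoising trajectories concrete, the following is a minimal sketch of the standard DPO objective. It assumes that a sample's log-likelihood is the sum of per-step log-probabilities along its denoising trajectory. The helper names `trajectory_logp` and `dpo_loss`, the value of `beta`, and the numbers are illustrative assumptions; the paper's exact formulation may differ.

```python
import math


def trajectory_logp(step_logps):
    """Sum per-step log-probabilities along one denoising trajectory (assumed form)."""
    return sum(step_logps)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (winner gain - loser gain)).

    Each gain is the policy log-probability minus the reference log-probability.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin), computed stably for either sign
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative numbers only: "w" is the preferred (well-aligned) sample.
logp_w = trajectory_logp([-1.0, -1.0])
logp_l = trajectory_logp([-1.5, -1.5])
ref_logp_w = trajectory_logp([-1.25, -1.25])
ref_logp_l = trajectory_logp([-1.25, -1.25])

loss = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)
print(loss)
```

Under this assumption the per-step log-probabilities can only be evaluated if the trajectory that produced each image is available, which is why a constructed sample without a trajectory could not be used directly for training.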
