SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization

Liang Peng; Boxi Wu; Haoran Cheng; Yibo Zhao; Xiaofei He

TL;DR

Fine-tuning text-to-image diffusion models with a pixel-level MSE loss can neglect global image quality. SUDO adds image-level learning through self-supervised direct preference optimization: it generates its own losing samples and optimizes a combined objective, improving both alignment and perceptual quality. With SD1.5 and SDXL on Pick-a-Pic V2, SUDO outperforms SFT and often rivals DPO on multiple metrics, while remaining model-agnostic and annotation-free. The approach reduces labeling costs and improves global coherence, making diffusion-based T2I systems more reliable and scalable.

Abstract

Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes a pixel-level mean squared error (MSE) loss and neglects global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into fine-tuning, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the costly data collection and annotation typically associated with traditional direct preference optimization methods. Through extensive experiments on widely used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The code is available at https://github.com/SPengLiang/SUDO.
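
To make the idea of a combined objective concrete, the following is a minimal sketch of how a pixel-level MSE term and a preference term could be added together. It is an assumption, not the authors' implementation: the preference term follows the general Diffusion-DPO form (denoising errors compared against a frozen reference model), and the function names, `beta`, and `lam` are hypothetical. How SUDO actually produces the losing sample is not described in this abstract, so its denoising errors are simply taken as inputs.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def preference_loss(err_w: float, err_w_ref: float,
                    err_l: float, err_l_ref: float,
                    beta: float = 1.0) -> float:
    """Diffusion-DPO-style preference term (assumed form, not from the abstract).

    err_w / err_l: denoising MSE of the trainable model on the winning / losing image.
    err_w_ref / err_l_ref: the same errors under a frozen reference model.
    """
    # Positive margin: relative to the reference, the model has improved
    # more on the winning image than on the losing one.
    margin = (err_l - err_l_ref) - (err_w - err_w_ref)
    # -log(sigmoid(beta * margin)) written as softplus(-beta * margin).
    return softplus(-beta * margin)


def combined_loss(err_w: float, err_w_ref: float,
                  err_l: float, err_l_ref: float,
                  beta: float = 1.0, lam: float = 1.0) -> float:
    """Pixel-level MSE on the winning image plus a weighted preference term.

    lam is a hypothetical weighting factor between the two terms.
    """
    return err_w + lam * preference_loss(err_w, err_w_ref, err_l, err_l_ref, beta)
```

When the trainable model equals the reference, the margin is zero and the preference term is log 2; it shrinks as the model fits the winning image better than the losing one relative to the reference.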
