SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization

Liang Peng, Boxi Wu, Haoran Cheng, Yibo Zhao, Xiaofei He

TL;DR

Pixel-focused fine-tuning via MSE can neglect global image quality in text-to-image diffusion models. SUDO adds image-level learning through self-supervised direct preference optimization: it generates its own losing samples and optimizes a combined objective, improving both alignment and perceptual quality. With SD1.5 and SDXL on Pick-a-Pic V2, SUDO outperforms SFT and often rivals DPO on multiple metrics, while remaining model-agnostic and annotation-free. The approach reduces labeling costs and improves global coherence.

Abstract

Previous text-to-image diffusion models typically employ supervised fine-tuning (SFT) to enhance pre-trained base models. However, this approach primarily minimizes a pixel-level mean squared error (MSE) loss and neglects global optimization at the image level, which is crucial for achieving high perceptual quality and structural coherence. In this paper, we introduce Self-sUpervised Direct preference Optimization (SUDO), a novel paradigm that optimizes both fine-grained details at the pixel level and global image quality. By integrating direct preference optimization into the model, SUDO generates preference image pairs in a self-supervised manner, enabling the model to prioritize global-level learning while complementing the pixel-level MSE loss. As an effective alternative to supervised fine-tuning, SUDO can be seamlessly applied to any text-to-image diffusion model. Importantly, it eliminates the costly data collection and annotation typically associated with traditional direct preference optimization methods. Through extensive experiments on widely used models, including Stable Diffusion 1.5 and XL, we demonstrate that SUDO significantly enhances both global and local image quality. The code is available at https://github.com/SPengLiang/SUDO.
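
To make the idea of a combined objective concrete, here is a minimal sketch of how a pixel-level MSE term and an image-level preference term could be added together. It is an illustration under stated assumptions, not the authors' implementation: the preference term follows the general Diffusion-DPO form (comparing the model's denoising error against a frozen reference on a winning and a losing image), and the function name `combined_loss` and the parameters `beta` and `weight` are hypothetical. How SUDO actually produces its losing images is not specified here, so the sketch simply takes per-sample denoising errors as inputs.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def combined_loss(err_w_model, err_w_ref, err_l_model, err_l_ref,
                  beta=1.0, weight=1.0):
    """Hypothetical combined objective: MSE on the winning image plus a
    DPO-style preference term.

    Each argument is a per-sample squared denoising error:
      err_w_* : error on the winning (dataset) image
      err_l_* : error on the losing (self-generated) image
      *_model : error of the model being fine-tuned
      *_ref   : error of the frozen reference model
    beta and weight are assumed hyperparameters, not values from the paper.
    """
    err_w_model = np.asarray(err_w_model, dtype=float)
    err_w_ref = np.asarray(err_w_ref, dtype=float)
    err_l_model = np.asarray(err_l_model, dtype=float)
    err_l_ref = np.asarray(err_l_ref, dtype=float)

    # Pixel-level term: ordinary SFT-style MSE on the winning image.
    mse_term = err_w_model.mean()

    # Image-level term: the margin is negative when the model improves on
    # the winner and gets worse on the loser, relative to the reference.
    margin = (err_w_model - err_w_ref) - (err_l_model - err_l_ref)
    pref_term = -log_sigmoid(-beta * margin).mean()

    return mse_term + weight * pref_term, mse_term, pref_term

# Example: the model denoises the winner better and the loser worse
# than the reference does, so the preference term falls below log(2).
total, mse, pref = combined_loss([0.3], [0.5], [0.9], [0.7])
```

When the fine-tuned model matches the reference exactly, the margin is zero and the preference term equals log 2; it shrinks as the model favors the winning image over the losing one.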

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Pixel-level and image-level optimization. In the conventional fine-tuning process, previous diffusion methods typically focus on pixel-level MSE loss. We enhance text-to-image diffusion models by incorporating image-level learning, achieved through self-supervised direct preference optimization.
  • Figure 1: Extra qualitative comparisons with the SD1.5 base model.
  • Figure 2: We develop SUDO, a method for fine-tuning text-to-image diffusion models. It incorporates direct preference optimization in a self-supervised manner. We provide the results generated by the fine-tuned SDXL model with our method. Best viewed in color.
  • Figure 2: Extra qualitative comparisons with the SDXL base model.
  • Figure 3: Training processes of SFT, DPO, and SUDO. SUDO generates losing images in a self-supervised manner, so that direct preference optimization can be performed on a regular dataset. Best viewed in color and zoomed in.
  • ...and 2 more figures