Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao

TL;DR

This work tackles text–image alignment for diffusion-based T2I models without relying on human preference image data. It introduces Text Preference Optimization (TPO), which uses LLM-generated mismatched prompts to form text-level preference pairs and fine-tunes diffusion models via TDPO and TKTO, extensions of the Diffusion‑DPO and Diffusion‑KTO approaches. Across multiple benchmarks, the proposed methods outperform their original counterparts on human preference scores and text-to-image alignment, without paired image preference annotations. The framework is compatible with existing preference-based algorithms, and the code is open source.

Abstract

Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning from human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
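
To make the "prefer matched prompts over mismatched prompts" idea concrete, the sketch below shows one way a DPO-style objective could be written when the image is fixed and only the prompt varies. It is an illustration under stated assumptions, not the TDPO loss from the paper: the function name `text_preference_loss`, the use of scalar denoising errors, and the default `beta` are all hypothetical. The errors stand for the noise-prediction error of the trained model and of a frozen reference model on the same noised image, conditioned on either the original caption or its LLM-perturbed version.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def text_preference_loss(
    err_matched: float,         # trained model error with the original caption
    err_mismatched: float,      # trained model error with the perturbed caption
    ref_err_matched: float,     # reference model error with the original caption
    ref_err_mismatched: float,  # reference model error with the perturbed caption
    beta: float = 1.0,          # hypothetical strength parameter (assumption)
) -> float:
    """Illustrative DPO-style loss over a (matched, mismatched) prompt pair.

    Assumption: lower denoising error means the model finds the image more
    likely under that prompt, as in Diffusion-DPO-style objectives.
    """
    # Improvement of the trained model over the reference under each prompt.
    gain_matched = ref_err_matched - err_matched
    gain_mismatched = ref_err_mismatched - err_mismatched
    # Positive margin: the model has moved toward the matched prompt
    # more than toward the mismatched one.
    margin = beta * (gain_matched - gain_mismatched)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the trained model equals the reference, the margin is zero and the loss is log 2; the loss falls as the model fits the image better under the matched prompt than under the mismatched one.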

Paper Structure

This paper contains 29 sections, 15 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Image generated by our aligned Stable Diffusion 1.5 model. Notably, our model is trained on "free lunch" text preference data and does not require access to human preference data.
  • Figure 2: Overview of our Text Preference Optimization (TPO) alignment framework versus the standard Diffusion‐DPO/KTO pipeline. (Top) We leverage LLMs to perform prompt editing under four principles (content, attribute, spatial, contextual), automatically generating mismatched prompts to form winning/losing text pairs. These prompt pairs are then used to align the diffusion model via our TDPO and TKTO variants in a "free lunch" manner. (Bottom) In contrast, existing Diffusion‐DPO/KTO methods rely on costly human-annotated image preference pairs.
  • Figure 3: An example of how the four modification principles (content, attribute, spatial, contextual) are applied to a given image-prompt pair.
  • Figure 4: Fine-tuning setup.
  • Figure 5: Side-by-side grid comparison of images generated by our methods and the baselines. The leftmost column shows the prompts used to generate the images. The key concept or element of each prompt that a model captures or fails to capture is printed in orange.
  • ...and 7 more figures