Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao
TL;DR
This work tackles text–image alignment for diffusion-based T2I models by removing the need for human preference image data. It introduces Text Preference Optimization (TPO), which uses LLM-generated mismatched prompts to create text-level preference pairs and tunes diffusion models via TDPO and TKTO, building on and generalizing prior Diffusion‑DPO/KTO approaches. Across multiple benchmarks, the proposed methods achieve state-of-the-art or competitive human-preference alignment without image-annotation data, demonstrating strong transferability and scalability. The approach is model-agnostic, integrates with existing RLHF pipelines, and is supported by open-source code to enable broad adoption and extension to other modalities.
Abstract
Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning from human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
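To make the idea of preferring matched over mismatched prompts concrete, the following is a minimal sketch of a DPO-style loss in which the preference pair is formed at the text level for a single image. It is an illustration only and not necessarily the TDPO objective used in this work: the function name `text_preference_loss`, the argument names, the scalar denoising errors, and the default `beta` are all assumptions introduced here, modeled on the general form of Diffusion-DPO.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def text_preference_loss(
    err_model_matched: float,
    err_ref_matched: float,
    err_model_mismatched: float,
    err_ref_mismatched: float,
    beta: float = 1.0,  # hypothetical preference strength
) -> float:
    """DPO-style loss over a (matched prompt, mismatched prompt) pair.

    Each err_* argument is assumed to be a denoising error for the same
    noisy image, conditioned on either the matched or the mismatched
    prompt, under the trainable model or a frozen reference model.
    """
    # How much the trainable model improves on the reference per prompt.
    matched_gap = err_model_matched - err_ref_matched
    mismatched_gap = err_model_mismatched - err_ref_mismatched
    # The loss decreases when the model denoises better under the matched
    # prompt and worse under the mismatched prompt, relative to the reference.
    return -log_sigmoid(-beta * (matched_gap - mismatched_gap))


# Identical model and reference errors give the neutral value log(2).
neutral = text_preference_loss(0.5, 0.5, 0.5, 0.5)
# Favoring the matched prompt lowers the loss below that neutral value.
favoring_matched = text_preference_loss(0.3, 0.5, 0.7, 0.5)
```

In this sketch the image is held fixed and only the conditioning text changes, which is what removes the need for paired preference images; in practice the errors would come from a diffusion model's noise-prediction loss rather than hand-picked scalars.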
