Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
Xun Wu, Shaohan Huang, Furu Wei
TL;DR
To address the high cost and limited diversity of human preference data for text-to-image alignment, this paper introduces VisionPrefer, a large-scale AI-generated dataset of fine-grained preferences collected via multimodal LLM annotators across four aspects. A reward model, VP-Score, trained on VisionPrefer, closely tracks human preferences and enables RL-based fine-tuning (PPO and DPO) that improves text-image alignment and generalization across compositional prompts. The work demonstrates that AI-synthesized supervisory signals can match or outperform human-annotated data on several benchmarks, and discusses cost-efficiency and future directions for leveraging AI feedback in vision-language alignment. The paper also analyzes the impact of annotator choice, prompting strategies, and four-aspect supervision on downstream performance.
Abstract
Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or lack diversity in preference dimensions, which limits their applicability for instruction tuning in open-source text-to-image generative models and hinders further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. To construct VisionPrefer, we aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness. To validate the effectiveness of VisionPrefer, we train a reward model, VP-Score, on VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we fine-tune generative models with two reinforcement learning methods to evaluate the performance of VisionPrefer. Extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, our results with VisionPrefer indicate that integrating AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.
