Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Xun Wu; Shaohan Huang; Furu Wei

TL;DR

To address the high cost and limited diversity of human preference data for text-to-image alignment, this paper introduces VisionPrefer, a large-scale AI-generated dataset of fine-grained preferences collected from multimodal LLM annotators across four aspects. A reward model, VP-Score, trained on VisionPrefer closely tracks human preferences and enables RL-based fine-tuning (PPO and DPO) that improves text-image alignment and generalization across compositional prompts. The work demonstrates that AI-synthesized supervisory signals can match or outperform human-annotated data on several benchmarks, and discusses cost-efficiency and future directions for leveraging AI feedback in vision-language alignment. The paper also analyzes the impact of annotator choice, prompting strategies, and four-aspect supervision on downstream performance.

Abstract

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or lack diversity in preference dimensions, which limits their applicability for instruction tuning of open-source text-to-image generative models and hinders further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality, fine-grained preference dataset that captures multiple preference aspects. To construct VisionPrefer, we aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness. To validate the effectiveness of VisionPrefer, we train a reward model, VP-Score, on VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we use two reinforcement learning methods to fine-tune generative models and evaluate the performance of VisionPrefer. Extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that integrating AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.
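As a rough illustration of how per-aspect annotations could feed a reward model, the sketch below turns four aspect scores per image into a pairwise preference and evaluates a Bradley-Terry style pairwise loss. Everything here is an assumption made for illustration: the abstract does not specify the score scale, the aggregation rule (a plain average is used here), the loss, or any of the function names.

```python
import math

# The four aspects named in the abstract; the key names are hypothetical.
ASPECTS = ("prompt_following", "aesthetic", "fidelity", "harmlessness")


def aggregate(scores):
    """Average the four per-aspect scores (an assumed aggregation rule)."""
    return sum(scores[a] for a in ASPECTS) / len(ASPECTS)


def to_preference_pair(scores_a, scores_b):
    """Return (winner, loser) as labels 'a'/'b', or None when the images tie."""
    sa, sb = aggregate(scores_a), aggregate(scores_b)
    if sa == sb:
        return None
    return ("a", "b") if sa > sb else ("b", "a")


def pairwise_loss(reward_winner, reward_loser):
    """Bradley-Terry style loss: -log(sigmoid(r_winner - r_loser))."""
    diff = reward_winner - reward_loser
    # log(1 + exp(-diff)) equals -log(sigmoid(diff)); fine for moderate diffs.
    return math.log1p(math.exp(-diff))


def preference_accuracy(pairs):
    """Fraction of (reward_winner, reward_loser) pairs the model ranks correctly."""
    return sum(1 for rw, rl in pairs if rw > rl) / len(pairs)


# Hypothetical annotator scores for two images generated from the same prompt.
image_a = {"prompt_following": 5, "aesthetic": 4, "fidelity": 4, "harmlessness": 5}
image_b = {"prompt_following": 3, "aesthetic": 4, "fidelity": 2, "harmlessness": 5}
pair = to_preference_pair(image_a, image_b)
```

The loss shrinks as the reward model scores the preferred image further above the rejected one, and `preference_accuracy` mirrors the kind of preference prediction accuracy the abstract reports for VP-Score, under the same assumptions.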

Paper Structure (33 sections, 1 equation, 23 figures, 12 tables)

Introduction
Related Work
Text-to-Image Generative Models Alignment
Reinforcement Learning from AI Feedback
VisionPrefer
Experiments
Reward Modeling
Boosting Generative Models
Ablation Study
Analysis
Which MLLM is the Best Annotator?
Encouraging GPT-4 Vision for Enhanced Annotations.
Fine-Grained Feedback Leads to Better Results.
Conclusion
Additional Main Results
...and 18 more sections

Figures (23)

Figure 1: Fine-grained feedback from multimodal large language models helps yield more human-preferred images. Left: Output generated by the baseline text-to-image generative model. Right: Output generated by the baseline model optimized on our preference dataset VisionPrefer. We illustrate improvements in generation quality across four aspects: Prompt-Following, Aesthetic, Fidelity, and Harmlessness. See Appendix for more examples.
Figure 1: Win rates of the generative model optimized with VP-Score compared to generative models optimized with other reward models on three test benchmarks. VP-Score shows competitive performance.
Figure 2: VisionPrefer construction pipeline. We sample textual prompts and text-to-image generative models from pools to guarantee the diversity of comparison data, then query AI annotators (GPT-4 Vision) with detailed illustrations for fine-grained and high-quality annotations in both textual and numerical formats.
Figure 2: Qualitative comparison between the text-to-image generative model optimized with the guidance of VP-Score and those optimized with other reward models. SD 1.5 denotes the Stable Diffusion v1.5 model without any fine-tuning.
Figure 3: Performance across multiple reward models during the PPO training process. All scores are normalized for better visualization.
...and 18 more figures
