Learning Multi-dimensional Human Preference for Text-to-Image Generation

Sixian Zhang; Bohan Wang; Junqiang Wu; Yan Li; Tingting Gao; Di Zhang; Zhongyuan Wang

TL;DR

This work addresses the mismatch between traditional statistical metrics and human preferences in text-to-image generation by introducing the Multi-dimensional Human Preference (MHP) dataset and the Multi-dimensional Preference Score (MPS). MHP provides large-scale, category-balanced prompts, images generated by diverse models, and human annotations along four dimensions: aesthetics, detail quality, semantic alignment, and overall assessment. The MPS model builds on CLIP, adding a condition mask and cross-attention so that it predicts a score conditioned on the chosen preference dimension, and it is trained with a KL-divergence loss against human judgments. Across three datasets, MPS achieves state-of-the-art performance in both overall and multi-dimensional preference prediction, and the accompanying MPS benchmark offers a standardized, multi-faceted framework for evaluating text-to-image models and steering them toward human-aligned outputs.

Abstract

Current evaluations of text-to-image models typically rely on statistical metrics, which inadequately represent real human preferences. Although recent work attempts to learn these preferences from human-annotated images, it reduces the rich tapestry of human preference to a single overall score. However, preference results vary when humans evaluate images from different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. MPS introduces a preference condition module on top of the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of recent text-to-image models. MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation.
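
The TL;DR states that MPS is trained with KL-divergence against human judgments. A minimal sketch of such a pairwise objective is given below. It assumes the predicted preference is a softmax over the two image scores produced under one preference condition; this page does not give the paper's exact formulation (for example, any temperature term), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_preference_loss(score_a, score_b, target_a, eps=1e-12):
    """KL(target || predicted) for one image pair and one preference condition.

    score_a, score_b: model scores for images A and B (assumed inputs).
    target_a: normalized human preference for image A, one of 0, 0.5, or 1.
    """
    pred = softmax([score_a, score_b])
    target = [target_a, 1.0 - target_a]
    loss = 0.0
    for t, p in zip(target, pred):
        if t > 0:  # terms with zero target mass contribute nothing
            loss += t * math.log(t / max(p, eps))
    return loss


if __name__ == "__main__":
    # The loss is small when the model agrees with the annotator, large otherwise.
    print(kl_preference_loss(2.0, 0.0, 1.0))
    print(kl_preference_loss(0.0, 2.0, 1.0))
    # A tied annotation is matched exactly by equal scores.
    print(kl_preference_loss(1.0, 1.0, 0.5))
```

With a one-hot target the loss reduces to the usual cross-entropy on the preferred image, while a tied target of 0.5 pulls the two scores toward each other.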

Paper Structure (22 sections, 4 equations, 6 figures, 5 tables)

Introduction
Related work
Text-to-image Generation and Evaluation
Learning human preferences
MHP Dataset
Prompt collection and annotation
Image collection and annotation
Statistics
Multi-dimensional Preference Prediction
Model Structure of MPS
Training
Experiments
Experimental Setup
Preference condition setting.
Evaluation setting.
...and 7 more sections

Figures (6)

Figure 1: As humans evaluate images from different perspectives, their preference for the images also varies. Specifically, when examining the images in the top row, the image on the left stands out in terms of aesthetic appeal, though it falls short in semantic alignment (e.g., two boats on the river) compared to its counterpart on the right. In the case of the bottom row, both images are aesthetically pleasing, yet the image on the right is marred by poor detail quality (e.g., as signified by the red bounding boxes around the distorted hand and foot).
Figure 2: Prompt collection. The initially collected prompts exhibit a long-tail distribution across various categories. After prompt augmentation with GPT, we obtain a relatively balanced prompt dataset, which contains 66,389 prompts.
Figure 3: Annotation interface. Annotators are required to evaluate the preference for the given image pair on four dimensions: aesthetics, detail quality (detail), semantic alignment (alignment), and overall score (overall). Annotation scores are discrete values ranging from 1 to 5, which are subsequently normalized to binary values of 0 or 1; when the scores are tied, the normalized score is set to 0.5.
Figure 4: The framework of Multi-dimensional Preference Score (MPS). The MPS takes the generated image, prompt and preference condition as the input, and predicts the quality (i.e. human preference) of the generated image under the given preference condition.
Figure 5: Correlation between real user preferences and model predictions. The x-axis of each subplot represents the annotated real human preferences, and the y-axis denotes the model's predictions. We examine three models: CLIP score, PickScore, and MPS (ours). Each subplot is annotated with the calculated correlation coefficient R-value, where a higher R-value indicates a closer alignment of the model's predictions with actual human preferences.
...and 1 more figures
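
The label normalization described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes each image in a pair receives its own 1 to 5 score per dimension and that the normalized label is taken from the point of view of image A. The names `DIMENSIONS`, `normalize_pair`, and `normalize_annotation` are hypothetical.

```python
# Short dimension names as they appear in the Figure 3 caption.
DIMENSIONS = ("aesthetics", "detail", "alignment", "overall")


def normalize_pair(score_a, score_b):
    """Map two discrete 1-5 scores to a preference label for image A.

    Returns 1.0 if A is preferred, 0.0 if B is preferred, 0.5 on a tie.
    """
    for s in (score_a, score_b):
        if s not in (1, 2, 3, 4, 5):
            raise ValueError(f"score must be an integer from 1 to 5, got {s!r}")
    if score_a > score_b:
        return 1.0
    if score_a < score_b:
        return 0.0
    return 0.5


def normalize_annotation(scores_a, scores_b):
    """Normalize one annotated image pair across all four dimensions.

    scores_a, scores_b: dicts mapping each dimension name to a 1-5 score
    (an assumed record layout, not taken from the dataset release).
    """
    return {d: normalize_pair(scores_a[d], scores_b[d]) for d in DIMENSIONS}


if __name__ == "__main__":
    a = {"aesthetics": 5, "detail": 3, "alignment": 2, "overall": 4}
    b = {"aesthetics": 3, "detail": 3, "alignment": 4, "overall": 4}
    print(normalize_annotation(a, b))
```

Because each dimension is normalized independently, one pair can favor image A on aesthetics and image B on alignment, which is the situation illustrated in Figure 1.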
