Learning Multi-dimensional Human Preference for Text-to-Image Generation
Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang
TL;DR
This work addresses the misalignment between traditional statistical metrics and human preferences in text-to-image generation by introducing the Multi-dimensional Human Preference (MHP) dataset and the Multi-dimensional Preference Score (MPS). MHP provides large-scale, balanced prompts, images generated by diverse models, and human annotations along four dimensions, enabling unified learning of aesthetic, detail quality, semantic alignment, and overall preferences. The MPS model uses a CLIP-based architecture with a condition mask and cross-attention to predict scores conditioned on each preference dimension, and is trained by minimizing the KL divergence between its predictions and human judgments. Across three datasets, MPS achieves state-of-the-art performance in both overall and multi-dimensional preference prediction. The accompanying MPS benchmark offers a standardized, multi-faceted evaluation framework for benchmarking text-to-image models and guiding their development toward human-aligned outputs.
Abstract
Current evaluation of text-to-image models typically relies on statistical metrics, which inadequately represent real human preferences. Although recent work attempts to learn these preferences from human-annotated images, it reduces the rich range of human preference to a single overall score. However, preference results vary when humans evaluate images along different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS adds a preference condition module on top of the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of recent text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation.
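
As a rough illustration of the KL-divergence training mentioned in the TL;DR, the sketch below computes a pairwise preference loss for one image pair under a single preference dimension. It is not the authors' implementation: the function names, the use of a two-way softmax over scalar scores, and the encoding of the human label as a probability pair are all assumptions made for illustration.

```python
import math

def log_preference_probs(score_a, score_b):
    """Log-softmax over two scalar scores (hypothetical model outputs).

    Returns the log-probabilities that image A and image B are preferred.
    """
    m = max(score_a, score_b)
    # log-sum-exp with the max subtracted for numerical stability
    log_z = m + math.log(math.exp(score_a - m) + math.exp(score_b - m))
    return score_a - log_z, score_b - log_z

def kl_preference_loss(score_a, score_b, target):
    """KL(target || predicted) for one image pair.

    target is an assumed label encoding: (1.0, 0.0) if annotators prefer A,
    (0.0, 1.0) if they prefer B, and (0.5, 0.5) for a tie.
    """
    log_pred = log_preference_probs(score_a, score_b)
    loss = 0.0
    for t, lp in zip(target, log_pred):
        if t > 0.0:  # terms with zero target probability contribute nothing
            loss += t * (math.log(t) - lp)
    return loss
```

For example, `kl_preference_loss(2.0, 0.0, (1.0, 0.0))` is smaller than `kl_preference_loss(0.0, 2.0, (1.0, 0.0))`, since the first pair of scores agrees with the label. In a multi-dimensional setting one would presumably compute such a loss separately for each of the four dimensions, with the scores conditioned on that dimension.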
