GRADE: Quantifying Sample Diversity in Text-to-Image Models
Royi Rassin, Aviv Slobodkin, Shauli Ravfogel, Yanai Elazar, Yoav Goldberg
TL;DR
GRADE evaluates the semantic diversity of text-to-image (T2I) models when prompts are underspecified. It combines world-knowledge-driven prompting (via LLMs) with VQA-based attribute extraction to construct concept-attribute distributions, and it uses normalized entropy as the diversity score, all without reference images. The study shows that current models exhibit limited diversity and frequent default behaviors, that underspecified captions in training data contribute to this homogeneity, and that diversity appears to decrease as model size grows (an inverse-scaling relationship). The framework offers a practical, interpretable tool for diagnosing T2I systems and guiding improvements such as data curation and diversity-driven training objectives. It also points to future work on multi-attribute interactions and integration with existing metrics.
Abstract
We introduce GRADE, an automatic method for quantifying sample diversity in text-to-image models. Our method leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., "shape" for the concept "cookie"). It then estimates frequency distributions of concepts and their attributes and quantifies diversity using entropy. We use GRADE to measure the diversity of 12 models over a total of 720K images, revealing that all models display limited variation, with clear deterioration in stronger models. Further, we find that models often exhibit default behaviors, a phenomenon where a model consistently generates concepts with the same attributes (e.g., 98% of the cookies are round). Lastly, we show that a key reason for low diversity is underspecified captions in training data. Our work proposes an automatic, semantically-driven approach to measure sample diversity and highlights the stunning homogeneity in text-to-image models.
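To make the scoring step concrete, the following is a minimal sketch of how an entropy-based diversity score could be computed from the attribute values extracted for one concept. It is an illustration under stated assumptions, not the authors' implementation: the function names `normalized_entropy` and `default_behavior`, the choice to normalize by the logarithm of the number of possible attribute values, and the 0.8 dominance threshold are all hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(answers, num_values=None):
    """Entropy of the observed attribute values, scaled to [0, 1].

    `answers` is a list of attribute values, one per generated image
    (for example, the shape a VQA model reports for each cookie).
    `num_values` is the number of possible values for the attribute;
    if omitted, the number of distinct observed values is used.
    Assumption: the score is normalized by log(num_values).
    """
    counts = Counter(answers)
    total = sum(counts.values())
    k = num_values if num_values is not None else len(counts)
    if total == 0 or k <= 1:
        return 0.0  # no observations, or nothing to vary over
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


def default_behavior(answers, threshold=0.8):
    """Return (value, share) if one value dominates the samples, else None.

    Assumption: `threshold` is an illustrative cutoff for "dominates".
    """
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    return (value, share) if share >= threshold else None


# Example mirroring the abstract: 98% of the cookies are round.
cookies = ["round"] * 98 + ["square"] * 2
score = normalized_entropy(cookies, num_values=4)
dominant = default_behavior(cookies)
```

With a distribution this skewed the score is close to 0, while a uniform spread over all possible values gives 1, which is what makes the score comparable across attributes with different numbers of values.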
