GRADE: Quantifying Sample Diversity in Text-to-Image Models

Royi Rassin, Aviv Slobodkin, Shauli Ravfogel, Yanai Elazar, Yoav Goldberg

TL;DR

GRADE addresses the problem of evaluating semantic diversity in text-to-image models when prompts are underspecified. It combines world-knowledge-driven prompting (via LLMs) with VQA-based attribute extraction to construct concept-attribute distributions and uses normalized entropy as a diversity score, all without reference images. The study demonstrates that current models exhibit limited diversity and frequent default behaviors, and that underspecified captions in training data contribute to this homogeneity, with an apparent inverse-scaling relationship between model size and diversity. The framework offers a practical, interpretable tool for diagnosing and guiding improvements in T2I systems, including data curation and diversity-driven training objectives, and points to future work on multi-attribute interactions and integration with existing metrics.

Abstract

We introduce GRADE, an automatic method for quantifying sample diversity in text-to-image models. Our method leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., "shape" for the concept "cookie"). It then estimates frequency distributions of concepts and their attributes and quantifies diversity using entropy. We use GRADE to measure the diversity of 12 models over a total of 720K images, revealing that all models display limited variation, with clear deterioration in stronger models. Further, we find that models often exhibit default behaviors, a phenomenon where a model consistently generates concepts with the same attributes (e.g., 98% of the cookies are round). Lastly, we show that a key reason for low diversity is underspecified captions in training data. Our work proposes an automatic, semantically-driven approach to measure sample diversity and highlights the stunning homogeneity in text-to-image models.
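To make the scoring idea concrete, here is a minimal sketch of a normalized-entropy diversity score over one concept-attribute distribution, together with a simple check for a default behavior. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of the natural logarithm, normalizing by the log of the number of candidate attribute values, and the 0.8 threshold for flagging a default are all hypothetical choices made here.

```python
import math
from collections import Counter


def normalized_entropy(answers, num_values):
    """Shannon entropy of the empirical answer distribution, scaled to [0, 1].

    `answers` is a list of attribute values, one per generated image.
    `num_values` is the number of candidate values for the attribute
    (assumption: normalization divides by log(num_values)).
    """
    if num_values < 2 or not answers:
        return 0.0
    counts = Counter(answers)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_values)


def default_behavior(answers, threshold=0.8):
    """Return (value, share) if one value dominates, else None.

    The 0.8 threshold is an illustrative assumption, not a value from the paper.
    """
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    return (value, share) if share >= threshold else None


# Example mirroring the abstract: 98% of generated cookies are round.
cookie_shapes = ["round"] * 98 + ["square"] * 2
score = normalized_entropy(cookie_shapes, num_values=4)
dominant = default_behavior(cookie_shapes)
```

A score near 0 means almost all samples share one attribute value, while a score of 1 means the samples are spread uniformly over the candidate values. Dividing by the log of the number of candidate values keeps scores comparable across attributes with different numbers of options.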

Paper Structure

This paper contains 40 sections, 4 equations, 22 figures, 14 tables.

Figures (22)

  • Figure 1: GRADE scores for T2I generations and corresponding web-search results, for three models and concepts.
  • Figure 2: Workflow of GRADE using “cookie” as input. (a) Generate prompts that mention “cookie” without specifying its attributes, and use them to generate images. (b) Formulate attribute-related questions and extract responses from the images using a VQA model. (c) Produce attribute values and map the responses to these values. (d) Quantify the diversity of the resulting attribute distributions.
  • Figure 3: Images generated with the prompt "a princess at a children's party" show differences in model diversity. From top to bottom, SD-1.4 (most diverse), SDXL, and FLUX.1-dev (least diverse). Although none are highly diverse, there is a marked difference between them. Specifically, diversity is reduced in attributes such as the ethnicities of depicted people, colors of dresses, and overall backgrounds.
  • Figure 4: (a) GRADE score in the multi-prompt setting plotted against the denoiser's parameter size. To a degree, diversity deteriorates as parameter size grows. This effect is most apparent within each model family. (b) GRADE score in the multi-prompt setting plotted against the percentage of answers mapped to "none of the above". The section on validating GRADE shows that 80% of these answers correspond to concepts missing from the image. Low "none of the above" values correspond to high prompt adherence. The plot suggests a tradeoff between prompt adherence and diversity.
  • Figure 5: Difference in diversity between models. Images generated using the prompt "a bag on a cliffside". Each row corresponds to a model, top-down: SD-1.4 (most diverse), SDXL, and FLUX.1-dev (least diverse). While no model exhibits high diversity, there is a marked difference between SD-1.4 and FLUX.1-dev, with SDXL between them. Specifically, diversity is reduced in attributes such as color and placement of the bags, as well as the background.
  • ...and 17 more figures
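Steps (c) and (d) of the workflow in Figure 2, along with the "none of the above" rate in Figure 4(b), can be sketched as follows. This is a simplified illustration, not the paper's pipeline: plain substring matching stands in for whatever model maps the VQA responses to attribute values, and every function name and the example data are hypothetical.

```python
from collections import Counter

NONE_OF_THE_ABOVE = "none of the above"


def map_answer(answer, attribute_values):
    """Map a free-form VQA answer to one attribute value.

    Assumption: case-insensitive substring matching replaces the
    mapping step used in the paper. Unmatched answers fall back to
    "none of the above".
    """
    text = answer.lower()
    for value in attribute_values:
        if value.lower() in text:
            return value
    return NONE_OF_THE_ABOVE


def attribute_distribution(answers, attribute_values):
    """Relative frequency of each attribute value across the answers."""
    if not answers:
        return {}
    counts = Counter(map_answer(a, attribute_values) for a in answers)
    total = len(answers)
    keys = [*attribute_values, NONE_OF_THE_ABOVE]
    return {key: counts[key] / total for key in keys}


# Hypothetical VQA answers to "What shape is the cookie?"
vqa_answers = ["It is round.", "A square cookie", "I see no cookie"]
dist = attribute_distribution(vqa_answers, ["round", "square", "star"])
none_rate = dist[NONE_OF_THE_ABOVE]
```

Tracking the "none of the above" share separately is useful because, per the Figure 4 caption, it serves as a proxy for prompt adherence rather than as a diversity signal.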