Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation

Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajić, Amal Rannen-Triki, Cristina Vasconcelos, Aida Nematzadeh

TL;DR

This work tackles the challenge of measuring diversity in text-to-image (T2I) generation by proposing an attribute-focused evaluation framework that explicitly specifies the concept being assessed and its factor of variation. It introduces a prompt set generated with large language models, a purpose-built human evaluation template, and a procedure that compares models using binomial tests on human annotations, complemented by an analysis of automatic evaluation with the Vendi Score across multiple image embeddings. Empirical results across five prominent T2I models indicate that Imagen 3 and Flux 1.1 exhibit strong attribute diversity, and that automatic evaluation can align substantially with human judgments, depending on the embedding and conditioning used. The framework also evaluates the sufficiency of the prompt set and explores Gemini-based autoraters, culminating in a dataset and methodology intended to guide future metric development and diversity-focused improvements in T2I systems.

Abstract

Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.

Paper Structure

This paper contains 31 sections, 2 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Evaluating diversity requires specifying both the concept being assessed and the factor of variation to reduce ambiguity in the annotation process.
  • Figure 2: Each slice represents a concept, grouped and color-coded by its overall category.
  • Figure 3: Match with the golden set depending on different set sizes.
  • Figure 4: The distribution of counts for sets of images labelled as "diverse" or "non-diverse" in the golden set for the pilot study.
  • Figure 5: Human evaluation results. (a) Inter-annotator agreement results in terms of Krippendorff's $\alpha$-reliability. (b) We compare model rankings in terms of significance in the number of wins with two-sided binomial tests under a 95% confidence level. Each entry in the grid represents a comparison between two models. The sign indicates the model in the row is better ($>$), worse ($<$), or not significantly different ($=$) than the model in the column.
  • ...and 16 more figures
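
The pairwise comparison described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes each comparison reduces to win counts for two models with ties discarded, and tests those counts against a fair-coin null (p = 0.5) using an exact two-sided binomial test at the 95% level. The function names and the win counts below are hypothetical and not taken from the paper, whose exact procedure may differ.

```python
from math import comb


def binom_two_sided_pvalue(wins: int, n: int) -> float:
    """Exact two-sided binomial test of `wins` out of `n` against p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    is twice the smaller tail probability, capped at 1.
    """
    if n == 0:
        return 1.0
    k = min(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def compare(wins_row: int, wins_col: int, alpha: float = 0.05) -> str:
    """Return '>', '<', or '=' for the row model versus the column model.

    Assumption: ties were dropped before counting, so n = wins_row + wins_col.
    """
    n = wins_row + wins_col
    if binom_two_sided_pvalue(wins_row, n) >= alpha:
        return "="  # difference in wins is not significant
    return ">" if wins_row > wins_col else "<"


if __name__ == "__main__":
    # Hypothetical win counts, for illustration only.
    print(compare(70, 30))
    print(compare(52, 48))
```

Applying `compare` to every ordered pair of models would fill a grid of the kind the caption describes.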

Theorems & Definitions (1)

  • Definition 1: Adapted from Friedman and Dieng (2022), Definition 3.1
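
For orientation, the standard (unadapted) Vendi Score is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n × n positive semidefinite similarity matrix with unit diagonal. The sketch below is a minimal illustration under two assumptions not stated on this page: a cosine-similarity kernel over image embeddings, and NumPy as the numerical library. The paper's adapted definition may differ, for example in how it conditions on a concept or attribute.

```python
import numpy as np


def vendi_score(embeddings: np.ndarray) -> float:
    """Standard Vendi Score with a cosine-similarity kernel (assumed here).

    `embeddings` has shape (n, d), one row per image. The score ranges from
    1 (all items identical) to n (all items mutually orthogonal).
    """
    # L2-normalise rows so that K has a unit diagonal.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = x.shape[0]
    k = x @ x.T
    # Eigenvalues of K/n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(k / n)
    # Drop numerical zeros, using the convention 0 * log(0) = 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


if __name__ == "__main__":
    identical = np.tile(np.array([[1.0, 0.0]]), (4, 1))  # four copies of one vector
    orthogonal = np.eye(4)                               # four orthogonal vectors
    print(vendi_score(identical), vendi_score(orthogonal))
```

The two toy inputs show the extremes: identical items give an effective count near 1, while mutually orthogonal items give an effective count near n.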