Image Generation Diversity Issues and How to Tame Them

Mischa Dombrowski, Weitong Zhang, Sarah Cechnicka, Hadrien Reynaud, Bernhard Kainz

TL;DR

The paper tackles the persistent gap between high-fidelity image generation and comprehensive distributional diversity. It introduces the Image Retrieval Score (IRS), a theoretically grounded, hyperparameter-free metric that reframes diversity evaluation as an image retrieval problem and provides confidence bounds via Stirling-based analytics. It demonstrates a pervasive measurement gap in common feature extractors, showing diffusion models fail to cover more than a fraction of the training distribution, and then proposes Diversity-Aware Diffusion Models (DiADM) that disentangle diversity from fidelity using pseudo-unconditional features. The work also offers online model rejection based on IRS, extends IRS to text-to-image with bias diagnostics, and provides a broad empirical evaluation across datasets, extractors, and model families, highlighting IRS as a powerful tool for reliable diversity assessment and guidance for model improvement.

Abstract

Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art model surpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves the diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality with a diversity-aware module that takes pseudo-unconditional features as input. To further facilitate the evaluation of generative models, we provide a Python package offering unified feature extraction and metric computation: https://github.com/MischaD/beyondfid
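To make the retrieval framing concrete, the following is a minimal sketch of the raw counting step only: each synthetic feature vector acts as a query, its nearest real feature vector counts as retrieved, and the score is the fraction of distinct real images retrieved. The function name `retrieved_fraction`, the use of Euclidean distance, and the toy feature arrays are assumptions made for illustration. This is not the beyondfid API, and it omits the adjustment against a real reference dataset and the confidence estimate that the abstract attributes to IRS.

```python
import numpy as np


def retrieved_fraction(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    """Fraction of real images that are the nearest neighbour of at least
    one synthetic query (hypothetical helper, Euclidean distance assumed)."""
    # Pairwise squared distances, shape (n_synth, n_real), via
    # ||s - r||^2 = ||s||^2 - 2 s.r + ||r||^2
    d2 = (
        (synth_feats ** 2).sum(axis=1)[:, None]
        - 2.0 * synth_feats @ real_feats.T
        + (real_feats ** 2).sum(axis=1)[None, :]
    )
    # Index of the retrieved real image for every synthetic query
    nearest = d2.argmin(axis=1)
    # Distinct real images retrieved, relative to the size of the real set
    return np.unique(nearest).size / real_feats.shape[0]


# Toy example: 4 real images, 3 synthetic queries, two of which
# collapse onto the same real image.
real = np.eye(4)
synth = np.array([
    [0.9, 0.0, 0.0, 0.0],
    [0.8, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
score = retrieved_fraction(real, synth)
```

In this toy case the first two queries retrieve the same real image, so only 2 of the 4 real images are covered. A model that keeps generating near-duplicates of a few training modes would show the same pattern regardless of how realistic each sample looks.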

Paper Structure

This paper contains 25 sections, 1 theorem, 16 equations, 12 figures, 6 tables.

Key Result

Theorem 1.

IRS exhibits higher statistical sensitivity than FID in detecting diversity insufficiency and bias amplification in generative models.

Figures (12)

  • Figure 1: Predicted gender diversity - as one possible example for diversity - after sampling pre-trained text-to-image diffusion models infinitely often. Visually it is apparent that most terms are inherently biased toward one gender. Our proposed IRS score predicts that off-the-shelf models will only ever reach 50 percent diversity, which is equivalent to one gender being perfectly represented while the other gender is missing entirely. Using diversification strategies increases diversity. Details can be found in the paper's section on text-to-image diversity.
  • Figure 2: We model finding image pairs (image retrieval) as randomly drawing from urns with replacement (Ideal). We observe that all feature extraction models used exhibit performance issues: they collapse in the feature space for real images, resulting in a measurement gap that affects currently used metrics such as FID$_{\text{Inception}}$, FID$_{\text{DINOv2}}$, Precision and Recall. Comparing the results of synthetic images to real data shows that datasets produced by generative models exhibit even stronger diversity issues, which results in a synthetic data distribution gap. Our proposed metric leverages a real reference dataset to remove this measurement gap.
  • Figure 3: Overview of our proposed DiADM model. Instead of using placeholder labels for unconditional generation, we propose to leverage precomputed features from ${\mathcal{F}}$. That way we can disentangle fidelity (FID) from diversity (IRS).
  • Figure 4: Visualization of the measurement gap across a diverse set of feature extractors by computing the unadjusted diversity of real data. The theoretical limit would be sampling N$_{train}$ images where all of them correspond to a different image in the training dataset. The idealized scenario follows the paper's expected-diversity equation.
  • Figure 5: Measuring diversity of datasets by removing classes and computing IRS. If only 800 out of 1000 classes are part of the test set, we say that the diversity is at 80%. By manually removing ImageNet classes we can assess how well commonly used metrics measure diversity compared to ours. To improve visual interpretability we normalize FID and Vendi to lie between 0 and 1, with 1 being best. IRS correlates best with the real diversity, measured as the ratio of the number of classes in the test dataset to the number in the train dataset.
  • ...and 7 more figures
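The urn model named in the Figure 2 caption can be illustrated with a short sketch. When k items are drawn uniformly with replacement from N urns, the expected fraction of distinct urns hit is 1 - (1 - 1/N)^k, which is a standard probability result. Whether the paper's expected-diversity equation has exactly this form is an assumption here, and the function names `expected_distinct_fraction` and `simulated_distinct_fraction` are hypothetical.

```python
import numpy as np


def expected_distinct_fraction(n_urns: int, n_draws: int) -> float:
    """Expected fraction of distinct urns hit after n_draws uniform draws
    with replacement: 1 - (1 - 1/N)^k (standard result, assumed to match
    the idealized curve in the figures)."""
    return 1.0 - (1.0 - 1.0 / n_urns) ** n_draws


def simulated_distinct_fraction(n_urns: int, n_draws: int,
                                n_trials: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = np.random.default_rng(seed)
    fractions = [
        np.unique(rng.integers(0, n_urns, size=n_draws)).size / n_urns
        for _ in range(n_trials)
    ]
    return float(np.mean(fractions))


# Drawing as many samples as there are training images still leaves
# a sizeable share of the training set unretrieved, even in the ideal case.
ideal = expected_distinct_fraction(1000, 1000)
simulated = simulated_distinct_fraction(1000, 1000)
```

Under this idealized model, sampling as many queries as there are training images covers only about 63% of them, so a retrieval count below 100% is expected even for a perfect generator. This is why a raw count has to be interpreted against the ideal curve rather than against 100%.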

Theorems & Definitions (1)

  • Theorem 1.