Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

George Stein; Jesse C. Cresswell; Rasa Hosseinzadeh; Yi Sui; Brendan Leigh Ross; Valentin Villecroze; Zhaoyan Liu; Anthony L. Caterini; J. Eric T. Taylor; Gabriel Loaiza-Ganem

TL;DR

This study reveals that widely used metrics like FID, which rely on Inception-style features, do not reliably reflect human perceptions of realism and unfairly penalize diffusion models. By evaluating a broad set of encoders, metrics, and datasets, the authors show that self-supervised, ViT-based representations (notably DINOv2-ViT-L/14) align better with human judgments, mitigating the diffusion–FID discrepancy. They also demonstrate that memorization is dataset-dependent and that existing memorization metrics fail to robustly detect it, advocating for direct memorization detection and broader reporting. The work provides a comprehensive, public benchmark and tooling to advance fairer, more informative evaluation of generative models.

Abstract

We systematically study a wide variety of generative models spanning semantically-diverse image datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing to 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization: none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation we release all generated image datasets, human evaluation data, and a modular library to compute 17 common metrics for 9 different encoders at https://github.com/layer6ai-labs/dgm-eval.
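
As a rough illustration of the kind of metric discussed above, the Fréchet distance (FD) between two sets of encoder features is commonly computed by fitting a Gaussian to each set and evaluating ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}). The sketch below is a minimal NumPy version under stated assumptions: the function name `frechet_distance` and the random feature arrays are hypothetical stand-ins for real encoder outputs (for example Inception-V3 or DINOv2 features), and it is not the API of the released dgm-eval library.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) feature arrays.

    Hypothetical helper for illustration; not taken from dgm-eval.
    """
    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((cov_r cov_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    # Assumption: random vectors stand in for encoder features of images.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))
    generated = real + 1.0  # same covariance, mean shifted by 1 in each dimension
    print(frechet_distance(real, real))
    print(frechet_distance(real, generated))
```

Two identical feature sets give a distance of approximately zero, and shifting every feature by a constant changes only the mean term. Because the value depends entirely on the features passed in, swapping the encoder changes what the metric measures, which is the dependence the abstract highlights.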

Paper Structure (70 sections, 16 equations, 30 figures, 13 tables)

Introduction
Datasets, metrics, and encoders
Human evaluation of generated data
Improved representation spaces for generative evaluation
Qualitative examination of perceptual spaces
The (mis)alignment of evaluation metrics and human assessment
Alternative explanations: diversity, rarity, and memorization
Diversity and rarity
Memorization
Conclusions
Broader impact
Limitations
Compute
Assets and code
Generated datasets
...and 55 more sections

Figures (30)

Figure 1: An illustration of learned distributions and samples (orange, crosses) having different properties with respect to the true distribution and training set (blue, squares). Italicized text indicates metrics that purport to detect these properties.
Figure 2: Human error rate on models ranked by FID. Data is displayed as the mean across participants with error bars showing the unbiased standard error.
Figure 3: Heatmaps visualizing what the Fréchet distance "perceives" for each encoder. The sign of the heatmap is given by the activations of the saliency layer that is visualized and does not reflect the sign of the gradient w.r.t. the FD -- both bright yellow and deep blue can thus show an encoder’s focus. Additional examples are shown in Appendix \ref{app:gradcam}.
Figure 4: Top: Fréchet distance, precision, and human error rate for each generative model as measured by different encoders (columns) on different datasets (rows). Marker styles denote different generative techniques. Panels with a shaded background do not have strong ($|r| \geq 0.5$) and significant ($p \leq 0.05$) correlations between FD and human error rate. Bottom: Pearson correlation of metrics over the three high-resolution datasets.
Figure 5: Left: Per-class Vendi scores of ImageNet models, in decreasing order of FD$_{\hbox{DINOv2}}$ score. Right: FD$_{\hbox{DINOv2}}$ on ImageNet, coloured by average per-class Vendi score (white corresponds to the train dataset).
...and 25 more figures
