A Study on the Evaluation of Generative Models

Eyal Betzalel; Coby Penso; Aviv Navon; Ethan Fetaya

TL;DR

The paper tackles the challenging problem of evaluating implicit generative models that lack likelihoods. It introduces a high-quality synthetic benchmark (NotImageNet32) built from Image-GPT to enable exact likelihood-based divergences ($KL$ and $RKL$) and systematically compares them to empirical metrics like $FID$ and $IS$. The findings show that while empirical metrics correlate with probabilistic divergences, they are highly volatile and can misrank near-equivalent models, with FID$_\infty$ and CLIP-based features offering more reliable assessments than traditional Inception-based metrics. The authors advocate replacing Inception-based scoring with CLIP features, using unbiased and robust metrics (FID$_\infty$, KID, Clean FID), and adopting NotImageNet32 as a standard test-bed, accompanied by public release of their evaluation code.

Abstract

Implicit generative models, which do not return likelihood values, such as generative adversarial networks and diffusion models, have become prevalent in recent years. While these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance for pushing research forward and for distinguishing meaningful gains from random noise. Currently, heuristic metrics such as the Inception Score (IS) and Fréchet Inception Distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear. Additionally, there are questions regarding how meaningful their scores actually are. In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. Our study shows that while FID and IS do correlate with several f-divergences, their ranking of close models can vary considerably, making them problematic when used for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics. Lastly, we look into the base features used for metrics such as FID.
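To make the likelihood-based comparison concrete, the following is a minimal sketch of the standard Monte Carlo estimators for $KL$ and $RKL$ when both the data-generating model and the evaluated model expose log-likelihoods. It is not the authors' released code. The function names (`mc_kl`, `exact_kl`, `demo`) and the three-symbol toy distributions are illustrative assumptions that stand in for Image-GPT and the evaluated model.

```python
import math
import random


def mc_kl(logp_under_sampler, logp_under_other):
    """Monte Carlo estimate of KL(sampler || other).

    Both lists hold per-sample log-likelihoods of the SAME samples, which
    must have been drawn from the sampler distribution.
    """
    if not logp_under_sampler or len(logp_under_sampler) != len(logp_under_other):
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(logp_under_sampler, logp_under_other)]
    return sum(diffs) / len(diffs)


def exact_kl(p, q):
    """Closed-form KL(p || q) for discrete distributions, used as a reference."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def demo(n_samples=20000, seed=0):
    # Hypothetical toy distributions: p1 plays the role of the ground-truth
    # generator, p2 the role of the evaluated model.
    rng = random.Random(seed)
    p1 = [0.5, 0.3, 0.2]
    p2 = [0.4, 0.4, 0.2]

    # KL(p1 || p2): samples come from the ground-truth generator.
    xs1 = rng.choices(range(3), weights=p1, k=n_samples)
    kl = mc_kl([math.log(p1[x]) for x in xs1], [math.log(p2[x]) for x in xs1])

    # RKL = KL(p2 || p1): samples come from the evaluated model.
    xs2 = rng.choices(range(3), weights=p2, k=n_samples)
    rkl = mc_kl([math.log(p2[x]) for x in xs2], [math.log(p1[x]) for x in xs2])

    return kl, rkl, exact_kl(p1, p2), exact_kl(p2, p1)


if __name__ == "__main__":
    kl, rkl, kl_ref, rkl_ref = demo()
    print(f"KL  estimate {kl:.4f} vs exact {kl_ref:.4f}")
    print(f"RKL estimate {rkl:.4f} vs exact {rkl_ref:.4f}")
```

The point of the sketch is that both directions need log-likelihoods from both models: $KL$ averages over samples from the ground-truth generator, while $RKL$ averages over samples from the evaluated model. Implicit models provide no such values, which is why the study relies on a synthetic dataset drawn from a model with tractable likelihoods.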

Paper Structure (23 sections, 2 equations, 9 figures, 6 tables)

Introduction
Background
$KL$-Divergence
Inception Score
Fréchet Inception Distance
Kernel Inception Distance
FID$_\infty$ & IS$_\infty$
Clean FID
Related work
Synthetic dataset as a benchmark
Comparison between evaluation metrics
Volatility
Ranking correlation
Is Inception all we need?
Qualitative analysis of the latent representation
...and 8 more sections

Figures (9)

Figure 1: Illustration: $X$ are ImageNet images, $\hat{X}$ are synthetic images sampled from Image-GPT, $P_1(\hat{X})$ is the ground-truth likelihood that Image-GPT assigns to the synthetic images, and $P_2(\hat{X})$ is the estimate of $P_1(\hat{X})$ computed by the evaluated model, in this case PixelSnail.
Figure 2: Examples of images generated by Image-GPT. The explicit likelihood of each image can be computed.
Figure 3: Test KL and RKL of PixelSnail models along training.
Figure 4: Test FID and negative IS of PixelSnail models along training. We plot the negative Inception Score so that lower is better for all metrics. Details on the hyperparameters summarized in the legend are in the appendix.
Figure 5: Evaluation metrics along the training of four PixelSnail and two VD-VAE models of varying sizes. We plot the negative Inception Score so that lower is better for all metrics.
...and 4 more figures
