Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples

Marco Jiralerspong, Avishek Joey Bose, Ian Gemp, Chongli Qin, Yoram Bachrach, Gauthier Gidel

TL;DR

This work empirically demonstrates that FLD identifies overfitting problem cases even when previously proposed metrics fail. It also extensively evaluates FLD on various image datasets and model classes, showing that FLD matches the intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models.

Abstract

The past few years have seen impressive progress in the development of deep generative models capable of producing high-dimensional, complex, and photo-realistic data. However, current methods for evaluating such models remain incomplete: standard likelihood-based metrics do not always apply and rarely correlate with perceptual fidelity, while sample-based metrics, such as FID, are insensitive to overfitting, i.e., inability to generalize beyond the training set. To address these limitations, we propose a new metric called the Feature Likelihood Divergence (FLD), a parametric sample-based metric that uses density estimation to provide a comprehensive trichotomic evaluation accounting for novelty (i.e., different from the training samples), fidelity, and diversity of generated samples. We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail. We also extensively evaluate FLD on various image datasets and model classes, demonstrating its ability to match intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models. Code is available at https://github.com/marcojira/fld.
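
The claim that FID-style metrics are insensitive to overfitting can be illustrated with a small sketch. This is not the paper's evaluation pipeline: the features below are synthetic Gaussian vectors standing in for embeddings, the training set is used as the reference, and `frechet_distance`, `memorized`, and `fresh` are hypothetical names introduced only for this example.

```python
import numpy as np

def frechet_distance(a, b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) is the sum of square roots of the eigenvalues of cov_a @ cov_b.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
dim = 8  # assumed feature dimension, chosen small for illustration
train = rng.normal(size=(2000, dim))  # stand-in for training-set features
# A "model" that memorizes: near-exact copies of the training features.
memorized = train + rng.normal(scale=1e-3, size=train.shape)
# A "model" that generalizes: independent draws from the same distribution.
fresh = rng.normal(size=(2000, dim))

fd_memorized = frechet_distance(memorized, train)
fd_fresh = frechet_distance(fresh, train)
print(f"memorized vs train: {fd_memorized:.6f}")
print(f"fresh vs train:     {fd_fresh:.6f}")
```

Under these assumptions the memorizing model obtains the lower (better) distance, since the score only compares first and second moments and carries no notion of novelty.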

Paper Structure (29 sections, 2 theorems, 18 equations, 18 figures, 2 tables, 1 algorithm)

This paper contains 29 sections, 2 theorems, 18 equations, 18 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Let $D_{ij} := \|\varphi(\mathbf{x}^{\text{gen}}_j)-\varphi(\mathbf{x}^{\text{train}}_i)\|^2$ be the squared distance between a generated sample and a train sample. Assume $\forall i, j: D_{ij} \leq \hat{D}$ with $\delta_j := \min_i D_{ij}$. Then, for any $l \in \{1,\ldots,m\}$, we have that $\hat{\sigma}_l$ ...
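
The quantities named in the proposition can be computed directly from feature arrays. The following is a minimal sketch assuming the features $\varphi(\cdot)$ have already been extracted into NumPy arrays; `squared_distances` and the toy inputs are hypothetical, and taking $\hat{D}$ as the largest entry of $D$ is just one valid choice of upper bound.

```python
import numpy as np

def squared_distances(train_feats, gen_feats):
    """D[i, j] = ||phi(x_gen_j) - phi(x_train_i)||^2, shape (n_train, n_gen)."""
    diff = train_feats[:, None, :] - gen_feats[None, :, :]
    return np.sum(diff ** 2, axis=-1)

# Toy 2D features (illustrative values only).
train_feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gen_feats = np.array([[0.0, 0.1], [3.0, 0.0]])

D = squared_distances(train_feats, gen_feats)
delta = D.min(axis=0)  # delta_j = min_i D_ij: distance to the nearest train sample
D_hat = D.max()        # smallest constant satisfying D_ij <= D_hat for all i, j
print(delta, D_hat)
```

A small $\delta_j$ flags a generated sample that sits almost on top of a training sample, which is the situation the proposition is concerned with.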

Figures (18)

  • Figure 1: The generative model evaluation "trichotomy": fidelity, diversity and novelty. Each metric maps to a color delimiting its criteria for evaluation.
  • Figure 2: Test FID/FLD of various models on CIFAR10. FID values are higher than usual as we use the DINOv2 feature space with the test set as reference (10k samples) instead of the usual 50k.
  • Figure 3: Steps involved in our overfit mixture of Gaussians, illustrated on a 2D example.
  • Figure 4: Estimated density (in purple) of the generated distribution using an MoG centered at the generated samples $\mathbf{x}^{\text{gen}}_i$ (in blue), as in Eq. \ref{eq:MoG_density}. The selection of $\sigma_i^2$ is done via Eq. \ref{eq:cross_val}. The training points $\mathbf{x}^{\text{train}}_i\sim p_d$, sampled from the two-moons dataset, are represented in orange. The generated points correspond to $k$ approximate copies of the training set, $\mathbf{x}^{\text{gen}}_i = \mathbf{x}^{\text{train}}_i + \mathcal{N}(0,10^{-4}),\ i=1,\ldots,k$, and $200-k$ independent samples from the data distribution, $\mathbf{x}^{\text{gen}}_i \sim p_{d},\ i=k+1,\ldots,200$. The dark areas correspond to high-density values.
  • Figure 5: Starting from a set of SOTA samples produced by PFGM++, we replace each sample with a transformed copy. Left: Effect of nearly imperceptible transformations on FLD and FID (with corresponding values for various models as reference). Right: Effect of large transformations on FLD and FID.
  • ...and 13 more figures
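
The setup described in the Figure 4 caption can be approximated with a short sketch. Several parts are assumptions: `two_moons` is a hand-rolled stand-in for the two-moons dataset, $k = 50$ is an arbitrary choice, and a single fixed variance `sigma2` replaces the per-sample selection of $\sigma_i^2$, whose rule is not reproduced on this page.

```python
import numpy as np

def two_moons(n, noise, rng):
    """Sample n points from a two-moons layout (hypothetical stand-in for p_d)."""
    t = rng.uniform(0.0, np.pi, size=n)
    upper = rng.integers(0, 2, size=n).astype(bool)
    x = np.where(upper, np.cos(t), 1.0 - np.cos(t))
    y = np.where(upper, np.sin(t), 0.5 - np.sin(t))
    pts = np.stack([x, y], axis=1)
    return pts + rng.normal(scale=noise, size=pts.shape)

def mog_log_density(queries, centers, sigma2):
    """Log density of an equal-weight isotropic MoG centered at `centers`."""
    d = centers.shape[1]
    sq = np.sum((queries[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    log_comp = -0.5 * sq / sigma2 - 0.5 * d * np.log(2.0 * np.pi * sigma2)
    m = log_comp.max(axis=1, keepdims=True)  # numerically stable log-sum-exp
    lse = m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))
    return lse.ravel() - np.log(len(centers))

rng = np.random.default_rng(0)
n, k = 200, 50  # k is an assumed number of near-copies
train = two_moons(n, noise=0.05, rng=rng)
# k approximate copies of training points: noise with variance 1e-4 (std 1e-2).
copies = train[:k] + rng.normal(scale=1e-2, size=(k, 2))
# n - k independent samples from the data distribution.
fresh = two_moons(n - k, noise=0.05, rng=rng)
gen = np.concatenate([copies, fresh], axis=0)

# Fixed shared variance: an assumption, not the paper's selection rule.
log_p = mog_log_density(train, gen, sigma2=1e-4)
copied_mean = log_p[:k].mean()  # train points that were copied
other_mean = log_p[k:].mean()   # train points that were not copied
print(f"copied: {copied_mean:.2f}, not copied: {other_mean:.2f}")
```

With a narrow bandwidth, the MoG places sharp density spikes on the copied training points, which is the qualitative effect the dark regions in the figure depict.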

Theorems & Definitions (5)

  • Proposition 1
  • Definition 3.1
  • Definition 3.2
  • Proposition 1
  • proof