An empirical study on evaluation metrics of generative adversarial networks

Qiantong Xu, Gao Huang, Yang Yuan, Chuan Guo, Yu Sun, Felix Wu, Kilian Weinberger

TL;DR

The paper systematically evaluates a set of sample-based GAN evaluation metrics, highlighting the challenges in choosing metrics that meaningfully reflect how closely generated data match real data. Through controlled experiments on CelebA and LSUN, using convolutional feature spaces (notably ResNet features), it shows that kernel MMD and the 1-NN two-sample test provide robust, efficient discrimination between real and generated distributions and are sensitive to mode collapse/drop and potential overfitting. It finds that Inception Score and Mode Score can be gamed by out-of-domain content and fail to detect overfitting, while Wasserstein distance suffers from high sample and computational costs; Fréchet Inception Distance emerges as a strong practical metric among the traditional choices. The authors recommend adopting MMD and 1-NN in a well-chosen feature space for practical GAN evaluation and provide open-source code to enable broader, standardized benchmarking.

Abstract

Evaluating generative adversarial networks (GANs) is inherently challenging. In this paper, we revisit several representative sample-based evaluation metrics for GANs, and address the problem of how to evaluate the evaluation metrics. We start with a few necessary conditions for metrics to produce meaningful scores, such as distinguishing real from generated samples, identifying mode dropping and mode collapsing, and detecting overfitting. With a series of carefully designed experiments, we comprehensively investigate existing sample-based metrics and identify their strengths and limitations in practical settings. Based on these results, we observe that kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbor (1-NN) two-sample test seem to satisfy most of the desirable properties, provided that the distances between samples are computed in a suitable feature space. Our experiments also unveil interesting properties about the behavior of several popular GAN models, such as whether they are memorizing training samples, and how far they are from learning the target distribution.
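
To make the two recommended metrics concrete, here is a minimal NumPy sketch of a kernel MMD estimate and a leave-one-out 1-NN two-sample test. It is an illustration under stated assumptions, not the authors' released code: the Gaussian kernel, the fixed bandwidth `sigma`, the biased estimator, and the synthetic Gaussian "features" are all choices made for this example. In practice the inputs would be features extracted by a suitable network rather than raw pixels.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and b (m, d)."""
    d2 = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def mmd2_gaussian(x, y, sigma=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel (assumed kernel)."""
    def k(a, b):
        return np.exp(-pairwise_sq_dists(a, b) / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def one_nn_loo_accuracy(x, y):
    """Leave-one-out accuracy of a 1-NN classifier on the pooled samples.

    Rows of x are labeled 1 and rows of y are labeled 0. Each point is
    classified by the label of its nearest other point in the pool.
    """
    z = np.vstack([x, y])
    labels = np.concatenate([np.ones(len(x)), np.zeros(len(y))])
    d2 = pairwise_sq_dists(z, z)
    np.fill_diagonal(d2, np.inf)  # a point may not be its own neighbor
    nearest = d2.argmin(axis=1)
    return float((labels[nearest] == labels).mean())

# Synthetic stand-ins for feature vectors (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))                 # "real" features
y_same = rng.normal(size=(200, 8))            # same distribution as x
y_shift = rng.normal(loc=3.0, size=(200, 8))  # clearly different distribution

mmd_same = mmd2_gaussian(x, y_same, sigma=2.0)
mmd_shift = mmd2_gaussian(x, y_shift, sigma=2.0)
acc_same = one_nn_loo_accuracy(x, y_same)
acc_shift = one_nn_loo_accuracy(x, y_shift)
```

A smaller MMD indicates closer distributions. For the 1-NN test, an accuracy near 50% suggests the two sample sets are hard to tell apart, while an accuracy near 100% means they are easily separated.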

Paper Structure

This paper contains 35 sections, 7 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Typical sample-based GAN evaluation methods.
  • Figure 2: Distinguishing a set of real images from a mixed set of real images and GAN-generated images. For the metric to be discriminative, its score should increase as the fraction of generated samples in the mix increases. RIS and RMS fail on LSUN, as they decrease with the fraction of generated samples in $S_g$. Wasserstein distance and 1-NN accuracy (real) fail in pixel space, as they do not increase.
  • Figure 3: Experiment on simulated mode collapsing. A metric score should increase to reflect the mismatch between true distribution and generated distribution as more modes are collapsed towards their cluster center. All metrics respond correctly in convolutional space. In pixel space, both Wasserstein distance and 1-NN accuracy (real) fail as they decrease in response to more collapsed clusters.
  • Figure 4: Experiment on simulated mode dropping. A metric score should increase to reflect the mismatch between true distribution and generated distribution as more modes are dropped. All metrics except RIS and RMS respond correctly; RIS and RMS increase only slightly in value even when almost all modes are dropped.
  • Figure 5: Experiment on robustness of each metric to small transformations (rotations and translations). All metrics should remain constant across all mixes of real and transformed real samples, since the transformations do not alter image semantics. All metrics respond correctly in convolutional space, but not in pixel space. This experiment illustrates the unsuitability of distances in pixel space.
  • ...and 6 more figures
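
The mode-dropping check described for Figure 4 can be sketched on synthetic data. The sketch below is a hedged illustration, not the paper's protocol: the 2-D blobs standing in for clustered features, the helper names `make_modes` and `drop_modes`, the size-preserving resampling, and the Gaussian-kernel MMD with `sigma=1.0` are all assumptions made for this example. It only shows the expected behavior, namely that a well-behaved score grows as more modes are removed from the generated set.

```python
import numpy as np

def gaussian_mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate with a Gaussian kernel (assumed choice)."""
    def kernel_mean(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return kernel_mean(x, x) + kernel_mean(y, y) - 2.0 * kernel_mean(x, y)

def make_modes(rng, n_modes=10, per_mode=50, radius=10.0, std=0.5):
    """Hypothetical stand-in for clustered features: 2-D blobs on a circle."""
    angles = 2.0 * np.pi * np.arange(n_modes) / n_modes
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    noise = std * rng.normal(size=(n_modes, per_mode, 2))
    return centers[:, None, :] + noise  # shape (n_modes, per_mode, 2)

def drop_modes(rng, modes, n_dropped):
    """Keep only the first modes, resampling so the set size is unchanged."""
    n_modes, per_mode, dim = modes.shape
    kept = modes[: n_modes - n_dropped].reshape(-1, dim)
    idx = rng.integers(0, len(kept), size=n_modes * per_mode)
    return kept[idx]

rng = np.random.default_rng(0)
real = make_modes(rng).reshape(-1, 2)  # all 10 modes present
generated_pool = make_modes(rng)       # independent draw of the same modes

dropped_counts = (0, 3, 6, 9)
scores = [
    gaussian_mmd2(real, drop_modes(rng, generated_pool, k))
    for k in dropped_counts
]
```

Here `scores` holds one MMD value per entry of `dropped_counts`; a metric that is sensitive to mode dropping should produce an increasing sequence.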