A Practical Guide to Sample-based Statistical Distances for Evaluating Generative Models in Science

Sebastian Bischoff; Alana Darcher; Michael Deistler; Richard Gao; Franziska Gerken; Manuel Gloeckler; Lisa Haxel; Jaivardhan Kapoor; Janne K Lappalainen; Jakob H Macke; Guy Moss; Matthijs Pals; Felix Pei; Rachel Rapp; A Erdem Sağtekin; Cornelius Schröder; Auguste Schulz; Zinovia Stefanidi; Shoji Toyota; Linda Ulmer; Julius Vetter

TL;DR

This guide offers an accessible entry point to popular sample-based statistical distances, requiring only foundational knowledge of mathematics and statistics, and shows that different distances can give different results on similar data.

Abstract

Generative models are invaluable in many fields of science because of their ability to capture high-dimensional and complicated distributions, such as photo-realistic images, protein structures, and connectomes. How do we evaluate the samples these models generate? This work aims to provide an accessible entry point to understanding popular sample-based statistical distances, requiring only foundational knowledge in mathematics and statistics. We focus on four commonly used notions of statistical distances representing different methodologies: using low-dimensional projections (Sliced-Wasserstein; SW), obtaining a distance using classifiers (Classifier Two-Sample Tests; C2ST), using embeddings through kernels (Maximum Mean Discrepancy; MMD), or using embeddings through neural networks (Fréchet Inception Distance; FID). We highlight the intuition behind each distance and explain their merits, scalability, complexity, and pitfalls. To demonstrate how these distances are used in practice, we evaluate generative models from different scientific domains, namely a model of decision-making and a model generating medical images. We showcase that distinct distances can give different results on similar data. Through this guide, we aim to help researchers use, interpret, and evaluate statistical distances for generative models in science.

Paper Structure (50 sections, 28 equations, 19 figures, 4 tables)


Introduction
Sample-based statistical distances
Slicing-based: Sliced-Wasserstein (SW) distance
Definition of Wasserstein Distance
Sample-Based Wasserstein Distance
Slicing Wasserstein brings efficiency
Limitations
Classifier-based: Classifier Two-Sample Test (C2ST)
Common failure modes
C2ST can remain very high even for seemingly good generative models
Other C2ST variants
Kernel-based: maximum mean discrepancy (MMD)
MMD in practice
Network-based: Embedding-space measures
Limitations
...and 35 more sections

Figures (19)

Figure 1: The need for statistical distances in scientific generative modeling. (a) An example target distribution, $p_{true}(x)$, and two learned distributions ($p_1(x)$ and $p_2(x)$) of different models trained to capture $p_{true}(x)$. All three distributions share the same mean and marginal variances, despite having distinct shapes. However, an appropriate sample-based distribution distance $D$ can determine that $p_2(x)$ is more similar to $p_{true}(x)$. (b) Scientific applications often require evaluating high-dimensional distributions, such as distributions of images or tabular data. In this example, each point represents an X-ray image, where each dimension is one pixel.
Figure 2: Schematic for the Sliced-Wasserstein distance. (a) Samples from two two-dimensional distributions along with example slices. The "slicing" is done by sampling random directions from the unit sphere and projecting the samples from the higher-dimensional distribution onto that direction. (b) One-dimensional projections of the two distributions corresponding to the two random slices in (a). For each pair of projections, the empirical Wasserstein distance is computed. Unlike in higher dimensions, this can be done efficiently for one-dimensional distributions.
Figure 3: Computing Wasserstein distance. Two transport maps mapping the samples from a two-dimensional distribution $p_1$ (black) to samples from another distribution $p_2$ (blue), shown by arrows. The color of the arrow corresponds to the cost (Euclidean distance) between $x_i$ and $y_i$. (a) Randomly chosen transport map. (b) The optimal transport map, giving the smallest total cost. The total cost for the optimal map in (b) is the Wasserstein distance between these two sets of samples. Note that this schematic demonstrates transport maps for the non-sliced Wasserstein distance in two dimensions.
Figure 4: Classifier Two-Sample Test (C2ST). (a) The C2ST classifier problem: identifying the source distribution of a given sample. The optimal classifier predicts the higher-density distribution at every observed sample value, resulting in a majority of samples being correctly classified. (b) When probability densities of the distributions are not known, the optimal classifier is approximated by training a classifier, e.g., a neural network, to discriminate samples from the two distributions. (c) C2ST values vary from $0.5$ when distributions exactly overlap (left) to $1.0$ when distributions are completely separable (right).
Figure 5: Failure modes and behavior of C2ST. (a) Data (top left) and Gaussian maximum-likelihood estimate (bottom left). C2ST wrongly returns 0.5 (no difference between the densities) if too few samples are used (top right) or the neural network is poorly chosen (bottom right). (b) For high-dimensional densities, even when the marginals of data (black) and model (gray) seem well-aligned, small differences (here a mean shift of 0.25 std. in every dimension) make the distributions easier for the classifier to distinguish as dimensionality increases, yielding a correct but surprisingly high C2ST. (c) On MNIST, the C2ST between the data (top) and both a Gaussian generative model (middle) and a Mixture of Gaussians (MoG, bottom) is 1.0, although the MoG is perceptually closer to the data.
...and 14 more figures
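
The slicing procedure described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names, the number of slices, the use of the order-1 Wasserstein distance, and the plain average over slices are all assumptions made here. It also assumes both sample sets have the same size, so that the one-dimensional optimal transport map simply pairs sorted values.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a direction uniformly from the unit sphere (normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def wasserstein_1d(a, b):
    """Empirical order-1 Wasserstein distance between equal-size 1D samples.

    In one dimension the optimal transport map pairs the sorted samples,
    so the cost is the mean absolute difference of the sorted values.
    """
    assert len(a) == len(b), "this sketch assumes equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(xs, ys, n_slices=100, seed=0):
    """Average the 1D Wasserstein distance over random projections (slices)."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_slices):
        u = random_direction(dim, rng)
        # Project every sample onto the random direction u.
        px = [sum(c * w for c, w in zip(x, u)) for x in xs]
        py = [sum(c * w for c, w in zip(y, u)) for y in ys]
        total += wasserstein_1d(px, py)
    return total / n_slices
```

Each slice costs one sort per sample set, which is why the one-dimensional case is cheap compared with solving a transport problem in the original space.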

Theorems & Definitions (4)

Definition A.1: Feature Map Definition of MMD
Definition A.2: Kernel Definition of MMD
Definition A.3: Supremum Definition of MMD
Definition A.4: Characteristic Kernel
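
As a rough illustration of the kernel form named in Definition A.2, the squared MMD is commonly written as $\mathrm{MMD}^2 = \mathbb{E}[k(x, x')] + \mathbb{E}[k(y, y')] - 2\,\mathbb{E}[k(x, y)]$. The listing above gives only the definition titles, so this standard form, the Gaussian kernel, the default bandwidth, and the function names below are assumptions, not the paper's exact statement. The sketch uses the biased estimator that replaces each expectation with a mean over all sample pairs.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased sample estimate of squared MMD with a Gaussian kernel.

    Each expectation is replaced by the mean kernel value over all pairs,
    including pairs of a sample with itself.
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate depends on the bandwidth, so in practice it would need to be chosen with the scale of the data in mind; the value of 1.0 here is only a placeholder.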
