Refereeing the Referees: Evaluating Two-Sample Tests for Validating Generators in Precision Sciences

Samuele Grossi; Marco Letizia; Riccardo Torre

TL;DR

Tests built from one-dimensional statistics reach a sensitivity comparable to that of multivariate metrics at significantly lower computational cost, which makes them well suited to evaluating generative models in high-dimensional settings.

Abstract

We propose a robust methodology to evaluate the performance and computational efficiency of non-parametric two-sample tests, specifically designed for high-dimensional generative models in scientific applications such as particle physics. The study focuses on tests built from univariate integral probability measures: the sliced Wasserstein distance and the mean of the Kolmogorov-Smirnov statistics, already discussed in the literature, and the novel sliced Kolmogorov-Smirnov statistic. These metrics can be evaluated in parallel, allowing for fast and reliable estimates of their distribution under the null hypothesis. We also compare these metrics with the recently proposed unbiased Fréchet Gaussian Distance and the unbiased quadratic Maximum Mean Discrepancy, computed with a quartic polynomial kernel. We evaluate the proposed tests on various distributions, focusing on their sensitivity to deformations parameterized by a single parameter $\epsilon$. Our experiments include correlated Gaussians and mixtures of Gaussians in 5, 20, and 100 dimensions, and a particle physics dataset of gluon jets from the JetNet dataset, considering both jet- and particle-level features. Our results demonstrate that tests based on one-dimensional statistics provide a level of sensitivity comparable to that of multivariate metrics, but with significantly lower computational cost, making them ideal for evaluating generative models in high-dimensional settings. This methodology offers an efficient, standardized tool for model comparison and can serve as a benchmark for more advanced tests, including machine-learning-based approaches.
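To make the three univariate-based statistics named in the abstract concrete, the following is a minimal standard-library sketch, not the authors' implementation. It assumes that the "sliced" statistics average a one-dimensional statistic over random unit directions and that the mean KS averages over the coordinate marginals; any sample-size normalization used in the paper is omitted. All function names and the `n_slices` and `seed` parameters are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:  # the supremum is attained at a sample point
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def wasserstein_1d(x, y):
    """1-Wasserstein distance for equal-size 1D samples (sorted matching)."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def random_direction(dim, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(sample, direction):
    """Project each d-dimensional point onto the given direction."""
    return [sum(p * d for p, d in zip(point, direction)) for point in sample]


def mean_ks(x, y):
    """Average of the KS statistics of the d coordinate marginals."""
    dim = len(x[0])
    per_axis = (
        ks_statistic([p[i] for p in x], [p[i] for p in y]) for i in range(dim)
    )
    return sum(per_axis) / dim


def sliced_statistic(x, y, stat_1d, n_slices=50, seed=0):
    """Average a 1D statistic over random projections (assumed convention)."""
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_slices):
        u = random_direction(dim, rng)
        total += stat_1d(project(x, u), project(y, u))
    return total / n_slices


if __name__ == "__main__":
    rng = random.Random(1)
    ref = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    shifted = [[rng.gauss(0.5, 1.0) for _ in range(5)] for _ in range(200)]
    print("mean KS:", mean_ks(ref, shifted))
    print("sliced KS:", sliced_statistic(ref, shifted, ks_statistic))
    print("sliced W:", sliced_statistic(ref, shifted, wasserstein_1d))
```

Each projection (or marginal) is processed independently, which is what allows these statistics to be evaluated in parallel. Calling the same functions repeatedly on pairs of samples drawn from the reference distribution alone would give an empirical estimate of the distribution under the null hypothesis.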

Paper Structure

This paper contains 37 sections, 32 equations, 5 figures, and 7 tables.

Introduction
Two-sample hypothesis testing
Test statistics
Sliced Wasserstein distance
Kolmogorov-Smirnov inspired test statistics
Mean KS
Sliced KS
Maximum Mean Discrepancy
Fréchet Gaussian Distance
Likelihood-ratio
Methodology
Datasets and analysis setup
Toy models
Mixture of Gaussians (MoG)
Correlated Gaussians (CG)
...and 22 more sections

Figures (5)

Figure 1: Left: Corner plot showing the 1D and 2D marginal probability distributions for the reference and deformed distributions for the MoG model with $d=5$ and $q=3$, and $\Sigma_{ii}$-deformation with $\epsilon=0.5$. Right: Same as the left plot but for the CG model with $d=5$. The plots are made with $10^{6}$ points per sample.
Figure 2: Color plot showing the correlation matrix for the reference (left) and deformed (right) distributions for the MoG model with $d=20$ and $q=5$, and $\Sigma_{i\neq j}$-deformation with $\epsilon=0.5$. The figure is identical in the case of the CG model, since the same correlation matrix is used for both models. The plots are made with $10^{6}$ points per sample.
Figure 3: Each pair of plots represents the empirical PDF (left) and CDF (right) of the test statistic under the null hypothesis for the MoG model with $d=20$ and $q=5$, and $n=m=5\cdot 10^{4}$ samples. See the main text for a full description of the plots.
Figure 4: Original jet kinematic distributions compared with the $\mu$, $\Sigma_{ii}$, and $\Sigma_{i\neq j}$ (left), and pow$^{+}$, pow$^{-}$, $\mathcal{N}$, and $\mathcal{U}$ (right) deformations with $\epsilon = 0.5$. The plots are made with $10^{6}$ points per sample.
Figure 5: Corner plots of the original jet kinematic distributions compared with the $\mu$ (left) and $\Sigma_{ii}$ (right) deformations with $\epsilon = 0.5$. The plots are made with $10^{6}$ points per sample.
