Learning to Validate Generative Models: a Goodness-of-Fit Approach

Pietro Cappelli, Gaia Grosso, Marco Letizia, Humberto Reyes-González, Marco Zanetti

TL;DR

The paper tackles the challenge of validating high-dimensional generative models in scientific contexts by introducing the New Physics Learning Machine (NPLM), a density-ratio–based goodness-of-fit test grounded in Neyman–Pearson principles. NPLM models the data density as $q(x)=e^{f_w(x)}p_R(x)$ with $f_w(x)=\sum_i w_i k_\sigma(x,x_i)$, and uses a regularized logistic loss to estimate the density ratio, yielding a test statistic $t_{\hat{w}}(\mathcal{D},\mathcal{R})$ whose null distribution is approximated by a $\chi^2$ fit. The method is validated on two benchmarks: mixtures of Gaussians with normalizing flows and FlowSim jet data, showing that larger training sets improve fidelity and that the choice of reference (true data vs generator) affects the apparent strength of discrepancies. Beyond validation, NPLM provides event-level anomaly scores and diagnostic tools to localize mismodeled features, offering practical guidance for improving generative architectures and enabling reliable use of surrogates in scientific analyses.

Abstract

Generative models are increasingly central to scientific workflows, yet their systematic use and interpretation require a proper understanding of their limitations through rigorous validation. Classic approaches struggle with scalability, statistical power, or interpretability when applied to high-dimensional data, making it difficult to certify the reliability of these models in realistic, high-dimensional scientific settings. Here, we propose the use of the New Physics Learning Machine (NPLM), a learning-based approach to goodness-of-fit testing inspired by the Neyman–Pearson construction, to test generative networks trained on high-dimensional scientific data. We demonstrate the performance of NPLM for validation in two benchmark cases: generative models trained on mixtures of Gaussian models with increasing dimensionality, and a public end-to-end model, known as FlowSim, developed to generate high-energy physics collision events. We demonstrate that the NPLM can serve as a powerful validation method while also providing a means to diagnose sub-optimally modeled regions of the data.

Paper Structure

This paper contains 15 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Schematic representation of the NPLM test.
  • Figure 2: Validation Z-scores of the NF, trained on the 4D-MoG with 500k samples, as a function of the sample size used for the test. The blue line represents MoG as the reference case, while NF as the reference case is shown in orange.
  • Figure 3: Empirical distribution of the NPLM test statistic ($t$) for NF models with $D=4$ and $\text{N}_{\text{tr}}=100\text{k},\,200\text{k},\,500\text{k}$ (shades of green). The null empirical distribution, shown as the light blue histogram, is compared with a $\chi^2$ distribution with $98.3$ degrees of freedom.
  • Figure 4: Z-scores of the FlowSim validation as a function of the size of the analyzed FlowSim dataset. The blue line represents FullSim as the reference case, while FlowSim as the reference case is shown in orange.
  • Figure 5: Examples of the NPLM classifier score distributions evaluated on reference data and data for the NF experiments. The green and orange histograms are the output of the model trained on data from the most accurate and the least accurate NFs respectively. The grey histogram represents the mean over ten reference-distributed toys. The region in light grey covers one standard deviation around the mean.
  • ...and 1 more figure
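
The captions above refer to Z-scores and to a $\chi^2$ distribution with $98.3$ degrees of freedom describing the null. A hedged sketch of how an observed value of $t$ could be turned into a Z-score is given below. It assumes that the p-value is the upper tail of the fitted $\chi^2$ and that Z is the corresponding one-sided Gaussian quantile; this summary does not spell out that definition. The helper names `chi2_sf` and `z_score` are hypothetical, and the incomplete gamma function is evaluated with a textbook series and continued fraction so that only the standard library is needed.

```python
import math
from statistics import NormalDist

def _gamma_p_series(a, x):
    # Regularized lower incomplete gamma P(a, x) by power series (used for x < a + 1)
    term = 1.0 / a
    total = term
    ap = a
    for _ in range(1000):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def _gamma_q_cf(a, x):
    # Regularized upper incomplete gamma Q(a, x) by continued fraction (used for x >= a + 1)
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_sf(t, df):
    # Upper-tail probability P(X > t) for X ~ chi^2(df); df may be non-integer
    if t <= 0.0:
        return 1.0
    a, x = df / 2.0, t / 2.0
    if x < a + 1.0:
        return 1.0 - _gamma_p_series(a, x)
    return _gamma_q_cf(a, x)

def z_score(t, df):
    # One-sided Gaussian significance of the p-value (assumed convention).
    # Valid only while 0 < p < 1; inv_cdf raises StatisticsError otherwise.
    p = chi2_sf(t, df)
    return -NormalDist().inv_cdf(p)

# Illustrative values of t only; they are not taken from the paper
z_low = z_score(120.0, 98.3)
z_high = z_score(150.0, 98.3)
```

For very large values of $t$ the p-value underflows and the quantile is no longer defined in floating point. In that regime a log-space tail evaluation or an empirical estimate from reference toys would be needed instead.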