Learning to Validate Generative Models: a Goodness-of-Fit Approach
Pietro Cappelli, Gaia Grosso, Marco Letizia, Humberto Reyes-González, Marco Zanetti
TL;DR
The paper addresses the validation of high-dimensional generative models in scientific contexts by introducing the New Physics Learning Machine (NPLM), a density-ratio–based goodness-of-fit test grounded in Neyman–Pearson principles. NPLM models the data density as $q(x)=e^{f_w(x)}p_R(x)$, where $p_R$ is the reference density and $f_w(x)=\sum_i w_i k_\sigma(x,x_i)$ is a kernel expansion with trainable weights $w_i$. The weights are fitted with a regularized logistic loss to estimate the density ratio, yielding a test statistic $t_{\hat{w}}(\mathcal{D},\mathcal{R})$ whose null distribution is approximated by a $\chi^2$ fit. The method is validated on two benchmarks: normalizing flows trained on mixtures of Gaussians, and FlowSim jet data. The results show that larger training sets improve fidelity and that the choice of reference (true data vs generator) affects the apparent strength of discrepancies. Beyond validation, NPLM provides event-level anomaly scores and diagnostic tools to localize mismodeled features, offering practical guidance for improving generative architectures and enabling reliable use of surrogates in scientific analyses.
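To make the construction above concrete, here is a minimal NumPy sketch of a kernel-based density-ratio test of this kind. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian form of $k_\sigma$, the plain gradient-descent fit, the $\lambda\lVert w\rVert^2$ regularizer, the choice of kernel centers as a random subset of the reference sample, the rescaling of the reference to the observed data yield, and the extended likelihood-ratio form of the statistic are all choices made here for illustration. The function names `gaussian_kernel` and `nplm_statistic` are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Assumed kernel: k_sigma(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma**2))  # shape (n_points, n_centers)


def nplm_statistic(data, ref, centers, sigma=1.0, lam=1e-3, steps=500):
    """Fit f_w with a weighted, regularized logistic loss and return (t, w)."""
    n_d, n_r = len(data), len(ref)
    w_ref = n_d / n_r  # assumption: reference rescaled to the observed data yield
    k = gaussian_kernel(np.vstack([data, ref]), centers, sigma)
    y = np.concatenate([np.ones(n_d), np.zeros(n_r)])  # 1 = data, 0 = reference
    sample_w = np.where(y == 1.0, 1.0, w_ref)

    w = np.zeros(centers.shape[0])
    step = 2.0 / centers.shape[0]  # conservative step size for kernel values in [0, 1]
    for _ in range(steps):
        f = k @ w
        sigmoid = 1.0 / (1.0 + np.exp(-f))
        # Gradient of the weighted logistic loss plus the lam * ||w||^2 penalty.
        grad = k.T @ (sample_w * (sigmoid - y)) / sample_w.sum() + 2.0 * lam * w
        w -= step * grad

    f = k @ w
    f_data, f_ref = f[:n_d], f[n_d:]
    # Assumed statistic: extended log-likelihood ratio for q(x) = exp(f_w(x)) p_R(x).
    t = 2.0 * (f_data.sum() - w_ref * (np.exp(f_ref) - 1.0).sum())
    return t, w


# Toy check: a reference sample, one compatible data set, one shifted data set.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 2))
centers = ref[rng.choice(len(ref), size=20, replace=False)]
data_null = rng.normal(size=(200, 2))
data_alt = rng.normal(size=(200, 2)) + np.array([1.0, 0.0])

t_null, _ = nplm_statistic(data_null, ref, centers)
t_alt, _ = nplm_statistic(data_alt, ref, centers)
print(f"t (compatible data): {t_null:.2f}")
print(f"t (shifted data):    {t_alt:.2f}")
```

In this toy setup the shifted sample should yield a markedly larger statistic than the compatible one. Turning a value of $t$ into a p-value requires the null distribution, which the paper approximates with a $\chi^2$ fit; that calibration step is not shown here.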
Abstract
Generative models are increasingly central to scientific workflows, yet their systematic use and interpretation require a proper understanding of their limitations through rigorous validation. Classic approaches struggle with scalability, statistical power, or interpretability when applied to high-dimensional data, making it difficult to certify the reliability of these models in realistic scientific settings. Here, we propose the use of the New Physics Learning Machine (NPLM), a learning-based approach to goodness-of-fit testing inspired by the Neyman–Pearson construction, to test generative networks trained on high-dimensional scientific data. We demonstrate the performance of NPLM for validation in two benchmark cases: generative models trained on mixtures of Gaussians with increasing dimensionality, and a public end-to-end model, known as FlowSim, developed to generate high-energy physics collision events. We show that NPLM can serve as a powerful validation method while also providing a means to diagnose sub-optimally modeled regions of the data.
