Statistical Inference for Generative Model Comparison
Zijun Gao, Yan Sun, Han Su
TL;DR
The paper tackles principled uncertainty quantification for evaluating generative models by introducing a relative score based on the KL divergence, which enables kernel-free, nuisance-free comparison between models. It develops an unbiased, first-order U-statistic estimator for the relative score and proves asymptotic normality, with a plug-in variance for confidence intervals. To improve finite-sample performance, it augments the theory with Edgeworth expansions, yielding more accurate coverage, especially for small sample sizes, and it extends the framework to conditional generative models. Empirically, the method achieves faithful coverage and competitive power on simulated data and real image/text datasets (e.g., CIFAR-10, Wikitext-2, TriviaQA), providing statistically valid inferences that align with benchmark metrics while supplying uncertainty quantification. This approach offers a principled, scalable way to compare generative models with statistical confidence, facilitating more robust model selection and ranking in practice.
Abstract
Generative models have achieved remarkable success across a range of applications, yet their evaluation still lacks principled uncertainty quantification. In this paper, we develop a method for comparing how close different generative models are to the underlying distribution of test samples. In particular, our approach employs the Kullback-Leibler (KL) divergence to measure the distance between a generative model and the unknown test distribution, because KL requires no tuning parameters such as the kernels used by RKHS-based distances, and is the only $f$-divergence that admits a crucial cancellation that enables uncertainty quantification. Furthermore, we extend our method to comparing conditional generative models and leverage Edgeworth expansions to address limited-data settings. On simulated datasets with known ground truth, we show that our approach achieves effective coverage rates and has higher power than kernel-based methods. When applied to generative models on image and text datasets, our procedure yields conclusions consistent with benchmark metrics but with statistical confidence.
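To make the cancellation concrete: for a test distribution $P$ with density $p$ and two models with densities $q_1$ and $q_2$, the difference $\mathrm{KL}(P \| q_1) - \mathrm{KL}(P \| q_2)$ equals $\mathbb{E}_P[\log q_2(X) - \log q_1(X)]$, because the unknown $\mathbb{E}_P[\log p(X)]$ term appears in both divergences and drops out. The sketch below estimates this quantity by a sample mean over test points and attaches a normal-approximation confidence interval with a plug-in variance. It is an illustration under stated assumptions rather than the authors' implementation: the sign convention (negative means model 1 is closer), the function names `relative_score_ci` and `gaussian_logpdf`, and the Gaussian toy models are all hypothetical, and the Edgeworth correction is not included.

```python
import math
from statistics import NormalDist

import numpy as np


def relative_score_ci(logp1, logp2, level=0.95):
    """Estimate KL(P||q1) - KL(P||q2) from per-sample model log-likelihoods.

    logp1[i] and logp2[i] are log q1(x_i) and log q2(x_i) on the same test
    points x_i drawn from P. Hypothetical helper, assumed sign convention:
    a negative estimate means model 1 is closer to P in KL.
    """
    logp1 = np.asarray(logp1, dtype=float)
    logp2 = np.asarray(logp2, dtype=float)
    diffs = logp2 - logp1                    # log p(x_i) cancels, so it is never needed
    n = diffs.size
    est = diffs.mean()                       # sample-mean estimate of the relative score
    se = diffs.std(ddof=1) / math.sqrt(n)    # plug-in standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return est, (est - z * se, est + z * se)


def gaussian_logpdf(x, mu, sigma=1.0):
    """Log-density of N(mu, sigma^2), used here as a toy generative model."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


# Toy check: P = N(0, 1), model 1 = N(0.1, 1), model 2 = N(0.5, 1).
# Analytically KL(P||N(mu, 1)) = mu^2 / 2, so the true relative score is
# 0.005 - 0.125 = -0.12.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
est, (lo, hi) = relative_score_ci(gaussian_logpdf(x, 0.1), gaussian_logpdf(x, 0.5))
print(f"estimate={est:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

An interval lying entirely below zero supports the conclusion that model 1 is closer to the test distribution; an interval that covers zero means the data do not separate the two models at the chosen level. Only model log-likelihoods on the test samples are required, so no kernel or bandwidth has to be chosen.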
