Statistical Inference for Generative Model Comparison

Zijun Gao, Yan Sun, Han Su

TL;DR

The paper tackles principled uncertainty quantification for evaluating generative models by introducing a relative score based on the KL divergence, which enables kernel-free, nuisance-free comparison between models. It develops an unbiased, first-order U-statistic estimator for the relative score and proves asymptotic normality, with a plug-in variance for confidence intervals. To improve finite-sample performance, it augments the theory with Edgeworth expansions, yielding more accurate coverage, especially for small sample sizes, and it extends the framework to conditional generative models. Empirically, the method achieves faithful coverage and competitive power on simulated data and real image/text datasets (e.g., CIFAR-10, Wikitext-2, TriviaQA), providing statistically valid inferences that align with benchmark metrics while supplying uncertainty quantification. This approach offers a principled, scalable way to compare generative models with statistical confidence, facilitating more robust model selection and ranking in practice.

Abstract

Generative models have achieved remarkable success across a range of applications, yet their evaluation still lacks principled uncertainty quantification. In this paper, we develop a method for comparing how close different generative models are to the underlying distribution of test samples. In particular, our approach employs the Kullback-Leibler (KL) divergence to measure the distance between a generative model and the unknown test distribution, as KL requires no tuning parameters such as the kernels used by RKHS-based distances, and is the only $f$-divergence that admits a crucial cancellation to enable the uncertainty quantification. Furthermore, we extend our method to comparing conditional generative models and leverage Edgeworth expansions to address limited-data settings. On simulated datasets with known ground truth, we show that our approach achieves effective coverage rates and has higher power than kernel-based methods. When applied to generative models on image and text datasets, our procedure yields conclusions consistent with benchmark metrics but with statistical confidence.
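As a rough illustration of the idea in the abstract, the sketch below estimates a difference of KL divergences from per-sample log-likelihoods and attaches a normal-approximation confidence interval. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `relative_score_ci` and `gaussian_logpdf`, the Gaussian toy models, and the sign convention are all hypothetical choices made here. The quantity estimated is $\mathbb{E}_{\mathbb{P}}[\log \hat{p}_1(X) - \log \hat{p}_2(X)]$, which equals $\mathrm{KL}(\mathbb{P} \| \hat{\mathbb{P}}_2) - \mathrm{KL}(\mathbb{P} \| \hat{\mathbb{P}}_1)$; the ordering is picked so that a negative value favours the second model, as in the Figure 1 caption, and the paper's own definition may order the terms differently.

```python
import math
import numpy as np


def relative_score_ci(logp1, logp2, z=1.96):
    """Sample-mean estimate and normal-approximation CI for
    E_P[log p1_hat(X) - log p2_hat(X)] (sign convention assumed here).

    logp1, logp2: log-densities of the two models evaluated on the
    same test samples drawn from the unknown distribution P.
    """
    d = np.asarray(logp1, dtype=float) - np.asarray(logp2, dtype=float)
    n = d.size
    est = d.mean()                       # unbiased sample mean
    se = d.std(ddof=1) / math.sqrt(n)    # plug-in standard error
    return est, (est - z * se, est + z * se)


def gaussian_logpdf(x, mean, std):
    """Log-density of N(mean, std^2), used only for the toy example."""
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


# Toy example (hypothetical): test data from P = N(0, 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)
logp1 = gaussian_logpdf(x, 1.0, 1.0)   # model 1: N(1, 1),   KL(P||P1) = 0.5
logp2 = gaussian_logpdf(x, 0.2, 1.0)   # model 2: N(0.2, 1), KL(P||P2) = 0.02

est, (lo, hi) = relative_score_ci(logp1, logp2)
# Population value is 0.02 - 0.5 = -0.48, so the interval should sit below 0.
print(f"estimate = {est:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Note that the density of the test distribution never appears: only the two model log-likelihoods on the test samples are needed, which is what makes the comparison feasible when $\mathbb{P}$ is unknown.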

Paper Structure

This paper contains 40 sections, 8 theorems, 58 equations, 25 figures, 4 tables.

Key Result

Proposition 1

For an $f$-divergence with $f \in C^1$, if there exists a function $g$ such that for any $\hat{\mathbb{P}}_1$, $\hat{\mathbb{P}}_2$, $\mathbb{P}$, then there exists $\beta \ge 0$ such that $f(x) = \beta x \log(x)$, i.e., $D_f({\mathbb{P}} \| \hat{\mathbb{P}}_1) = \beta \mathrm{KL}({\mathbb{P}} \| \hat{\mathbb{P}}_1)$.
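The cancellation that singles out the KL divergence can be written out directly. The following is a short worked derivation, not a restatement of the paper's equation; it assumes that $\mathbb{P}$, $\hat{\mathbb{P}}_i$, $\hat{\mathbb{P}}_j$ admit densities $p$, $\hat{p}_i$, $\hat{p}_j$ and that all expectations are finite.

```latex
\begin{align*}
\mathrm{KL}(\mathbb{P} \| \hat{\mathbb{P}}_i) - \mathrm{KL}(\mathbb{P} \| \hat{\mathbb{P}}_j)
  &= \mathbb{E}_{\mathbb{P}}\bigl[\log p(X) - \log \hat{p}_i(X)\bigr]
   - \mathbb{E}_{\mathbb{P}}\bigl[\log p(X) - \log \hat{p}_j(X)\bigr] \\
  &= \mathbb{E}_{\mathbb{P}}\bigl[\log \hat{p}_j(X) - \log \hat{p}_i(X)\bigr].
\end{align*}
```

The unknown term $\mathbb{E}_{\mathbb{P}}[\log p(X)]$ drops out, so the difference depends on $\mathbb{P}$ only through an expectation of known functions of $X$, which a sample average over test data can estimate.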

Figures (25)

  • Figure 1: An example of our method applied to comparing diffusion models (DDIMs with different numbers of denoising steps $S$). Here, $\mathbb{P}$ represents the distribution of the test images, $\hat{\mathbb{P}}_1$ corresponds to the DDIM model with $S = 50$ denoising steps, and $\hat{\mathbb{P}}_2$ corresponds to the DDIM model with $S = 100$ denoising steps. Our method demonstrates that the confidence interval for the relative score $\delta(\hat{\mathbb{P}}_1, \hat{\mathbb{P}}_2)$ is significantly negative, indicating that $\hat{\mathbb{P}}_2$ with $S = 100$ achieves significantly better performance. While consistent with the FID reported in Song et al. (2021), our method further quantifies the statistical significance of the performance difference, which FID cannot provide.
  • Figure 2: Coverage rates and lengths of confidence intervals obtained by Edgeworth expansions \ref{eq:CI-EEs} and the central limit theorem \ref{eq:CI} across different sample sizes $n$ using simulated data generated by \ref{eq:simu_model_2} with linear transformation $g(X)=X$ and $\epsilon = 0.07$.
  • Figure 3: Coverage rate and power of confidence intervals constructed by our method \ref{eq:CI} and by existing KL divergence and $W_2$ distance estimators paired with resampling methods (Subsampling and Adaptive HulC). We provide two implementations of our procedure: "Ours" can access $\hat{p}_1$, $\hat{p}_2$, while "Ours (auto-encoder)" uses an auto-encoder to approximate $\hat{p}_1$, $\hat{p}_2$ (further details are provided in Section \ref{appe:sec:simulations} of the Supplementary Material).
  • Figure 4: Histogram of existing estimators for the $W_2$ distance and KL divergence. The vertical line indicates the true value in the simulated example.
  • Figure 5: Coverage rate and length of CIs constructed by EEs and CLT of $T$-statistics under multilayer sigmoid transformation.
  • ...and 20 more figures

Theorems & Definitions (22)

  • Proposition 1
  • Proposition 2
  • Theorem 3.1
  • Corollary 3.1
  • Corollary 3.2
  • Theorem 4.1
  • Theorem 4.2
  • Corollary 4.1
  • Remark 1
  • Proof of \ref{prop:KL.unique}
  • ...and 12 more