Discriminative Estimation of Total Variation Distance: A Fidelity Auditor for Generative Data

Lan Tao, Shirong Xu, Chi-Hua Wang, Namjoon Suh, Guang Cheng

TL;DR

This work introduces a discriminative framework for estimating the total variation (TV) distance between real and synthetic data distributions by linking TV to the Bayes risk of a binary classification task. It derives finite-sample convergence guarantees for Gaussian (and exponential-family) settings, with a rate of the form $O\left(\left(\frac{d\log n}{n}\right)^{\frac{\gamma+1}{\gamma+2}}\right)$ that improves as the distributions become more separable, and highlights a distinctive robustness to dimensionality. Empirical results on simulated data show that the proposed DisE estimator outperforms KDE-based and NN-based baselines, particularly at higher dimensions, and an MNIST-based study confirms that DisE ranks generative fidelity consistently across embedding schemes. The method offers a principled, scalable fidelity auditor for generative data and points toward extending discriminative TV estimation to other divergences for broader auditing tasks.
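
To get a feel for how the exponent $\frac{\gamma+1}{\gamma+2}$ shapes the rate, the sketch below evaluates $\left(\frac{d\log n}{n}\right)^{\frac{\gamma+1}{\gamma+2}}$ for a few values of $\gamma$. It assumes $\gamma$ is the exponent of the low-noise assumption mentioned in the Figure 1 caption, and it ignores the constant hidden by $O(\cdot)$, so the numbers indicate scaling only. The function name `rate` and the chosen values of `n`, `d`, and `gamma` are illustrative, not taken from the paper.

```python
import math


def rate(n: int, d: int, gamma: float) -> float:
    """Evaluate (d*log(n)/n) ** ((gamma+1)/(gamma+2)), without the O(.) constant."""
    base = d * math.log(n) / n
    exponent = (gamma + 1.0) / (gamma + 2.0)
    return base ** exponent


if __name__ == "__main__":
    n, d = 10_000, 5  # illustrative sample size and dimension
    for gamma in (0.0, 1.0, 5.0, 50.0):
        exponent = (gamma + 1.0) / (gamma + 2.0)
        print(f"gamma={gamma:5.1f}  exponent={exponent:.3f}  rate={rate(n, d, gamma):.3e}")
```

The exponent moves from $1/2$ at $\gamma=0$ toward $1$ as $\gamma$ grows, so whenever $d\log n/n<1$ the bound shrinks as $\gamma$ increases.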

Abstract

With the proliferation of generative AI and the increasing volume of generative data (also called synthetic data), assessing the fidelity of generative data has become a critical concern. In this paper, we propose a discriminative approach to estimate the total variation (TV) distance between two distributions as an effective measure of generative data fidelity. Our method quantitatively characterizes the relation between the Bayes risk in classifying two distributions and their TV distance. Therefore, the estimation of the TV distance reduces to that of the Bayes risk. In particular, this paper establishes theoretical results regarding the convergence rate of the estimation error of the TV distance between two Gaussian distributions. We demonstrate that, with a specific choice of hypothesis class in classification, a fast convergence rate in estimating the TV distance can be achieved. Specifically, the estimation accuracy of the TV distance is proven to depend inherently on the separation of the two Gaussian distributions: smaller estimation errors are achieved when the two Gaussian distributions are farther apart. This phenomenon is also validated empirically through extensive simulations. Finally, we apply this discriminative estimation method to rank the fidelity of synthetic image data using the MNIST dataset.
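
The link between Bayes risk and TV distance can be made concrete with a short derivation. The sketch below assumes equal class priors (the equal-weight mixture used in Lemma 3.2), densities $p$ and $q$ for $\mathbb{P}$ and $\mathbb{Q}$, and the 0-1 loss; the symbol $R^*$ for the Bayes risk is notation chosen here and may differ from the paper's. It uses the identity $\min\{a,b\}=\tfrac{1}{2}(a+b-|a-b|)$.

```latex
\begin{align*}
R^* &= \int \min\Big\{\tfrac{1}{2}p(\bm{x}),\ \tfrac{1}{2}q(\bm{x})\Big\}\,d\bm{x} \\
    &= \tfrac{1}{2}\Big(1 - \tfrac{1}{2}\int \big|p(\bm{x}) - q(\bm{x})\big|\,d\bm{x}\Big) \\
    &= \tfrac{1}{2}\big(1 - \mathrm{TV}(\mathbb{P},\mathbb{Q})\big), \\
\mathrm{TV}(\mathbb{P},\mathbb{Q}) &= 1 - 2R^*.
\end{align*}
```

Under these assumptions, any estimate of the Bayes risk yields a TV estimate, and an error of $\varepsilon$ in the risk translates into an error of $2\varepsilon$ in the TV distance.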

Paper Structure (24 sections, 5 theorems, 66 equations, 5 figures, 5 tables)

This paper contains 24 sections, 5 theorems, 66 equations, 5 figures, 5 tables.

Key Result

Lemma 3.2

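Given that $\mathcal{D} \sim \frac{1}{2}N(\bm{\mu}_1,\bm{\Sigma}_1)+\frac{1}{2}N(\bm{\mu}_2,\bm{\Sigma}_2)$, the Bayes decision rule (optimal classifier) for determining the true distribution of a given sample $\bm{x}$ assigns $\bm{x}$ to $N(\bm{\mu}_1,\bm{\Sigma}_1)$ if

$$(\bm{x}-\bm{\mu}_1)^\top\bm{\Sigma}_1^{-1}(\bm{x}-\bm{\mu}_1) + \log\mathrm{det}(\bm{\Sigma}_1) \le (\bm{x}-\bm{\mu}_2)^\top\bm{\Sigma}_2^{-1}(\bm{x}-\bm{\mu}_2) + \log\mathrm{det}(\bm{\Sigma}_2),$$

and to $N(\bm{\mu}_2,\bm{\Sigma}_2)$ otherwise, where $\mathrm{det}(\cdot)$ denotes the determinant of a matrix.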
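
For an equal-weight mixture of two Gaussians, the Bayes rule amounts to comparing the two log densities, which reduces to comparing a Mahalanobis term plus a log-determinant. The following is a minimal sketch of that comparison, followed by a Monte Carlo estimate of the Bayes risk and the resulting TV value $1-2R^*$. It uses the true means and covariances, so it illustrates the lemma rather than the paper's DisE estimator, which learns the classifier from samples. All function names (`bayes_rule`, `tv_from_bayes_risk`, `tv_equal_cov`) and parameter values are hypothetical.

```python
from math import erf, sqrt

import numpy as np


def bayes_rule(x, mu1, cov1, mu2, cov2):
    """Return a boolean array: True where a row of x is assigned to N(mu1, cov1)."""
    def score(mu, cov):
        diff = x - mu
        # Row-wise Mahalanobis term (x - mu)^T cov^{-1} (x - mu).
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        return maha + logdet
    # Smaller score means larger Gaussian log density (equal priors assumed).
    return score(mu1, cov1) <= score(mu2, cov2)


def tv_from_bayes_risk(mu1, cov1, mu2, cov2, n=200_000, seed=0):
    """Monte Carlo estimate of TV = 1 - 2 * (Bayes risk) using the true parameters."""
    rng = np.random.default_rng(seed)
    x1 = rng.multivariate_normal(mu1, cov1, size=n)
    x2 = rng.multivariate_normal(mu2, cov2, size=n)
    err1 = np.mean(~bayes_rule(x1, mu1, cov1, mu2, cov2))  # class-1 points sent to class 2
    err2 = np.mean(bayes_rule(x2, mu1, cov1, mu2, cov2))   # class-2 points sent to class 1
    risk = 0.5 * (err1 + err2)
    return 1.0 - 2.0 * risk


def tv_equal_cov(mu1, mu2, cov):
    """Closed-form TV for a shared covariance: 2*Phi(delta/2) - 1 = erf(delta / (2*sqrt(2)))."""
    diff = mu1 - mu2
    delta = sqrt(diff @ np.linalg.inv(cov) @ diff)
    return erf(delta / (2.0 * sqrt(2.0)))


if __name__ == "__main__":
    mu1, mu2, cov = np.zeros(3), np.array([1.0, 0.0, 0.0]), np.eye(3)
    print("Monte Carlo TV:", tv_from_bayes_risk(mu1, cov, mu2, cov))
    print("Closed-form TV:", tv_equal_cov(mu1, mu2, cov))
```

The shared-covariance case is used for the check because its TV distance has a closed form; with unequal covariances the same `tv_from_bayes_risk` call applies, but no closed-form comparison is provided here.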

Figures (5)

  • Figure 1: In this case, the supports of $\mathbb{P}$ and $\mathbb{Q}$ are completely non-overlapping, and hence Assumption \ref{Ass:Low_noise} holds with $C_0=0$ and any $\gamma>0$. It is evident that the estimation error in (\ref{TV_Bound}) is zero due to the disjoint nature of the histograms for any value of $n$ in this example.
  • Figure 2: True total variation ($x$-axis) versus estimated total variation ($y$-axis) in cases $(n,p) \in \{10^3, 10^4\} \times\{5,10\}$ under varying disparity between two Gaussian distributions.
  • Figure 3: The robustness of estimation errors of all methods with respect to data dimensionality.
  • Figure 4: The robustness of estimation errors of all methods with respect to noise added to data (dimension = $5$).
  • Figure 5: 25 synthetic images generated by GANs after 100, 300, and 500 epochs of training are displayed from left to right.

Theorems & Definitions (5)

  • Lemma 3.2
  • Lemma 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Theorem 3.6