Utility Theory of Synthetic Data Generation

Shirong Xu, Will Wei Sun, Guang Cheng

TL;DR

The paper develops a rigorous utility theory for synthetic data in supervised learning, focusing on two learning-utility notions: generalization similarity and consistent model ranking between models trained on synthetic versus real data. It formulates a two-stage framework where real data trains a generative model to produce synthetic data, and downstream tasks learn from the synthetic data, introducing the utility metric U that captures differences in generalization. The authors derive analytic bounds that decompose utility into feature fidelity, estimation of the regression function, and downstream model specification, showing that perfect distribution alignment is not strictly necessary for good utility. They introduce the $(V,d)$-fidelity level to quantify distributional similarity, and prove worst-case utility bounds under a low-noise condition, highlighting when consistent model comparison can hold even with imperfect synthetic data. Empirical results on nonparametric models and MNIST validate the theory, offering guidance for designing synthetic-data pipelines and understanding when synthetic data can safely substitute real data for downstream tasks.

Abstract

Synthetic data algorithms are widely employed in industry to generate artificial data for downstream learning tasks. While existing research primarily focuses on empirically evaluating the utility of synthetic data, its theoretical understanding is largely lacking. This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework. It considers two utility metrics: generalization and ranking of models trained on synthetic data. The former is defined as the generalization difference between models trained on synthetic and on real data. By deriving analytical bounds for this utility metric, we demonstrate that the synthetic feature distribution does not need to be similar to that of real data to ensure comparable generalization of synthetic models, provided that the downstream models are properly specified. The latter utility metric studies the relative performance of models trained on synthetic data. In particular, we discover that the distribution of synthetic data need not be similar to the real one to ensure consistent model comparison. Interestingly, consistent model comparison is still achievable even when synthetic responses are not well generated, as long as downstream models are separable by a generalization gap. Finally, extensive experiments on non-parametric models and deep neural networks have been conducted to validate these theoretical findings.
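The two-stage setup and the generalization-based utility metric can be illustrated with a small simulation. The sketch below is not the authors' code: the toy data distribution, the class-conditional Gaussian generator, the threshold classifier, and all function names are illustrative assumptions. It also assumes the utility is measured as the absolute difference in 0-1 risk on real test data between the model trained on real data and the model trained on synthetic data.

```python
import random
import statistics


def sample_real(n, rng):
    """Toy real distribution (assumed): Y is a fair coin, X | Y ~ N(+1, 1) or N(-1, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(1.0 if y == 1 else -1.0, 1.0), y))
    return data


def fit_generator(data):
    """Stage 1: fit a class prior and one Gaussian per class to the real data."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        params[label] = (statistics.mean(xs), statistics.stdev(xs), len(xs) / len(data))
    return params


def sample_synthetic(params, n, rng):
    """Draw synthetic (x, y) pairs from the fitted generator."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < params[1][2] else 0
        mu, sigma, _ = params[y]
        data.append((rng.gauss(mu, sigma), y))
    return data


def risk(threshold, data):
    """0-1 risk of the rule 'predict 1 if x > threshold' on the given data."""
    return sum((x > threshold) != (y == 1) for x, y in data) / len(data)


def fit_stump(data):
    """Stage 2: pick the training value that minimizes training error as threshold."""
    return min((x for x, _ in data), key=lambda t: risk(t, data))


def utility_demo(seed=0, n_train=400, n_test=5000):
    rng = random.Random(seed)
    real_train = sample_real(n_train, rng)
    real_test = sample_real(n_test, rng)
    synthetic_train = sample_synthetic(fit_generator(real_train), n_train, rng)

    risk_real = risk(fit_stump(real_train), real_test)      # trained on real data
    risk_syn = risk(fit_stump(synthetic_train), real_test)  # trained on synthetic data
    return risk_real, risk_syn, abs(risk_syn - risk_real)


if __name__ == "__main__":
    r, s, u = utility_demo()
    print(f"risk(real-trained)={r:.3f}  risk(synthetic-trained)={s:.3f}  U={u:.3f}")
```

Both models are evaluated on held-out real data, so a small gap means the synthetic-trained model generalizes about as well as the real-trained one in this toy setting. Repeating the comparison for two downstream model classes and checking whether their ordering agrees would give a simple check of the ranking notion.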

Paper Structure

This paper contains 20 sections, 10 theorems, 144 equations, 12 figures, 2 tables.

Key Result

Theorem 1

Let $\widehat{f}$ and $\widetilde{f}$ be classifiers trained on $\mathcal{D}$ and $\widetilde{\mathcal{D}}$, respectively. It holds that [display equation missing], where $\Delta_1 = \mathbb{E}_{\bm{X}}\left[I(\widehat{f}(\bm{X}) \neq f_{\mathcal{F}}^{\star}(\bm{X}))\right]$ and $\Delta_2 = \mathbb{E}_{\bm{X}}\left[I(\widetilde{f}(\bm{X}) \neq \widetilde{f}_{\mathcal{F}}^{\star}(\bm{X}))\right]$.

Figures (12)

  • Figure 1: In this example, the true decision boundary is nonlinear, whereas the decision boundary for the downstream task is linear. Notably, if the misclassified samples are removed, the downstream linear decision boundary in the right plot is identical to that in the left plot, as it achieves zero prediction error.
  • Figure 2: The architecture for generating and evaluating synthetic data in supervised learning. Red arrows indicate possible privacy breaches.
  • Figure 3: The behavior of $U(\widehat{f}, \widetilde{f})$ under varying training sample sizes and different generative models.
  • Figure 4: The real samples are generated similarly to those shown in Figure \ref{fig:Bound_Con}, and generative models are trained with sample sizes of $\{500 \times 2^i, i=0,1,2\}$. We evaluate two downstream tasks: a linear SVM (top left) and a decision tree (top right). The distributional difference between real and synthetic data is measured by the energy distance (bottom left) and the KS distance (bottom right).
  • Figure 5: The similarity between $\mathbb{P}_{\bm{X},Y}$ and $\mathbb{P}_{\widetilde{\bm{X}},\widetilde{Y}}$ can be decomposed into two key aspects: (1) feature fidelity: quantifying the proximity between $\mathbb{P}_{\bm{X}}$ and $\mathbb{P}_{\widetilde{\bm{X}}}$; and (2) functional relationship estimation: assessing the effectiveness of $\mathbb{P}_{\widetilde{Y}|\widetilde{\bm{X}}}$ in capturing the underlying functional relationship inherent in $\mathbb{P}_{Y|\bm{X}}$.
  • ...and 7 more figures
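
The caption of Figure 4 names two distributional distances, the energy distance and the KS distance. As a minimal sketch of what these quantities measure, the code below computes both for one-dimensional samples. The function names are hypothetical, and restricting to a single feature is an assumption: the listing does not say how the distances were applied to multivariate data. The energy distance here uses the plug-in (V-statistic) form, the square root of $2\,\mathbb{E}|X-Y| - \mathbb{E}|X-X'| - \mathbb{E}|Y-Y'|$.

```python
from bisect import bisect_right
from math import sqrt


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # Both empirical CDFs are step functions, so the largest gap occurs at a sample point.
    return max(abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b)) for v in a + b)


def mean_abs_diff(a, b):
    """Average of |x - y| over all pairs with x from a and y from b."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(a, b):
    """Plug-in energy distance between two one-dimensional samples."""
    d2 = 2 * mean_abs_diff(a, b) - mean_abs_diff(a, a) - mean_abs_diff(b, b)
    return sqrt(max(d2, 0.0))  # guard against tiny negative values from rounding
```

Both functions return 0 when the two samples coincide. The KS statistic reaches 1 when the samples do not overlap at all, whereas the energy distance keeps growing as the samples move further apart.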

Theorems & Definitions (18)

  • Theorem 1
  • Example 1
  • Example 2
  • Definition 1
  • Theorem 2
  • Corollary 1
  • Theorem 3: Worst-Case Utility Bound
  • Theorem 4
  • Definition 2
  • Example 3
  • ...and 8 more