High-dimensional Analysis of Synthetic Data Selection

Parham Rezaei, Filip Kovacevic, Francesco Locatello, Marco Mondelli

TL;DR

The paper addresses how to optimally select synthetic data to improve real-data generalization in high-dimensional regression. By deriving deterministic equivalents for the excess risk of the min-norm estimator in both the under- and over-parameterized regimes, it shows that the risk depends on the two distributions only through their covariances, via $M=\Sigma_s^{1/2}\Sigma_t^{-1/2}$, and is independent of mean shifts in mixed training. It proves that covariance matching $\Sigma_s \propto \Sigma_t$ minimizes the asymptotic risk under certain conditions and demonstrates that this simple criterion matches or outperforms several sophisticated baselines across diverse datasets, models, and generative methods. The empirical results, spanning CNNs and transformers on CIFAR-10, ImageNet-100, and RxRx1, support covariance matching as a robust, scalable principle for synthetic-data augmentation with potential for broad impact in data-limited or privacy-sensitive contexts.

Abstract

Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets and generative models used for augmentation.
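
To make the covariance-matching idea concrete, here is a minimal Python sketch of one way to score how far a pool of synthetic features is from the target covariance. It is an illustration under stated assumptions, not the selection procedure used in the paper: the function name `covariance_mismatch`, the trace normalization, and the Frobenius distance are all hypothetical choices. The trace normalization is there so that any $\Sigma_s \propto \Sigma_t$ scores close to zero, and the empirical covariance is unaffected by a mean shift.

```python
import numpy as np


def covariance_mismatch(synthetic, target):
    """Hypothetical score: Frobenius distance between trace-normalized
    empirical covariances. Rows are samples, columns are features."""
    cov_s = np.cov(synthetic, rowvar=False)
    cov_t = np.cov(target, rowvar=False)
    # Dividing by the trace removes the overall scale, so covariances that
    # are proportional to each other give a score near zero.
    cov_s = cov_s / np.trace(cov_s)
    cov_t = cov_t / np.trace(cov_t)
    return float(np.linalg.norm(cov_s - cov_t, ord="fro"))


rng = np.random.default_rng(0)
p = 5
scales = np.array([3.0, 2.0, 1.0, 1.0, 0.5])

# Toy target data with anisotropic covariance.
target = rng.normal(size=(2000, p)) * scales
# Same covariance shape, but with a large mean shift and a global rescaling.
matched = 10.0 + 2.0 * rng.normal(size=(2000, p)) * scales
# Isotropic covariance: a genuine covariance shift.
mismatched = rng.normal(size=(2000, p))

score_matched = covariance_mismatch(matched, target)
score_mismatched = covariance_mismatch(mismatched, target)
print(score_matched, score_mismatched)
```

In this toy setup the mean-shifted, rescaled pool scores lower than the isotropic pool, which mirrors the statement that the covariance shift matters while the mean shift does not.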

Paper Structure

This paper contains 59 sections, 8 theorems, 236 equations, 2 figures, 12 tables.

Key Result

Theorem 4.1

Let $M=\Sigma_{s}^{1/2}\Sigma_{t}^{-1/2}$ and denote the eigenvalues of $M^\top M$ as $\lambda_1 \geq \dots \geq \lambda_p$. Then, under the assumptions from the preliminaries section and the start of this section, it holds that, with high probability, where $\alpha_1$ and $\alpha_2$ are the unique positive solutions to the following two equations

Figures (2)

  • Figure 1: Excess risk using training data from $\mathcal{N}(\mu_t, \Sigma_t)$ and synthetic data from $\mathcal{N}(\mu_s, \Sigma_s)$, where $\Sigma_t, \Sigma_s$ are Kac–Murdock–Szegö matrices (Toeplitz matrices with geometrically decaying entries) with parameters $\rho_t, \rho_s$, scaled so that $\operatorname{Tr}[M^\top M]=p$. We pick $\|\mu_t\|_2=\|\mu_s\|_2=2\sqrt{p}$, $\rho_t=0.9$, $p=600$, $n_t=1200$, $n_s=1200$, unless varying the parameters in the plot. Each value is computed from 100 i.i.d. trials, the error band is at 1 standard deviation, and theoretical predictions are continuous lines. Different curves correspond to different values of $\rho_s$. (a) Changing the cosine similarity of the mean does not impact the risk (here, $\Sigma_s$ is scaled by $\eta:=\rho_s$). (b) Larger $\rho_s$ gives lower risk since $\Sigma_s$ is closer to $\Sigma_t$. (c) Scaling $\Sigma_s$ reduces the risk.
  • Figure 2: The portion of samples chosen from the set of leaked images shows that our proposed algorithm reliably selects real samples among the pool of generated examples.
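
The covariance setup in the Figure 1 caption can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the helper names `kms` and `sym_power` are hypothetical, $\rho_s = 0.5$ is an arbitrary example value, and the caption does not say which matrix is rescaled, so the sketch assumes $\Sigma_s$ is multiplied by a scalar chosen so that $\operatorname{Tr}[M^\top M]=p$.

```python
import numpy as np


def kms(p, rho):
    """Kac-Murdock-Szego matrix: entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def sym_power(A, power):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**power) @ V.T


p, rho_t, rho_s = 600, 0.9, 0.5  # rho_s is an example value (assumption)
sigma_t = kms(p, rho_t)
sigma_s = kms(p, rho_s)

# Tr[M^T M] = Tr[Sigma_t^{-1} Sigma_s], so rescaling Sigma_s by
# p / Tr[Sigma_t^{-1} Sigma_s] enforces Tr[M^T M] = p (assumed convention).
sigma_s *= p / np.trace(np.linalg.solve(sigma_t, sigma_s))

M = sym_power(sigma_s, 0.5) @ sym_power(sigma_t, -0.5)
# Eigenvalues lambda_1 >= ... >= lambda_p of M^T M, as in Theorem 4.1.
eigs = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
print(np.trace(M.T @ M), eigs[0], eigs[-1])
```

Under this convention, setting `rho_s = rho_t` gives $M = I$, so every eigenvalue of $M^\top M$ equals 1; this is the covariance-matched case $\Sigma_s \propto \Sigma_t$.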

Theorems & Definitions (8)

  • Theorem 4.1
  • Proposition 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Proposition A.1
  • Proposition A.2
  • Proposition A.3