Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory

Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid

TL;DR

This work addresses when synthetic data can improve model performance in high-dimensional settings where feature distributions of synthetic data may shift from real data. It develops a Gaussian-mixture, noise-inclusive statistical model and analyzes a Ridge classifier trained on a mix of real and pruned synthetic data using random matrix theory to derive deterministic equivalents. The main contributions include a high-dimensional extension of prior results, revealing a smooth phase transition in fully synthetic scenarios and providing scalar fixed-point parameters that govern performance; the theory is validated across toy models, Amazon Reviews, MNIST, and LLM safety QA tasks. The findings offer principled guidance on synthetic data generation and verification, highlighting the critical roles of generative-model quality and pruning effectiveness for practical large-scale learning systems.

Abstract

Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only high-quality data based on a score function (human or machine feedback). Previous work by Feng et al. (2024) analyzed models trained on synthetic data as the sample size increases. We extend this by using random matrix theory to derive the performance of a binary classifier trained on a mix of real and pruned synthetic data in a high-dimensional setting. Our findings identify conditions under which synthetic data could improve performance, focusing on the quality of the generative model and the verification strategy. We also show a smooth phase transition in synthetic label noise, contrasting with the sharp behavior found previously in the infinite-sample limit. Experiments with toy models and large language models validate our theoretical results.
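
The setup described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code: the closed form used for the Ridge classifier, the reading of $(\rho, \phi)$ as the probabilities of keeping a wrongly and a correctly labeled synthetic sample, the Gaussian generator fitted on real data, and all function names and default values are hypothetical choices made for illustration.

```python
import numpy as np


def sample_mixture(n, mu, rng):
    """Draw n samples x = y * mu + z with z ~ N(0, I) and y uniform in {-1, +1}."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, mu.size))
    return x, y


def ridge_classifier(x, y, gamma):
    """Assumed form: w = (X^T X / N + gamma * I)^{-1} X^T y / N."""
    n, p = x.shape
    return np.linalg.solve(x.T @ x / n + gamma * np.eye(p), x.T @ y / n)


def prune_mask(y_noisy, y_true, rho, phi, rng):
    """Assumed verifier: keep with prob. phi if the label is correct, rho otherwise."""
    keep_prob = np.where(y_noisy == y_true, phi, rho)
    return rng.random(y_noisy.size) < keep_prob


def run(p=50, n=200, m=2000, eps=0.3, rho=0.0, phi=1.0, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.full(p, 2.0 / np.sqrt(p))  # hypothetical class mean with norm 2
    x_real, y_real = sample_mixture(n, mu, rng)

    # Hypothetical generative model: a Gaussian fitted on the real samples.
    mu_hat = (y_real[:, None] * x_real).mean(axis=0)
    resid = x_real - y_real[:, None] * mu_hat
    c_hat = resid.T @ resid / n

    # Synthetic samples whose labels are flipped with probability eps.
    y_true = rng.choice([-1.0, 1.0], size=m)
    noise = rng.multivariate_normal(np.zeros(p), c_hat, size=m)
    x_syn = y_true[:, None] * mu_hat + noise
    y_noisy = np.where(rng.random(m) < eps, -y_true, y_true)

    # Prune the synthetic set, then train on real + kept synthetic data.
    keep = prune_mask(y_noisy, y_true, rho, phi, rng)
    x_mix = np.vstack([x_real, x_syn[keep]])
    y_mix = np.concatenate([y_real, y_noisy[keep]])

    # Evaluate on fresh real data.
    x_test, y_test = sample_mixture(5000, mu, rng)

    def accuracy(w):
        return float(np.mean(np.sign(x_test @ w) == y_test))

    return {
        "acc_real": accuracy(ridge_classifier(x_real, y_real, gamma)),
        "acc_mix": accuracy(ridge_classifier(x_mix, y_mix, gamma)),
        "n_kept": int(keep.sum()),
        "kept_all_correct": bool(np.all(y_noisy[keep] == y_true[keep])),
    }
```

Calling `run()` uses the oracle setting $(\rho, \phi) = (0, 1)$, so every retained synthetic label is correct; passing, for example, `rho=1.0, phi=0.5` gives a weaker verifier. Comparing `acc_real` and `acc_mix` across such settings gives a rough empirical feel for how pruning quality affects the result.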

Paper Structure

This paper contains 55 sections, 18 theorems, 156 equations, 11 figures, 3 tables.

Key Result

Theorem 4.2

Let ${\bm{w}}$ be the Ridge classifier as defined in equation eq:w_q and suppose that Assumption assum:growth_rate holds. The decision function ${\bm{w}}^\top {\bm{x}}$, on some (real) test sample ${\bm{x}} \in {\mathcal{C}}_a$, with corresponding label $y=(-1)^a$ and independent of ${\mathbf{X}}$, where $\mu=\frac{c \Vert {\bm{\mu}} \Vert^2}{b + a \Vert {\bm{\mu}} \Vert^2}$ and Moreover, the as

Figures (11)

  • Figure 1: Illustration of the Marchenko-Pastur law: the histogram of eigenvalues of the empirical covariance matrix $\hat{{\mathbf{C}}}$ (as per equation (\ref{eq:generative-model})) for different values of $\hat{n}$. The histograms correspond to $p = 500$ with $\hat{n} = 500$ (in blue) and $\hat{n} = 5 \times 10^{4}$ (in red). The line plots depict the limiting Marchenko-Pastur law. As $\hat{n}$ grows, the distribution of eigenvalues concentrates around $1$.
  • Figure 2: Behavior of $(\delta_r^*, \delta_s^*, \delta_g^*)$ as a function of the ratio $\frac{p}{n}$. For a small ratio $\frac{p}{n}$, the values of $\delta_r^*, \delta_s^*, \delta_g^*$ are close to $0$. $(\delta_r^*, \delta_s^*, \delta_g^*)$ are computed by iterating the system (\ref{eq:deltas}) starting from random values.
  • Figure 3: Scatter plots correspond to empirical test accuracy while lines correspond to the theoretical counterpart as per Theorem \ref{thm:toy-setting}. The parameters used in these experiments are: $n = \hat{n} = 1000$, $\Vert {\bm{\mu}} \Vert = 0.7$ and $\gamma = 1$, with $(\rho, \phi) = (0, 1)$ for Oracle supervision and $(\rho, \phi) = (1, 0.5)$ for Weak supervision. The parameter $\varepsilon$ varies with the proportion of synthetic data: it is set to the misclassification error of a classifier trained on synthetic data only. As theoretically anticipated, supervision of the synthetic data boosts performance, while distribution shift degrades it.
  • Figure 4: Phase transition in terms of label noise as predicted by Corollary \ref{corolary:train-synth-only}. The critical value for $\varepsilon$ is predicted at $\varepsilon^* = ( 1 + \frac{\rho}{\phi} )^{-1}$. We fix $p = 100$ and vary $m$. The remaining parameters are $\Vert {\bm{\mu}} \Vert = 1$, $\rho = 0.3$ and $\phi = 0.8$, i.e., $\varepsilon^* \approx 0.73$.
  • Figure 5: Illustration of two different generation schemes for the MNIST data. Top figure: generating MNIST-like data samples by estimating only the mean of each class $\hat{{\bm{\mu}}}_a$ for $a\in [10]$, without estimating the covariance matrix, i.e., samples are drawn from the distribution ${\mathcal{N}}(\hat{{\bm{\mu}}}_a, {\mathbf{I}}_p)$. Bottom figure: generating samples by estimating both the mean and covariance of each class, as in the generative model defined in equation (\ref{eq:generative-model}).
  • ...and 6 more figures
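
Two of the captions above can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this overview: it takes the population covariance to be the identity and the mean to be known (zero), uses the standard Marchenko-Pastur density for ratio $c = p/\hat{n} \le 1$, and the function names are hypothetical. It also evaluates the critical noise level $\varepsilon^* = (1 + \rho/\phi)^{-1}$ quoted for Figure 4.

```python
import numpy as np


def mp_density(x, c):
    """Standard Marchenko-Pastur density for ratio c = p / n_hat in (0, 1], unit variance."""
    lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    x = np.atleast_1d(np.asarray(x, dtype=float))
    inside = (x > lo) & (x < hi)
    dens = np.zeros_like(x)
    xi = x[inside]
    dens[inside] = np.sqrt((hi - xi) * (xi - lo)) / (2 * np.pi * c * xi)
    return dens


def covariance_eigenvalues(p, n_hat, rng):
    """Eigenvalues of Z^T Z / n_hat for Z with i.i.d. N(0, 1) entries (identity covariance assumed)."""
    z = rng.standard_normal((n_hat, p))
    return np.linalg.eigvalsh(z.T @ z / n_hat)


def eps_star(rho, phi):
    """Critical label-noise level from the Figure 4 caption: (1 + rho / phi)^{-1}."""
    return 1.0 / (1.0 + rho / phi)
```

With `rng = np.random.default_rng(0)`, a histogram of `covariance_eigenvalues(500, 500, rng)` can be compared against `mp_density(grid, 1.0)`, and one of `covariance_eigenvalues(500, 50000, rng)` against `mp_density(grid, 0.01)`; the second set of eigenvalues should sit in a much narrower band around $1$. For the Figure 4 parameters, `eps_star(0.3, 0.8)` equals $0.8 / 1.1 \approx 0.727$, consistent with the rounded value $0.73$ in the caption.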

Theorems & Definitions (22)

  • Theorem 4.2: Theoretical performance
  • Corollary 4.3: Performance when training only on synthetic data
  • Lemma A.1: Inverse identity
  • Lemma A.2: Woodbury
  • Lemma A.3: Sherman-Morrison
  • Lemma A.4: Deterministic equivalent of ${\mathbf{Q}}$
  • proof
  • Lemma A.5: Deterministic equivalent of ${\mathbf{Q}} {\mathbf{A}} {\mathbf{Q}}$
  • proof
  • Corollary A.6: Trace identities
  • ...and 12 more
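
For readers unfamiliar with the matrix identities named in the list above, the following sketch numerically checks the textbook forms of the Sherman-Morrison and Woodbury identities. The lemma statements themselves are not reproduced in this overview, so the exact form and notation used in the paper may differ; the function names here are hypothetical.

```python
import numpy as np


def sherman_morrison_inverse(a_inv, u, v):
    """Inverse of A + u v^T given A^{-1}; requires 1 + v^T A^{-1} u != 0."""
    au = a_inv @ u
    va = v @ a_inv
    return a_inv - np.outer(au, va) / (1.0 + v @ au)


def woodbury_inverse(a_inv, u, c, v):
    """Inverse of A + U C V given A^{-1}; U is p x k, C is k x k invertible, V is k x p."""
    core = np.linalg.inv(np.linalg.inv(c) + v @ a_inv @ u)
    return a_inv - a_inv @ u @ core @ v @ a_inv
```

Both functions can be compared against a direct call to `np.linalg.inv` on the perturbed matrix using `np.allclose`. Identities of this kind are typically used to isolate the contribution of a single sample (a rank-one term) from a resolvent matrix.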