Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees

Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano

TL;DR

GESPI addresses statistical inference under data scarcity by wrapping any base inference method with a synthetic-data booster that preserves distribution-free guarantees. It operates via three runs—Base on real data, Guardrail at a relaxed error level, and Synthetic-powered on pooled data—and aggregates their outputs to bound risk at $\alpha+\varepsilon$, with improvements when the synthetic distribution closely matches the real one. The paper provides general theory (finite-sample, distribution-free guarantees with a TV-distance-based refinement), and extends to conformal prediction, conformal risk control, one-sided and multiple hypothesis testing, including a two-sided guardrail variant. Empirical results across image classification, protein structure prediction, and outlier detection demonstrate robust error control, variance reduction, and power gains when synthetic data are informative, while maintaining guarantees when they are not. This establishes a practical framework for safely exploiting synthetic data to boost sample efficiency in diverse inference tasks, with notable applications to AI model evaluation and interpretability.
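The three-run structure described above can be illustrated for split conformal prediction. This is a minimal sketch, not the paper's implementation: it assumes the aggregation clips the synthetic-powered threshold between the Guardrail threshold (level $\alpha+\varepsilon$, real data) and the Base threshold (level $\alpha$, real data). The exact aggregation rule is the one the paper defines in eqn:alg_spi, which is not reproduced in this summary. The function names `conformal_threshold` and `gespi_style_threshold` are hypothetical.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score.

    Returns +inf when the calibration set is too small for level alpha.
    """
    ordered = sorted(scores)
    n = len(ordered)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else ordered[k - 1]


def gespi_style_threshold(real_scores, synth_scores, alpha, eps):
    """Aggregate three runs of the same base method (ASSUMED clipping rule).

    base      : real data only, level alpha
    guardrail : real data only, relaxed level alpha + eps
    synth     : pooled real + synthetic data, level alpha
    """
    base = conformal_threshold(real_scores, alpha)
    guardrail = conformal_threshold(real_scores, alpha + eps)
    synth = conformal_threshold(list(real_scores) + list(synth_scores), alpha)
    # guardrail <= base always holds, so this clips synth into [guardrail, base]:
    # never less conservative than the guardrail, never more than the base run.
    return min(base, max(synth, guardrail))


real = [float(i) for i in range(1, 101)]  # 100 real calibration scores

# Overly conservative synthetic scores: the output falls back to the base run.
print(gespi_style_threshold(real, [100.0] * 1000, alpha=0.05, eps=0.10))
# Overly liberal synthetic scores: the output is held at the guardrail.
print(gespi_style_threshold(real, [0.0] * 1000, alpha=0.05, eps=0.10))
```

Under this assumed rule, the returned threshold is never smaller than the Guardrail threshold, which is the mechanism behind an error bound of $\alpha+\varepsilon$ regardless of synthetic-data quality. When the synthetic scores resemble the real ones, the pooled threshold usually lies inside the interval and is used directly.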

Abstract

The rapid proliferation of high-quality synthetic data -- generated by advanced AI models or collected as auxiliary data from related tasks -- presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when synthetic data is of low quality. The error of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction, and comparing large reasoning models on complex math problems.

Paper Structure

This paper contains 77 sections, 7 theorems, 83 equations, 40 figures, 5 tables.

Key Result

Theorem 3.3

Given $\alpha,\varepsilon >0$, suppose that algorithm $\mathrm{Alg}$ satisfies eqn:target for $\alpha$ and $\alpha+\varepsilon$, and that Condition con:target_order_formal holds. Then the algorithm $\widetilde{\mathrm{Alg}}$ defined in eqn:alg_spi satisfies [...] for all $P,Q \in \mathcal{P}$, where [...]. Here, $\textnormal{d}_{\textnormal{TV}}$ denotes the total variation distance. $\textnormal{d}_{\ell,\math

Figures (40)

  • Figure 1: Overview of the GESPI framework. GESPI leverages a small real dataset and a large synthetic dataset. The procedure applies the base statistical method three times and aggregates the outputs in a way that guarantees error rate control while exploiting synthetic data when beneficial.
  • Figure 2: Visualization of protein structure prediction with error rate control. Panels show protein T1029 predictions with residues abstained on by the (a) OnlyReal and (b) GESPI methods. Red: residues abstained on. Blue: accepted residues. Gray: real experimental structure, aligned with the AlphaFold2 predicted structure. Quantitative results {abstention ratio, risk}: OnlyReal -- {$100\%$, $0\%$}; GESPI -- {$85.6\%$, $\approx7\%$}. See text in sec:exp-protein.
  • Figure 3: Performance comparisons for image classification with class-conditional coverage on ImageNet. Conformal prediction methods applied at level $\alpha = 5\%$ and $\varepsilon=2\%$. FLUX-generated images serve as the synthetic data. Results are shown for selected classes; see app-tab:img-coverage and app-tab:img-size for all classes.
  • Figure 4: Performance comparisons for protein structure prediction with error rate control. Conformal risk control methods at $\alpha=5\%$ and $15\%$. Left: average risk (fraction of residues with error $>3$Å). Right: average abstention rate.
  • Figure 5: Performance comparisons for outlier detection. Right: average FWER evaluated on three datasets: (1) Shuttle, (2) Credit-card, (3) KDDCup99, with $\alpha=15\%$. Left: empirical power versus empirical FWER on the Shuttle dataset. For both panels, the trimming proportion is $q=2.5\%$ and $\varepsilon=10\%$.
  • ...and 35 more figures

Theorems & Definitions (10)

  • Remark 3.1
  • Theorem 3.3
  • Theorem 3.4
  • Proposition E.1
  • Theorem F.2
  • Remark F.3
  • Remark F.4
  • Theorem H.1
  • Proposition H.2
  • Theorem H.3