Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees
Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano
TL;DR
GESPI addresses statistical inference under data scarcity by wrapping any base inference method with a synthetic-data booster that preserves distribution-free guarantees. It runs the base method three times (Base on real data, Guardrail at a relaxed error level, and Synthetic-powered on pooled data) and aggregates the three outputs so that the risk is bounded by $\alpha+\varepsilon$, with the error decreasing as the synthetic distribution more closely matches the real one. The paper provides general theory (finite-sample, distribution-free guarantees with a TV-distance-based refinement) and instantiates the framework for conformal prediction, conformal risk control, one-sided hypothesis testing, and multiple hypothesis testing, along with a two-sided guardrail variant. Empirical results across image classification, protein structure prediction, and outlier detection show robust error control, together with variance reduction and power gains when the synthetic data are informative, while the guarantees continue to hold when they are not. This establishes a practical framework for safely exploiting synthetic data to boost sample efficiency in diverse inference tasks, with notable applications to AI model evaluation and interpretability.
Abstract
The rapid proliferation of high-quality synthetic data -- generated by advanced AI models or collected as auxiliary data from related tasks -- presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when the synthetic data is of low quality. The error of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction and the comparison of large reasoning models on complex math problems.
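
As an illustration of the wrap-around idea, the following is a minimal sketch for split conformal prediction with a scalar score threshold. It is not the paper's algorithm: the clipping rule used to aggregate the three thresholds, the function names `conformal_quantile` and `guarded_threshold`, and the choice to compute the guardrail from the real scores at level $\alpha+\varepsilon$ are assumptions made for illustration. Under these assumptions, the returned threshold is never below the guardrail threshold, so the resulting prediction set contains the real-data conformal set at level $\alpha+\varepsilon$ and inherits its coverage of at least $1-\alpha-\varepsilon$ regardless of the synthetic scores. It is also never above the base threshold, so uninformative synthetic data cannot enlarge the set beyond the standard one.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points for this level
    return float(scores[k - 1])


def guarded_threshold(real_scores, synth_scores, alpha, eps):
    """Hypothetical aggregation of three runs of the same base procedure.

    Assumption (not taken from the paper): the synthetic-powered threshold
    is clipped to lie between the guardrail and base thresholds.
    """
    real_scores = np.asarray(real_scores, dtype=float)
    synth_scores = np.asarray(synth_scores, dtype=float)

    # Base run: real calibration scores at the nominal level alpha.
    q_base = conformal_quantile(real_scores, alpha)
    # Guardrail run: real calibration scores at the relaxed level alpha + eps.
    q_guard = conformal_quantile(real_scores, alpha + eps)
    # Synthetic-powered run: pooled real and synthetic scores at level alpha.
    pooled = np.concatenate([real_scores, synth_scores])
    q_synth = conformal_quantile(pooled, alpha)

    # q_guard <= q_base because the guardrail targets lower coverage,
    # so the clipping interval [q_guard, q_base] is well defined.
    return min(max(q_synth, q_guard), q_base)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.exponential(size=20)    # small real calibration set
    synth = rng.exponential(size=500)  # larger synthetic calibration set
    print(guarded_threshold(real, synth, alpha=0.1, eps=0.05))
```

In this sketch, synthetic scores that are far smaller than the real ones drive the output down to the guardrail threshold, and synthetic scores that are far larger drive it up to the base threshold. Only when the pooled quantile falls between the two does the synthetic data determine the result.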
