Using Synthetic Data to estimate the True Error is theoretically and practically doable
Hai Hoang Thanh, Duy-Tung Nguyen, Hung The Tran, Khoat Than
TL;DR
This work addresses the problem of estimating a model's true error when labeled test data are scarce. It derives generalization bounds that incorporate synthetic data and proposes OSYN, a method that optimizes synthetic samples to maximize a proven lower bound on the true error. The theoretical contributions include a multinomial-region lower bound and an asymptotic relation showing that generator quality drives bound tightness; in practice, OSYN iteratively selects and refines synthetic points to tighten the bound. Empirically, OSYN yields near-oracle estimates across simulated and real tabular tasks. The estimates depend strongly on generator fidelity but remain robust to generator variety and partition choices. The findings offer a cost-effective, theoretically justified pathway for reliable model evaluation under limited labeled data, with concrete guidance on diagnostics, parameter settings, and when to prefer synthetic-data-based evaluation. The work also notes limitations, such as the lack of an upper bound and scalability challenges in high-dimensional settings, and suggests two-sided bounds and high-dimensional extensions as future work.
Abstract
Accurately evaluating model performance is crucial for deploying machine learning systems in real-world applications. Traditional methods often require a sufficiently large labeled test set to ensure a reliable evaluation. However, in many contexts, a large labeled dataset is costly and labor-intensive to obtain. We must therefore sometimes evaluate a model with only a few labeled samples, which is theoretically challenging. Recent advances in generative models offer a promising alternative by enabling the synthesis of high-quality data. In this work, we systematically investigate the use of synthetic data to estimate the test error of a trained model when labeled data are limited. To this end, we develop novel generalization bounds that take synthetic data into account. These bounds suggest novel ways to optimize synthetic samples for evaluation and reveal, theoretically, the significant role of the generator's quality. Inspired by these bounds, we propose a theoretically grounded method to generate optimized synthetic data for model evaluation. Experimental results on simulated and tabular datasets demonstrate that, compared to existing baselines, our method achieves accurate and more reliable estimates of the test error.
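To make the difficulty of evaluating with few labeled samples concrete, the following minimal sketch computes the half-width of a standard two-sided Hoeffding confidence interval for a loss bounded in [0, 1]. This is a textbook bound used purely for illustration; it is not one of the bounds derived in this work, and the function name `hoeffding_half_width` and the default `delta=0.05` are assumptions made for this example.

```python
import math


def hoeffding_half_width(n: int, delta: float = 0.05) -> float:
    """Half-width of a two-sided Hoeffding interval for a [0, 1]-bounded loss.

    With probability at least 1 - delta over n i.i.d. labeled test samples,
    |true_error - empirical_error| <= sqrt(ln(2 / delta) / (2 * n)).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    # Illustrative test-set sizes (hypothetical, not taken from the paper).
    for n in (20, 200, 2000, 20000):
        print(f"n={n:>6d}  half-width={hoeffding_half_width(n):.3f}")
```

Under these assumptions, 20 labeled samples give a half-width of roughly 0.30 at 95% confidence, so the empirical test error alone says very little about the true error. Because the width shrinks only as 1/sqrt(n), a tenfold reduction requires a hundredfold increase in labeled data, which is the cost that motivates bringing synthetic data into the evaluation.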
