Filtering with Confidence: When Data Augmentation Meets Conformal Prediction
Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, Claire Donnat
TL;DR
This work tackles synthetic data augmentation by introducing conformal data augmentation, which filters generated samples with provable risk control using conditional conformal risk prediction. The method is a two-stage pipeline: it first learns a quality score, then calibrates an instance-specific threshold via an RKHS-based conformal predictor, enabling approximate conditional coverage without requiring model logits or retraining. It provides a practical, plug-in wrapper around existing generative augmentation, applicable to text, tabular, and image data, and demonstrates consistent improvements in $F_1$ across diverse tasks and regimes, including imbalanced and low-data settings, while offering guarantees on the number of poor-quality samples included. The results underscore the framework’s potential to improve robustness and diversity in augmented datasets, making synthetic data use safer and more effective in real-world applications.
Abstract
With promising empirical performance across a wide range of applications, synthetic data augmentation appears to be a viable solution to data scarcity and the demands of increasingly data-intensive models. Its effectiveness lies in expanding the training set in a way that reduces estimator variance while introducing only minimal bias. Controlling this bias is therefore critical: effective data augmentation should generate diverse samples from the same underlying distribution as the training set, with minimal shifts. In this paper, we propose conformal data augmentation, a principled data filtering framework that leverages the power of conformal prediction to produce diverse synthetic data while filtering out poor-quality generations with provable risk control. Our method is simple to implement and requires neither access to internal model logits nor large-scale model retraining. We demonstrate the effectiveness of our approach across multiple tasks, including topic prediction, sentiment analysis, image classification, and fraud detection, showing consistent performance improvements of up to 40 percentage points (pp) in $F_1$ score over unaugmented baselines, and 4 pp over other filtered augmentation baselines.
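To make the filtering idea concrete, the following is a minimal sketch of threshold calibration with marginal conformal risk control. It is a deliberate simplification and not the method of the paper, which calibrates an instance-specific threshold with an RKHS-based conditional predictor. The function names `calibrate_threshold` and `filter_synthetic`, the precomputed quality scores, and the binary `is_bad` calibration labels are all assumptions introduced for illustration.

```python
import numpy as np

def calibrate_threshold(scores, is_bad, alpha):
    """Return the smallest threshold lam such that accepting samples with
    score > lam keeps the adjusted empirical risk at or below alpha.

    scores : quality scores on a labeled calibration set (hypothetical input)
    is_bad : True where a calibration sample is a poor-quality generation
    alpha  : target bound on P(new sample is bad AND accepted)
    """
    scores = np.asarray(scores, dtype=float)
    is_bad = np.asarray(is_bad, dtype=bool)
    n = len(scores)
    # Candidate thresholds: accept everything (-inf), then each observed score.
    candidates = np.concatenate(([-np.inf], np.unique(scores)))
    for lam in candidates:
        # Loss for one sample: 1 if it is bad and still accepted (score > lam).
        n_bad_accepted = int(np.sum(is_bad & (scores > lam)))
        # Conformal risk control adjustment for a loss bounded by 1.
        if (n_bad_accepted + 1) / (n + 1) <= alpha:
            return lam
    # No threshold meets the target (calibration set too small): accept nothing.
    return np.inf

def filter_synthetic(new_scores, lam):
    """Boolean mask of synthetic samples to keep under threshold lam."""
    return np.asarray(new_scores, dtype=float) > lam

# Toy usage: 19 calibration samples, the 5 lowest-scoring ones are bad.
cal_scores = np.arange(1, 20) / 20.0
cal_is_bad = np.arange(19) < 5
lam_hat = calibrate_threshold(cal_scores, cal_is_bad, alpha=0.12)
keep_mask = filter_synthetic(np.array([0.1, 0.2, 0.9]), lam_hat)
```

Because the loss is non-increasing in the threshold, the first candidate that satisfies the adjusted bound is the least conservative valid choice. Under exchangeability of calibration and new samples, this type of calibration bounds the marginal probability that a new sample is both poor-quality and accepted. Conditional, per-instance control requires the more refined approach the paper proposes.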
