Filtering with Confidence: When Data Augmentation Meets Conformal Prediction

Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, Claire Donnat

TL;DR

This work tackles synthetic data augmentation by introducing conformal data augmentation, which filters generated samples with provable risk control using conditional conformal risk prediction. The method builds a two-stage pipeline: learn a quality score, then calibrate an instance-specific threshold via an RKHS-based conformal predictor, enabling approximate conditional coverage without requiring model logits or retraining. It provides a practical, plug-in wrapper around existing generative augmentation, applicable to text, tabular, and image data, and demonstrates consistent improvements in $F_1$ across diverse tasks and regimes, including imbalanced and low-data settings, while offering guarantees on the number of poor inclusions. The results underscore the framework’s potential to improve robustness and diversity in augmented datasets, making synthetic data use safer and more effective in real-world applications.

Abstract

With promising empirical performance across a wide range of applications, synthetic data augmentation appears to be a viable solution to data scarcity and the demands of increasingly data-intensive models. Its effectiveness lies in expanding the training set in a way that reduces estimator variance while introducing only minimal bias. Controlling this bias is therefore critical: effective data augmentation should generate diverse samples from the same underlying distribution as the training set, with minimal shifts. In this paper, we propose conformal data augmentation, a principled data filtering framework that leverages the power of conformal prediction to produce diverse synthetic data while filtering out poor-quality generations with provable risk control. Our method is simple to implement and requires neither access to internal model logits nor large-scale model retraining. We demonstrate the effectiveness of our approach across multiple tasks, including topic prediction, sentiment analysis, image classification, and fraud detection, showing consistent performance improvements of up to 40 percentage points (pp) in $F_1$ score over unaugmented baselines, and 4 pp over other filtered augmentation baselines.
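
To make the filtering idea above concrete, the following is a minimal sketch of threshold calibration with marginal conformal risk control. It is a simplification under stated assumptions, not the paper's method: the paper calibrates an instance-specific threshold with an RKHS-based conditional predictor, whereas this sketch picks a single global threshold. The function names `calibrate_threshold` and `filter_synthetic`, the toy scores, and the choice of loss (a retained sample that is labeled poor) are all hypothetical and introduced only for illustration.

```python
import numpy as np

# Hypothetical helper names; a marginal conformal-risk-control sketch,
# not the paper's RKHS-based conditional calibration.
def calibrate_threshold(scores, is_poor, alpha):
    """Return the smallest observed score t such that keeping every
    calibration sample with score >= t satisfies
    (number of poor samples kept + 1) / (n + 1) <= alpha."""
    scores = np.asarray(scores, dtype=float)
    is_poor = np.asarray(is_poor, dtype=bool)
    n = scores.size
    for t in np.unique(scores):  # np.unique returns sorted ascending values
        poor_kept = int(np.sum((scores >= t) & is_poor))
        if (poor_kept + 1) / (n + 1) <= alpha:
            return float(t)
    return float("inf")  # no threshold meets the target: keep nothing


def filter_synthetic(candidate_scores, threshold):
    """Boolean mask of synthetic candidates to retain."""
    return np.asarray(candidate_scores, dtype=float) >= threshold


# Toy calibration set: predicted quality scores and poor-quality labels.
calib_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
calib_is_poor = [1, 1, 1, 0, 1, 0, 0, 0, 0]
threshold = calibrate_threshold(calib_scores, calib_is_poor, alpha=0.25)
keep = filter_synthetic([0.35, 0.45, 0.95], threshold)
print(threshold, keep.tolist())
```

The loss used here is non-increasing in the threshold, which is the kind of monotonicity that conformal risk control relies on; raising the threshold can only remove poor samples from the retained set. A conditional variant would replace the single scalar threshold with a function of the input.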

Paper Structure

This paper contains 54 sections, 2 theorems, 23 equations, 13 figures, 8 tables, 1 algorithm.

Key Result

Lemma 3.1

Consider the function class $\mathcal{F}$ as defined in Equation eq:f, and assume $\mathcal{D}_{\text{calib}}\bigcup \mathcal{D}_{\text{aug}}$ are i.i.d. Suppose $\mathcal{L}_{\lambda}(\cdot, \cdot )$ is monotone (i.e. for any sets $\mathcal{S}_{i_0}^1\subseteq\mathcal{S}_{i_0}^2$, it must be the ca where $\gamma$ is the hyperparameter and $\hat{f}^{\hat{s}_{i_0}}_W\in \mathcal{F}_W$ is the fitted

Figures (13)

  • Figure 1: Illustration of the workflow in clinical disease prediction. Data augmentation candidate outputs from the generative model $h$ (GPT-4.1 nano in this example) are filtered by a quality predictor trained on $\mathcal{D}_{\text{train}}$ with a threshold calibrated by $\mathcal{D}_{\text{calib}}$. The retained output preserves the meaning of “common cold,” while the discarded output does not correspond to the intended symptom.
  • Figure 2: (a) Performance of different data augmentation methods on three tasks: diagnosis prediction, abstract topic prediction, and Twitter message sentiment analysis. Results are averaged over 20 replicates. Error bars denote the interquartile range (IQR), with centers representing the median and boundaries corresponding to the first and third quartiles. (b) Out-of-domain (OOD) evaluation on abstract topic prediction: the classifier is trained on statistical abstracts and tested on abstracts from different domains.
  • Figure 3: Examples of data generation procedures for an image of a Toucan (top row) and an image of an Arctic Fox (bottom row).
  • Figure 4: Sensitivity analysis of model performance across hyperparameters $\lambda$ and $\rho$ for three datasets. Panels (a)--(c) show results for symptom diagnosis, abstract, and sentiment analysis, respectively. For each dataset, we report precision, recall, and F1-score under $\lambda \in \{0.3, 0.5, 0.7\}$ and $\rho \in \{0,1,2\}$. The results are computed over 20 replicates. Error bars indicate the interquartile range, with centers representing the median and boundaries corresponding to the first and third quartiles.
  • Figure 5: Scatter plots comparing Gemini-Pro and Gemini-Flash scores for the symptom description, statistical abstract, and Twitter message datasets.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Lemma 3.1
  • Lemma 3.1: Coverage; cf. gibbs2025conformal, cherian2024large