Split Conformal Prediction under Data Contamination

Jase Clarkson, Wenkai Xu, Mihai Cucuringu, Yvik Swan, Gesine Reinert

TL;DR

This work examines split conformal prediction under data contamination modeled as ε-Huber contamination, deriving bounds on how calibration-score contamination affects coverage and set efficiency. It introduces Contamination Robust Conformal Prediction (CRCP) to adjust classification prediction sets under label noise, with finite-sample guarantees that converge as data grow. Theoretical results leverage Kolmogorov-Smirnov (KS) and Le Cam distances (and Wasserstein bounds) to quantify robustness and provide practical correction mechanisms. Empirical studies on synthetic data and CIFAR-10N demonstrate that standard conformal prediction over-covers under contamination, while CRCP achieves near-nominal coverage with substantially smaller prediction sets, highlighting its practical value in noisy-label scenarios.

Abstract

Conformal prediction is a non-parametric technique for constructing prediction intervals or sets from arbitrary predictive models under the assumption that the data is exchangeable. It is popular as it comes with theoretical guarantees on the marginal coverage of the prediction sets and the split conformal prediction variant has a very low computational cost compared to model training. We study the robustness of split conformal prediction in a data contamination setting, where we assume a small fraction of the calibration scores are drawn from a different distribution than the bulk. We quantify the impact of the corrupted data on the coverage and efficiency of the constructed sets when evaluated on "clean" test points, and verify our results with numerical experiments. Moreover, we propose an adjustment in the classification setting which we call Contamination Robust Conformal Prediction, and verify the efficacy of our approach using both synthetic and real datasets.
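The setting described in the abstract can be illustrated with a minimal simulation sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the scores are absolute values of Gaussian draws, the clean and contaminating components differ only in their standard deviations (`sigma1`, `sigma2`), and the function names are hypothetical. The sketch calibrates a split conformal threshold on contaminated scores and measures coverage on clean test scores.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def sample_scores(rng, n, eps, sigma1=1.0, sigma2=3.0):
    """Draw n scores; each comes from the contaminating component with probability eps.

    Assumption: scores are |N(0, sigma^2)| draws, purely for illustration.
    """
    return [
        abs(rng.gauss(0.0, sigma2 if rng.random() < eps else sigma1))
        for _ in range(n)
    ]

def clean_coverage(rng, n=500, eps=0.2, alpha=0.1, n_test=1000):
    """Calibrate on (possibly contaminated) scores, evaluate on clean test scores."""
    q = conformal_quantile(sample_scores(rng, n, eps), alpha)
    test_scores = sample_scores(rng, n_test, eps=0.0)
    return sum(s <= q for s in test_scores) / n_test

rng = random.Random(0)
reps = 100
cov_clean = sum(clean_coverage(rng, eps=0.0) for _ in range(reps)) / reps
cov_contam = sum(clean_coverage(rng, eps=0.2) for _ in range(reps)) / reps
print(f"clean calibration:        mean coverage {cov_clean:.3f}")
print(f"contaminated calibration: mean coverage {cov_contam:.3f}")
```

In this toy configuration the contaminating scores have heavier tails than the clean ones, so the calibrated threshold is inflated and coverage on clean test points exceeds the nominal level $1-\alpha$.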

Paper Structure (16 sections, 8 theorems, 65 equations, 1 figure, 3 tables)

Key Result

Lemma 3.1

Under the mixture model, when $(X_{n+1}, Y_{n+1}) \sim \pi_1$, we have $- \varepsilon \, d_{KS}(\Pi_1, \Pi_2) \leq \mathbb{P}_1( Y_{n+1} \in \widehat{C}_n(X_{n+1}) ) - (1-\alpha) \leq \frac{1}{n+1} + \varepsilon \, d_{KS}(\Pi_1, \Pi_2).$
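The bound in Lemma 3.1 can be checked numerically in a toy case. The sketch below rests on assumptions that are not from the paper: $\Pi_1$ and $\Pi_2$ are read as the score distributions under the clean and contaminating components, both are taken to be half-normal with standard deviations `sigma1` and `sigma2`, and all function names are hypothetical. The KS distance is approximated on a grid, and the clean coverage is averaged over repeated calibration draws.

```python
import math
import random

def half_normal_cdf(t, sigma):
    """CDF of |N(0, sigma^2)| at t >= 0."""
    return math.erf(t / (sigma * math.sqrt(2.0)))

def ks_distance(sigma1, sigma2, t_max=10.0, steps=10000):
    """Grid approximation of sup_t |F1(t) - F2(t)| for two half-normal CDFs."""
    return max(
        abs(half_normal_cdf(i * t_max / steps, sigma1)
            - half_normal_cdf(i * t_max / steps, sigma2))
        for i in range(steps + 1)
    )

def mean_clean_coverage(rng, n, eps, alpha, sigma1, sigma2, reps=1000):
    """Average clean coverage when calibration scores follow the eps-mixture."""
    k = math.ceil((n + 1) * (1 - alpha))  # index of the conformal quantile
    total = 0.0
    for _ in range(reps):
        scores = sorted(
            abs(rng.gauss(0.0, sigma2 if rng.random() < eps else sigma1))
            for _ in range(n)
        )
        q = scores[k - 1]
        # Given the threshold q, coverage of a clean test score is F1(q).
        total += half_normal_cdf(q, sigma1)
    return total / reps

# Illustrative parameter values (assumptions, not the paper's experiment).
n, eps, alpha, sigma1, sigma2 = 200, 0.2, 0.1, 1.0, 2.0
rng = random.Random(1)

d_ks = ks_distance(sigma1, sigma2)
lower = (1 - alpha) - eps * d_ks
upper = (1 - alpha) + 1.0 / (n + 1) + eps * d_ks
cov = mean_clean_coverage(rng, n, eps, alpha, sigma1, sigma2)
print(f"d_KS ~ {d_ks:.3f}, bounds [{lower:.3f}, {upper:.3f}], coverage {cov:.3f}")
```

With these values the simulated coverage sits above the nominal level $1-\alpha$ but inside the interval given by the lemma, which matches the over-coverage behaviour described in the TL;DR.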

Figures (1)

  • Figure 1: The mean and standard deviation of the coverage obtained over 100 repetitions of the regression experiment while varying $\varepsilon$ and $\sigma_2$. The mean coverage is marked by crosses; shaded regions indicate one standard deviation of the coverage; the straight horizontal line (in red) is the desired coverage level $0.9$. Left: we vary the standard deviation of the corruption $\sigma_2$ from 0 to 5, keeping $\varepsilon = 0.2$. Right: we vary the mixing proportion $\varepsilon$ from $0$ to $0.5$, keeping $\sigma_2=3.0$.

Theorems & Definitions (23)

  • Lemma 3.1
  • proof
  • Remark 3.2
  • Lemma 3.3
  • Remark 3.4
  • Proposition 3.5
  • proof
  • Remark 3.6
  • Lemma 3.7
  • proof
  • ...and 13 more