Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning
Yan Scholten, Stephan Günnemann
TL;DR
This work tackles the fragility of conformal prediction under data poisoning by introducing Reliable Prediction Sets (RPS). RPS combines two components: a smoothed score function that aggregates the votes of $k_t$ classifiers, each trained on a distinct partition of the training data, and a calibration strategy that forms a majority prediction set from sets calibrated on $k_c$ distinct calibration partitions. Together they yield pointwise reliability certificates under worst-case modifications of the training and calibration data. The authors prove marginal coverage on clean data and give explicit conditions for both coverage reliability and size reliability under various poisoning scenarios. Extensive experiments on image classification benchmarks show non-trivial robustness with manageable prediction-set sizes. The approach advances trustworthy uncertainty quantification in settings where data integrity cannot be guaranteed, with attention to practical considerations such as calibration data requirements, computational cost, and transferability to pretrained-model setups.
Abstract
Conformal prediction provides model-agnostic and distribution-free uncertainty quantification through prediction sets that are guaranteed to include the ground truth with any user-specified probability. Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data, which can significantly alter prediction sets in practice. As a solution, we propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning. To ensure reliability under training poisoning, we introduce smoothed score functions that reliably aggregate predictions of classifiers trained on distinct partitions of the training data. To ensure reliability under calibration poisoning, we construct multiple prediction sets, each calibrated on distinct subsets of the calibration data. We then aggregate them into a majority prediction set, which includes a class only if it appears in a majority of the individual sets. Both proposed aggregations mitigate the influence of datapoints in the training and calibration data on the final prediction set. We experimentally validate our approach on image classification tasks, achieving strong reliability while maintaining utility and preserving coverage on clean data. Overall, our approach represents an important step towards more trustworthy uncertainty quantification in the presence of data poisoning.
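The majority aggregation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_prediction_set`, the boolean membership-matrix input, and the strict-majority rule (more than $k_c/2$ votes) are assumptions made for illustration, and the calibration of the individual sets is not shown.

```python
import numpy as np

def majority_prediction_set(member):
    """Aggregate k_c prediction sets into one majority prediction set.

    member: boolean array of shape (k_c, num_classes); member[i, y] is True
            if class y is in the i-th prediction set (hypothetical input format).
    Returns the sorted class indices that appear in a strict majority of sets
    (the strict-majority threshold is an assumption of this sketch).
    """
    member = np.asarray(member, dtype=bool)
    k_c = member.shape[0]
    votes = member.sum(axis=0)            # number of sets containing each class
    return np.flatnonzero(votes > k_c / 2)

# Three individual sets over four classes: {0, 1}, {0, 2}, {0, 1, 3}
sets = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
], dtype=bool)
majority = majority_prediction_set(sets)  # classes 0 and 1 receive >= 2 of 3 votes
```

In this toy example, changing the contents of any one individual set can shift each class's vote count by at most one, which gives an intuition for why the aggregation limits the influence of individual calibration points on the final set.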
