Scalable and adaptive prediction bands with kernel sum-of-squares
Louis Allain, Sébastien da Veiga, Brian Staber
TL;DR
This work tackles finite-sample, distribution-free uncertainty quantification for regression by making conformal prediction adaptive and scalable. It introduces a generalized kernel sum-of-squares framework that learns a CP score via two RKHSs, yielding prediction bands whose width adapts to data and model confidence, while maintaining marginal coverage. A representer theorem and a dual formulation enable efficient optimization on large datasets, and a novel HSIC-based criterion provides robust local-coverage-driven hyperparameter tuning for adaptivity. The approach scales to thousands of samples and outperforms or matches existing adaptive CP methods, with practical significance for high-stakes decision making where reliable uncertainty quantification and adaptivity are crucial.
Abstract
Conformal Prediction (CP) is a popular framework for constructing prediction bands with valid coverage in finite samples, while being free of any distributional assumption. A well-known limitation of conformal prediction is its lack of adaptivity, although several works have introduced practically efficient alternative procedures. In this work, we build upon recent ideas that recast the CP problem as a statistical learning problem directly targeting coverage and adaptivity. This statistical learning problem is based on reproducing kernel Hilbert spaces (RKHS) and kernel sum-of-squares (SoS) methods. First, we extend previous results with a general representer theorem and exhibit the dual formulation of the learning problem. Crucially, this dual formulation can be solved efficiently by accelerated gradient methods with several hundred to several thousand samples, unlike previous strategies based on off-the-shelf semidefinite programming algorithms. Second, we introduce a new hyperparameter tuning strategy tailored specifically to target adaptivity through bounds on test-conditional coverage. This strategy, based on the Hilbert-Schmidt Independence Criterion (HSIC), is introduced here to tune kernel lengthscales in our framework, but has broader applicability since it could be used in any CP algorithm where the score function is learned. Finally, extensive experiments are conducted to show how our method compares to related work. All figures can be reproduced with the accompanying code.
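
To make two of the ingredients named above concrete, the following is a minimal sketch and not the authors' implementation. It shows (i) the standard split-conformal half-width, whose constant value across inputs is the lack of adaptivity the abstract refers to, and (ii) the usual biased empirical HSIC estimator with Gaussian kernels. The function names, the Gaussian kernel choice, and the default lengthscales are assumptions made for illustration. The kernel SoS score and the HSIC-based tuning criterion of the paper are not reproduced here.

```python
import numpy as np


def split_conformal_halfwidth(residuals, alpha):
    """Constant half-width q of a split-conformal band [f(x) - q, f(x) + q].

    `residuals` are y_i - f(x_i) on a held-out calibration set. Under
    exchangeability the band has marginal coverage at least 1 - alpha.
    """
    scores = np.sort(np.abs(np.asarray(residuals, dtype=float)))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the conformal quantile
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return scores[k - 1]


def gaussian_gram(x, lengthscale):
    """Gram matrix of a Gaussian kernel (an illustrative kernel choice)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)  # treat 1-D input as n points in dimension 1
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))


def hsic_biased(x, y, ls_x=1.0, ls_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = len(x)
    K = gaussian_gram(x, ls_x)
    L = gaussian_gram(y, ls_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heteroscedastic toy data with the trivial predictor f(x) = 0.
    x_cal = rng.uniform(0.0, 1.0, size=500)
    y_cal = (0.1 + x_cal) * rng.standard_normal(500)
    q = split_conformal_halfwidth(y_cal, alpha=0.1)
    print("constant half-width:", q)
    # Dependence between inputs and absolute residuals, measured by HSIC.
    print("HSIC(x, |residual|):", hsic_biased(x_cal, np.abs(y_cal)))
```

In this toy setting the band has the same width everywhere even though the noise level grows with x. An adaptive method would instead widen the band where the residuals are larger, and a dependence measure such as HSIC is one way to quantify how much structure a constant-width band leaves unexplained.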
