Scalable and adaptive prediction bands with kernel sum-of-squares

Louis Allain, Sébastien da Veiga, Brian Staber

TL;DR

This work tackles finite-sample, distribution-free uncertainty quantification for regression by making conformal prediction adaptive and scalable. It introduces a generalized kernel sum-of-squares framework that learns a CP score via two RKHSs, yielding prediction bands whose width adapts to data and model confidence, while maintaining marginal coverage. A representer theorem and a dual formulation enable efficient optimization on large datasets, and a novel HSIC-based criterion provides robust local-coverage-driven hyperparameter tuning for adaptivity. The approach scales to thousands of samples and outperforms or matches existing adaptive CP methods, with practical significance for high-stakes decision making where reliable uncertainty quantification and adaptivity are crucial.

Abstract

Conformal Prediction (CP) is a popular framework for constructing prediction bands with valid coverage in finite samples, while being free of any distributional assumption. A well-known limitation of conformal prediction is the lack of adaptivity, although several works have introduced practically efficient alternative procedures. In this work, we build upon recent ideas that rely on recasting the CP problem as a statistical learning problem, directly targeting coverage and adaptivity. This statistical learning problem is based on reproducing kernel Hilbert spaces (RKHS) and kernel sum-of-squares (SoS) methods. First, we extend previous results with a general representer theorem and exhibit the dual formulation of the learning problem. Crucially, this dual formulation can be solved efficiently by accelerated gradient methods with several hundred or several thousand samples, unlike previous strategies based on off-the-shelf semidefinite programming algorithms. Second, we introduce a new hyperparameter tuning strategy tailored specifically to target adaptivity through bounds on test-conditional coverage. This strategy, based on the Hilbert-Schmidt Independence Criterion (HSIC), is introduced here to tune kernel lengthscales in our framework, but has broader applicability since it could be used in any CP algorithm where the score function is learned. Finally, extensive experiments are conducted to show how our method compares to related work. All figures can be reproduced with the accompanying code.
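For readers new to CP, the baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of standard split conformal prediction with an absolute-residual score, not the kernel SoS method of the paper. The predictor, the toy data, and the function name `split_conformal_quantile` are all assumptions made for the example. The resulting band has constant width, which is the lack of adaptivity mentioned above.

```python
import numpy as np

def split_conformal_quantile(scores, alpha):
    """Finite-sample corrected quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or +inf
    when the calibration set is too small for the requested level.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(scores)[k - 1]

def predictor(x):
    # Hypothetical pre-trained mean predictor (assumption for this sketch).
    return np.sin(x)

def sample(rng, size):
    # Toy heteroscedastic data: noise level grows with x.
    x = rng.uniform(0.0, 3.0, size)
    y = np.sin(x) + 0.1 * (1.0 + x) * rng.standard_normal(size)
    return x, y

rng = np.random.default_rng(0)
x_cal, y_cal = sample(rng, 500)
x_test, y_test = sample(rng, 2000)

# Calibration: absolute residuals as conformity scores.
q = split_conformal_quantile(np.abs(y_cal - predictor(x_cal)), alpha=0.1)

# Band [predictor(x) - q, predictor(x) + q]: same width for every x.
covered = np.abs(y_test - predictor(x_test)) <= q
coverage = covered.mean()
print(f"half-width q = {q:.3f}, empirical marginal coverage = {coverage:.3f}")
```

The marginal coverage should land near the 90% target, but because the half-width `q` does not depend on `x`, the band is too wide where the noise is small and too narrow where it is large.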

Paper Structure

This paper contains 40 sections, 6 theorems, 66 equations, 17 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Assume $L\colon \mathbb{R}^{n}\rightarrow \mathbb{R}\cup\{+\infty\}$ is a lower semi-continuous loss function that is bounded below. Then the general problem of Marteau-Ferey et al. admits a solution $\mathcal{A}^{\star}$, which can be written $\mathcal{A}^{\star} = \sum_{i,j=1}^{n}B^{\star}_{ij}\phi(X_i)\phi(X\cdots$
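To give some intuition for why a finite matrix of coefficients is useful, the sketch below evaluates a kernel sum-of-squares function built from $n$ anchor points. It assumes the usual kernel SoS form, in which the function is $x \mapsto \sum_{i,j} B_{ij}\, k(X_i, x)\, k(X_j, x)$ with $B$ positive semidefinite, so the value is non-negative everywhere. The Gaussian kernel, the random matrix `B`, and the helper names are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale):
    """Gaussian kernel matrix between 1-D point sets a (m,) and b (n,)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def sos_eval(x_new, anchors, B, lengthscale=0.3):
    """Evaluate f(x) = k(x)^T B k(x), with k(x)_i = k(X_i, x)."""
    K = gaussian_kernel(x_new, anchors, lengthscale)  # shape (m, n)
    return np.einsum("mi,ij,mj->m", K, B, K)

rng = np.random.default_rng(0)
anchors = rng.uniform(-1.0, 1.0, 20)   # stand-ins for X_1, ..., X_n
M = rng.standard_normal((20, 20))
B = M @ M.T                            # positive semidefinite by construction

x_grid = np.linspace(-2.0, 2.0, 200)
f_vals = sos_eval(x_grid, anchors, B)
print("min of f on the grid:", f_vals.min())
```

Because `B` is positive semidefinite, each value is a quadratic form that cannot be negative, which is what makes this family convenient for modelling quantities such as band widths.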

Figures (17)

  • Figure 1: Marginal impact of hyperparameters $a$, $b$, $\lambda_1$ and $\lambda_2$ over several values of $\theta^f$, test cases and random seeds on RMSE, mean interval width and regularization norms.
  • Figure 2: Test case 2 with $n=100$. HSIC (left) and MI (right) criteria between $r(X,Y)$ and $f(X)$ as a function of $b$ and $\theta^f$ (confidence intervals obtained by bootstrap and optimal values of $\theta^f$ in dashed lines).
  • Figure 3: Test case 2 with $n=100$. Left: HSIC criterion between $r(X,Y)$ and $f(X)$ as a function of $b$ and $\theta^f$ (confidence intervals obtained by bootstrap and optimal values of $\theta^f$ in dashed lines). Middle / Right: optimal prediction bands with too small and optimized lengthscale, respectively.
  • Figure 4: Test case 1 with $n=100$. Adaptivity metrics and density of local coverage.
  • Figure 5: Test case 2 with $n=100$. Adaptivity metrics and density of local coverage.
  • ...and 12 more figures
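
Several captions above refer to an HSIC criterion between $r(X,Y)$ and $f(X)$. As a rough sketch of what such a quantity measures, the code below computes the standard biased empirical HSIC, $\frac{1}{n^2}\operatorname{tr}(KHLH)$, between two one-dimensional samples with Gaussian kernels. The lengthscales, the toy data, and the names `hsic_biased`, `u`, `v_dep`, `v_ind` are assumptions for illustration. How the paper turns this criterion into a tuning rule is not reproduced here.

```python
import numpy as np

def gaussian_gram(z, lengthscale):
    """Gaussian Gram matrix of a 1-D sample z of shape (n,)."""
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def hsic_biased(u, v, ls_u=1.0, ls_v=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    n = len(u)
    K = gaussian_gram(np.asarray(u, dtype=float), ls_u)
    L = gaussian_gram(np.asarray(v, dtype=float), ls_v)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(0)
u = rng.standard_normal(300)
v_dep = u ** 2 + 0.1 * rng.standard_normal(300)  # depends on u, zero linear correlation in expectation
v_ind = rng.standard_normal(300)                 # independent of u

hsic_dep = hsic_biased(u, v_dep)
hsic_ind = hsic_biased(u, v_ind)
print(f"HSIC dependent: {hsic_dep:.4f}, HSIC independent: {hsic_ind:.4f}")
```

The dependent pair should give a clearly larger value than the independent pair, even though the relationship is purely nonlinear, which is the property that makes HSIC suitable as a dependence measure.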

Theorems & Definitions (11)

  • Theorem 1 (Marteau-Ferey et al., 2020)
  • Proposition 1 (Marteau-Ferey et al., 2020)
  • Remark 1
  • Theorem 2: Representer theorem
  • Proposition 2: Dual formulation
  • Proposition 3
  • Proposition 4
  • Remark 2
  • Definition 1: Maximum Mean Discrepancy (Smola et al., 2007)
  • Remark 3
  • ...and 1 more