Semiparametric conformal prediction

Ji Won Park, Robert Tibshirani, Kyunghyun Cho

TL;DR

The paper tackles the challenge of constructing valid prediction sets for multi-target regression by modeling the joint distribution of vector non-conformity scores with nonparametric vine copulas and applying a semiparametric one-step correction to the $1-\alpha$ quantile. This yields prediction sets with asymptotically exact coverage and robustness to missing-at-random labels, while maintaining competitive efficiency. The approach combines copula-based density estimation with efficient influence-function theory to debias the target quantile, and provides both theoretical guarantees and empirical demonstrations on synthetic and real datasets. The vine decomposition lets the framework scale to a large number of targets, which makes it applicable to risk-sensitive settings where correlated prediction errors matter.

Abstract

Many risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables, for which the prediction algorithm may report correlated errors. In this work, we aim to construct the conformal prediction set accounting for the joint correlation structure of the vector-valued non-conformity scores. Drawing from the rich literature on multivariate quantiles and semiparametric statistics, we propose an algorithm to estimate the $1-\alpha$ quantile of the scores, where $\alpha$ is the user-specified miscoverage rate. In particular, we flexibly estimate the joint cumulative distribution function (CDF) of the scores using nonparametric vine copulas and improve the asymptotic efficiency of the quantile estimate using its influence function. The vine decomposition allows our method to scale well to a large number of targets. As well as guaranteeing asymptotically exact coverage, our method yields desired coverage and competitive efficiency on a range of real-world regression problems, including those with missing-at-random labels in the calibration set.
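The influence-function step mentioned in the abstract can be illustrated in the simplest univariate case, the setting described in the Figure 4 caption. The sketch below is not the paper's algorithm: it assumes a Gaussian kernel density estimate with a Silverman-style bandwidth in place of vine copulas, and it uses the standard influence function of a univariate quantile, $(\tau - \mathbb{1}\{s \le q\}) / f(q)$. All function names are hypothetical.

```python
import math
import random
import statistics

def kde_cdf(x, samples, h):
    """Gaussian-kernel estimate of the CDF at x (bandwidth h)."""
    return sum(0.5 * (1.0 + math.erf((x - s) / (h * math.sqrt(2.0))))
               for s in samples) / len(samples)

def kde_pdf(x, samples, h):
    """Gaussian-kernel estimate of the density at x (bandwidth h)."""
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm

def plugin_quantile(samples, level, h):
    """Plug-in quantile: invert the KDE CDF by bisection."""
    lo, hi = min(samples) - 10.0 * h, max(samples) + 10.0 * h
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def one_step_quantile(samples, level, h):
    """Return (plug-in quantile, one-step corrected quantile)."""
    q = plugin_quantile(samples, level, h)
    # Empirical mean of the quantile's influence function, evaluated at the
    # plug-in estimate: (level - fraction of samples <= q) / estimated density.
    frac_below = sum(s <= q for s in samples) / len(samples)
    correction = (level - frac_below) / kde_pdf(q, samples, h)
    return q, q + correction

# Illustrative data (an assumption): 20 draws from an exponential distribution.
random.seed(0)
scores = [random.expovariate(1.0) for _ in range(20)]
level = 0.9
h = 1.06 * statistics.stdev(scores) * len(scores) ** (-0.2)  # Silverman-style rule
q_plugin, q_one_step = one_step_quantile(scores, level, h)
print(f"plug-in: {q_plugin:.3f}, one-step: {q_one_step:.3f}")
```

The correction is zero when the fraction of calibration scores at or below the plug-in quantile already equals the target level, and otherwise moves the estimate in the direction that reduces that mismatch.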

Paper Structure

This paper contains 35 sections, 5 theorems, 90 equations, 12 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

If random vector $S$ has a joint CDF $F$ and marginal CDFs $F_1, \dotsc, F_d$, there exists a copula $C: [0, 1]^d \to [0, 1]$ such that $F(s_1, \dotsc, s_d) = C(F_1(s_1), \dotsc, F_d(s_d))$. The copula is unique if $F_1, \dotsc, F_d$ are continuous. The associated copula density is $f(u_1, \dotsc,

Figures (12)

  • Figure 1: Top: Single-target split CP. Green dashed line in (a) marks the $1-\alpha$ quantile of calibration scores, setting the confidence interval around test predictions in (b). Bottom: Independent split CP for two targets, with $\sqrt{1-\alpha}$ quantiles marked by green dashed lines in (c). Large prediction sets for two test instances shown in (d).
  • Figure 2: Scores with an upper tail correlation and long-tailed marginals. The joint 0.9-quantile of the scores can yield more efficient prediction sets than the independent $\sqrt{0.9}$ quantiles of the scores (green dot). (a, d): Marginal empirical CDF of the scores, from which the $\sqrt{0.9}$ quantiles are extracted. (b): Scores overlaid with level curves of the joint CDF, with the black curve representing the $0.9$ level. (c): Version of (b) viewed in the copula space.
  • Figure 3: One-step corrections visualized as linear projections. With the true target estimand being $\Psi(P)$, the green solid curve is the bias $\Psi(P_t) - \Psi(P)$, with $P_t$ defined in \ref{eq:parametric_submodel}. The one-step estimator (purple solid line) can be viewed as approximating the slope of its tangent at $t=1$ (green dotted line) using the empirical distribution.
  • Figure 4: Predicting the 0.9 quantile of a distribution based on 20 samples. Standard CP uses the empirical estimate (dashed black). Compared to the plug-in estimate from KDE (dashed purple), the one-step corrected estimate (solid purple) is closer to the true quantile (dashed green).
  • Figure 5: One-step correction improves the coverage of the plug-in estimate at all target levels for the penicillin dataset. Error bars show the standard deviation across ten random seeds.
  • ...and 7 more figures
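The independent split CP baseline described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: scores are taken to be per-target absolute residuals, the two-target $\sqrt{1-\alpha}$ level is generalized to $(1-\alpha)^{1/d}$ for $d$ targets, and the usual finite-sample rank $\lceil (n+1)\,\text{level} \rceil$ is used for the empirical quantile. Function names are hypothetical.

```python
import math

def conformal_quantile(scores, level):
    """Empirical quantile with the usual split-CP finite-sample adjustment."""
    n = len(scores)
    k = math.ceil((n + 1) * level)  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points for this level
    return sorted(scores)[k - 1]

def independent_split_cp(cal_scores, alpha):
    """Per-target thresholds; cal_scores is a list of d-dimensional score vectors."""
    d = len(cal_scores[0])
    # If the d scores were independent, covering each target at this level
    # would give joint coverage (1 - alpha).
    per_target_level = (1.0 - alpha) ** (1.0 / d)
    return [conformal_quantile([s[j] for s in cal_scores], per_target_level)
            for j in range(d)]

def prediction_box(y_hat, thresholds):
    """Axis-aligned prediction set around a vector of point predictions."""
    return [(m - q, m + q) for m, q in zip(y_hat, thresholds)]

# Illustrative calibration scores (an assumption): two targets, 100 points.
cal_scores = [(i / 100, 2 * i / 100) for i in range(1, 101)]
thresholds = independent_split_cp(cal_scores, alpha=0.19)
box = prediction_box([1.0, 2.0], thresholds)
print(thresholds, box)
```

Because each target is thresholded at a level above $1-\alpha$, the resulting box can be larger than necessary when the scores are correlated, which is the inefficiency a joint quantile is meant to address.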

Theorems & Definitions (12)

  • Definition 1: Marginal validity
  • Definition 2: Copula
  • Theorem 1: Sklar's theorem (Sklar, 1959)
  • Definition 3: Efficient influence function
  • Definition 4
  • Theorem 2: Asymptotic consistency
  • Theorem 3: Approximate validity
  • Theorem 4: Existence of a regular vine distribution (Bedford and Cooke, 2002)
  • proof
  • proof
  • ...and 2 more
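
The copula decomposition in Sklar's theorem (Theorem 1) has an empirical counterpart that is easy to check numerically. The sketch below maps scores to the copula space through their marginal empirical CDFs and confirms that the empirical joint CDF equals the empirical copula evaluated at the marginal CDF values. It uses the empirical copula purely for illustration, whereas the paper estimates the copula with nonparametric vines; the data-generating process and all function names are assumptions.

```python
import random

def ecdf(values, x):
    """Marginal empirical CDF of `values` evaluated at x."""
    return sum(v <= x for v in values) / len(values)

def empirical_joint_cdf(scores, point):
    """Fraction of score vectors that are componentwise <= point."""
    return sum(all(s_j <= p_j for s_j, p_j in zip(s, point))
               for s in scores) / len(scores)

def to_copula_space(scores):
    """Map each score vector to its vector of marginal ECDF values."""
    d = len(scores[0])
    columns = [[s[j] for s in scores] for j in range(d)]
    return [tuple(ecdf(columns[j], s[j]) for j in range(d)) for s in scores]

def empirical_copula(pseudo_obs, u):
    """Fraction of pseudo-observations that are componentwise <= u."""
    return sum(all(a <= b for a, b in zip(obs, u))
               for obs in pseudo_obs) / len(pseudo_obs)

# Illustrative correlated scores (an assumption): a shared noise term
# induces positive dependence between the two targets.
random.seed(0)
scores = []
for _ in range(200):
    z = random.gauss(0.0, 1.0)
    scores.append((abs(z + random.gauss(0.0, 0.5)),
                   abs(z + random.gauss(0.0, 0.5))))

pseudo_obs = to_copula_space(scores)
point = (1.0, 1.2)
u = tuple(ecdf([s[j] for s in scores], point[j]) for j in range(2))
joint = empirical_joint_cdf(scores, point)
via_copula = empirical_copula(pseudo_obs, u)
print(joint, via_copula)
```

The two printed values agree because a score is at or below a threshold exactly when its marginal ECDF value is at or below the ECDF value of that threshold.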