Conformal Selective Prediction with General Risk Control

Tian Bai, Ying Jin

Abstract

In deploying artificial intelligence (AI) models, selective prediction offers the option to abstain from making a prediction when uncertain about model quality. To fulfill its promise, it is crucial to enforce strict and precise error control over cases where the model is trusted. We propose Selective Conformal Risk control with E-values (SCoRE), a new framework for deriving such decisions for any trained model and any user-defined, bounded, continuously-valued risk. SCoRE offers two types of guarantees on the risk among "positive" cases in which the system opts to trust the model. Built upon ideas from conformal inference and hypothesis testing, SCoRE first constructs a class of (generalized) e-values: non-negative random variables whose product with the unknown risk has expectation no greater than one. This property is ensured by data exchangeability alone, without any modeling assumptions. Passing these e-values on to hypothesis testing procedures, we obtain binary trust decisions with finite-sample error control. SCoRE avoids the need for uniform concentration and readily extends to settings with distribution shifts. We evaluate the proposed methods with simulations and demonstrate their efficacy through applications to error management in drug discovery, health risk prediction, and large language models.

Paper Structure

This paper contains 90 sections, 22 theorems, 196 equations, 20 figures, 1 table, 5 algorithms.

Key Result

Theorem 3.2

Suppose $E_{n+j}$ obeys Definition 3.1 (risk-adjusted e-value). Setting the trust decision as $\hat{\psi}_{n+j} = \mathds{1}\{E_{n+j}\geq 1/\alpha\}$ yields the marginal risk control: $\mathbb{E}[L_{n+j}\cdot \hat{\psi}_{n+j}]\leq \alpha$.
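The guarantee in Theorem 3.2 follows a Markov-type argument: if $\mathbb{E}[L\cdot E]\leq 1$, then trusting only when $E\geq 1/\alpha$ forces $L\cdot\hat{\psi}\leq \alpha\, L\cdot E$, so $\mathbb{E}[L\cdot\hat{\psi}]\leq\alpha$. The following is a minimal Monte Carlo sketch of this mechanism with a hypothetical oracle e-value built from the true loss for illustration only; in SCoRE the e-values are instead constructed from calibration data via conformal inference.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.3       # target marginal risk level
m = 200_000       # number of simulated test cases

# Hypothetical losses in [0, 1].
L = rng.uniform(0.0, 1.0, size=m)

# Oracle e-values (illustrative only): E = 6 * (1 - L) satisfies
# E[L * E] = 6 * (E[L] - E[L^2]) = 6 * (1/2 - 1/3) = 1.
E = 6.0 * (1.0 - L)

# Trust decision from Theorem 3.2: psi = 1{E >= 1/alpha}.
psi = (E >= 1.0 / alpha).astype(float)

# Marginal risk among trusted cases is controlled at level alpha.
print(f"E[L * E]   = {np.mean(L * E):.3f}  (at most ~1)")
print(f"E[L * psi] = {np.mean(L * psi):.3f}  (target <= {alpha})")
```

The decision rule itself is just a threshold on the e-value; all the statistical work lies in constructing e-values whose product with the loss has expectation at most one, which the paper achieves from data exchangeability alone.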

Figures (20)

  • Figure 1: Application of SCoRE. (a) Drug discovery. Left: given predictions of an unknown drug binding affinity $Y_{n+1}$, SCoRE controls the average cost $L_{n+1}\mathds{1}\{Y_{n+1}\leq c\}$ among the selected compounds. Right: in a real drug discovery dataset, the average cost among selected candidates (red dots below activity threshold) is below $\alpha=1$. (b) Clinical prediction. Left: SCoRE identifies predictions of health outcomes with small error $f(X_{n+1})\approx Y_{n+1}$ with MDR control, ensuring a low total squared error in deployment. Right: selection results in a semi-synthetic dataset (upper), and mean squared error per day when 50 patients await predictions every day (lower).
  • Figure 2: Visualization of the SCoRE workflow. Starting with any model outputs for unlabeled test points and a score that estimates the deployment risks, we use a set of calibration data to construct a risk-adjusted e-value for every test sample, and pass them on to hypothesis testing procedures and select test samples with reliable prediction.
  • Figure 3: SCoRE for selecting drugs with cost efficiency under covariate shift. (a) Overview: Given predicted drug activities, the goal is to identify highly active drugs with cost wastage control; SCoRE provides MDR and SDR guarantees among shortlisted drug candidates. (b) MDR control: realized MDR at various target levels in the original scale (left), total reward of selected drugs, number of selected drugs (right). (c) SDR control: realized SDR at various target levels (left), total reward of selected drugs (middle), number of selected drugs (right).
  • Figure 4: SCoRE for identifying accurate ICU stay time prediction. (a) Overview: Given model predictions, the goal is to identify predictions that are close to the unknown ICU stay time; SCoRE provides MDR and SDR guarantees among identified cases. (b) MDR control: realized MDR at various target levels (left), total reward (stay time) of deployed units, scaled by $1/m$ (middle), number of deployed units (right). (c) SDR control: realized SDR at various target levels (left), total reward of deployed units (middle), number of deployed units (right).
  • Figure 5: SCoRE for identifying semantically coherent AI-generated radiology report. (a) Overview: The goal is to identify reports close to human-expert reports; SCoRE provides MDR and SDR guarantees among identified reports. (b) MDR control: realized MDR at various target levels (left), total quality-based reward of deployed units, scaled by $1/m$ for readability (middle), number of deployed units (right). (c) SDR control: realized SDR at various target levels (left), total quality-based reward of deployed units (middle), number of deployed units (right).
  • ...and 15 more figures

Theorems & Definitions (30)

  • Definition 3.1: Risk-adjusted e-value
  • Theorem 3.2
  • Theorem 3.3
  • Remark 4.1
  • Theorem 4.2
  • Remark 4.3
  • Proposition 4.4
  • Remark 4.5
  • Theorem 4.6
  • Theorem 5.1
  • ...and 20 more