Conformal Tradeoffs: Guarantees Beyond Coverage

Petrus H. Zwart

Conformal Tradeoffs: Guarantees Beyond Coverage

Petrus H. Zwart

TL;DR

The paper addresses the gap where marginal coverage from deployed conformal predictors fails to specify deployment behavior over finite windows. It introduces a calibration-conditional framework with Small-Sample Beta Correction (SSBC) to map user coverage requests to explicit calibration settings, and a Calibrate-and-Audit design to produce auditable, finite-window envelopes for operational KPIs via a region--label table. It analyzes the geometry of fixed conformal partitions to reveal regime boundaries and cost-coherence constraints that govern feasible trade-offs between commitment, deferral, and error exposure. The approach is validated on Tox21 toxicity endpoints and AquaSolDB solubility planning, demonstrating practical tools for deployment-focused planning and risk management beyond mere coverage guarantees.

Abstract

Deployed conformal predictors are long-lived decision infrastructure operating over finite operational windows. The real-world question is not only ``Does the true label lie in the prediction set at the target rate?'' (marginal coverage), but ``How often does the system commit versus defer? What error exposure does it induce when it acts? How do these rates trade off?'' Marginal coverage does not determine these deployment-facing quantities: the same calibrated thresholds can yield different operational profiles depending on score geometry. We provide a framework for operational certification and planning beyond coverage with three contributions. (1) Small-Sample Beta Correction (SSBC): we invert the exact finite-sample Beta/rank law for split conformal to map a user request $(α^\star,δ)$ to a calibrated grid point with PAC-style semantics, yielding explicit finite-window coverage guarantees. (2) Calibrate-and-Audit: since no distribution-free pivot exists for rates beyond coverage, we introduce a two-stage design in which an independent audit set produces a reusable region -- label table and certified finite-window envelopes (Binomial/Beta-Binomial) for operational quantities -- commitment frequency, deferral, decisive error exposure, and commit purity -- via linear projection. (3) Geometric characterization: we describe feasibility constraints, regime boundaries (hedging vs.\ rejection), and cost-coherence conditions induced by a fixed conformal partition, explaining why operational rates are coupled and how calibration navigates their trade-offs. The output is an auditable operational menu: for a fixed scoring model, we trace attainable operational profiles across calibration settings and attach finite-window uncertainty envelopes. We demonstrate the approach on Tox21 toxicity prediction (12 endpoints) and aqueous solubility screening using AquaSolDB.

Conformal Tradeoffs: Guarantees Beyond Coverage

TL;DR

Abstract

to a calibrated grid point with PAC-style semantics, yielding explicit finite-window coverage guarantees. (2) Calibrate-and-Audit: since no distribution-free pivot exists for rates beyond coverage, we introduce a two-stage design in which an independent audit set produces a reusable region -- label table and certified finite-window envelopes (Binomial/Beta-Binomial) for operational quantities -- commitment frequency, deferral, decisive error exposure, and commit purity -- via linear projection. (3) Geometric characterization: we describe feasibility constraints, regime boundaries (hedging vs.\ rejection), and cost-coherence conditions induced by a fixed conformal partition, explaining why operational rates are coupled and how calibration navigates their trade-offs. The output is an auditable operational menu: for a fixed scoring model, we trace attainable operational profiles across calibration settings and attach finite-window uncertainty envelopes. We demonstrate the approach on Tox21 toxicity prediction (12 endpoints) and aqueous solubility screening using AquaSolDB.

Paper Structure (139 sections, 3 theorems, 120 equations, 6 figures, 9 tables, 1 algorithm)

This paper contains 139 sections, 3 theorems, 120 equations, 6 figures, 9 tables, 1 algorithm.

Introduction: deployment-facing conformal prediction
The gap: coverage does not determine operational behavior
Our approach: calibrate, expose geometry, and audit once
Contributions
(1) Coverage semantics via Small-Sample Beta Correction (SSBC).
(2) Calibrate-and-Audit for operational certification beyond coverage.
(3) Geometry and feasibility of operational trade-offs.
Positioning and related work
Roadmap
Setting and notation: calibration-conditional viewpoint
Data splits and exchangeability assumptions
Scores and calibration thresholds
Regions, observability, and policies
Auditing primitives: region--label tables and policy projections
(A) Region--label joint table.
...and 124 more sections

Key Result

Proposition 2

Fix $n_{\mathrm{cal}}$, $\alpha^\star$, and $\delta$. Let $u_m$ be the SSBC-selected index obtained by inverting the Beta--Binomial tail for window size $m$, and let $u_\infty$ be the SSBC-selected index obtained by inverting the Beta tail. Then

Figures (6)

Figure 1: Operational menu induced by a calibrated conformal rule on a synthetic model. Left: a calibration choice fixes score cutpoints that partition score space into set-valued outcomes (e.g., singleton, hedge, abstain) and associated region--label probabilities. A deployment convention determines which outcomes trigger commitment versus deferral. Right: sweeping calibration settings traces the attainable set of selected operational rates; the oriented Pareto frontier highlights nondominated operating regimes.
Figure 2: Region definitions and region mass under different data regimes. (A) Thresholds $\tau_0,\tau_1$ induce a finite region partition $R_\tau(x)$ independent of data and policy. (B--C) Under probability-normalized scores, data support is restricted to the diagonal, determining which regions carry nonzero mass for different threshold configurations. (D) With unconstrained scores, all regions may carry mass. No deployment policy is applied in this figure.
Figure 3: Deployment policies as projections on a fixed region structure. Panel A shows the region partition $R_\tau(x)$. Panels B--D apply different region-based deployment policies $\pi$ to the same regions and data support, yielding reported outputs $\hat{C}_\pi(x)=\pi(R_\tau(x))$. Differences between panels arise solely from the choice of policy.
Figure 4: Operational rate envelopes and underlying score distributions. (A) Singleton rate and (B) singleton error for two class prevalences ($p_{\mathrm{class}}=0.10, 0.50$) and two class 1 generating distributions ($\mathrm{Beta}(4,3)$ and $\mathrm{Beta}(9,3)$), with class 0 generated by $\mathrm{Beta}(2,7)$. Red rectangles denote the two-sample Calibrate--and--Audit Beta--Binomial (BB) predictive envelope, and the dashed vertical line is the corresponding BB point estimate. Blue and orange markers with horizontal intervals show leave-one-out (LOO) envelopes computed from a single calibration dataset under two inflation levels ($\mathrm{infl}=1,2$). (C1--C4) Histograms of the score geometry shown as predicted class 1 probability (equivalently, nonconformity to class 0) stratified by true class. These distributions make explicit how class prevalence and calibration geometry shape singleton mass and singleton error, and explain the regime-dependent asymmetry of the finite-sample envelopes.
Figure 5: Solubility scenario planning via conformal operating regimes and cost-coherence. Left: attainable joint-rate map induced by sweeping SSBC calibration settings (after deduplication). Axes report joint rates on soluble compounds: irreversible exclusion $P(Y=\mathrm{Sol}, C=\{\mathrm{Insol}\})$ (x-axis) and deferral burden $P(Y=\mathrm{Sol}, C=\{\mathrm{Sol},\mathrm{Insol}\})$ (y-axis). Color encodes the joint decisive-correct rate $P(Y=\mathrm{Sol}, C=\{\mathrm{Sol}\})$. Red rings mark Pareto-optimal operating points; thick rings mark unique KPI regimes, labeled by regime_id:multiplicity. Right: cost-coherence landscape for the fixed downstream convention $\{\mathrm{Sol}\}\!\to\!\mathrm{Sol},\ \{\mathrm{Insol}\}\!\to\!\mathrm{Insol},\ \{\mathrm{Sol},\mathrm{Insol}\}\!\to\!\mathrm{rej}$, parameterized by cost ratios $\lambda=c_{01}/c_{10}$ (irreversible loss / downstream waste) and $\rho=c_{\mathrm{rej}}/c_{10}$ (deferral / downstream waste). The heatmap shows the fraction of Pareto regimes for which this convention is cost-coherent given only the conformal output (via within-region label composition). Blue and red outlines give the union and intersection of feasible $(\lambda,\rho)$ regions across the Pareto front; thin outlines show representative wedges for the unique KPI regimes.
...and 1 more figures

Theorems & Definitions (6)

Remark 1: Discrete scores and ties
Proposition 2: Infinite-window limit of SSBC
Lemma 3: Order-statistic coupling under reuse
proof
Proposition 4: Regime boundary under probability normalization
proof

Conformal Tradeoffs: Guarantees Beyond Coverage

TL;DR

Abstract

Conformal Tradeoffs: Guarantees Beyond Coverage

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (6)

Theorems & Definitions (6)