Cross-Domain Uncertainty Quantification for Selective Prediction: A Comprehensive Bound Ablation with Transfer-Informed Betting

Abhinaba Basu

TL;DR

The paper proves that the Transfer-Informed Betting (TIB) wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence.

Abstract

We present a comprehensive ablation of nine finite-sample bound families for selective prediction with risk control, combining concentration inequalities (Hoeffding, Empirical Bernstein, Clopper-Pearson, Wasserstein DRO, CVaR) with multiple-testing corrections (union bound, Learn Then Test fixed-sequence) and betting-based confidence sequences (WSR). Our main theoretical contribution is Transfer-Informed Betting (TIB), which warm-starts the WSR wealth process using a source domain's risk profile, achieving tighter bounds in data-scarce settings with a formal dominance guarantee. We prove that the TIB wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence. The combination of betting-based confidence sequences, LTT monotone testing, and cross-domain transfer is, to our knowledge, a three-way novelty not present in the literature. We evaluate all nine bound families across 18 (alpha, delta) configurations on four benchmarks: MASSIVE (n=1,102), NyayaBench (n=280), CLINC-150 (n=22.5K), and Banking77 (n=13K). On MASSIVE at alpha=0.10, LTT eliminates the ln(K) union-bound penalty, achieving 94.0% guaranteed coverage versus 73.8% for Hoeffding, a 27% relative improvement. On NyayaBench, where the small calibration set makes Hoeffding-family bounds infeasible below alpha=0.20, Transfer-Informed Betting achieves 18.5% coverage at alpha=0.10, a 5.4x improvement over LTT + Hoeffding. We additionally compare with split-conformal prediction, showing that conformal methods produce prediction sets (avg. 1.67 classes) whereas selective prediction provides single-prediction risk guarantees. We apply these methods to agentic caching systems, formalizing a progressive trust model where the guarantee determines when cached responses can be served autonomously.
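To make the betting-based bound in the abstract more concrete, here is a minimal Python sketch of a WSR-style upper confidence bound on the mean of losses in [0, 1]. It is an illustration under stated assumptions, not the paper's implementation: the function name `betting_upper_bound`, the `prior_mean` and `prior_var` parameters, the grid search, and the source risk of 0.05 in the usage lines are all hypothetical. Seeding the running estimates with a source-domain risk is only one way a warm-start could enter the bets; the paper's own TIB construction (Definition 3) may differ.

```python
import math

def betting_upper_bound(losses, delta, prior_mean=0.5, prior_var=0.25, grid=1000):
    """Upper confidence bound on the mean of i.i.d. losses in [0, 1].

    Returns the smallest grid value R at which the betting wealth against
    the hypothesis "true risk = R" exceeds 1/delta (Ville's inequality).
    prior_mean and prior_var (assumed parameters) seed the running estimates
    that size the bets; each bet uses only the prior and earlier losses,
    so the bets stay predictable whatever the prior is.
    """
    n = len(losses)
    log_term = 2.0 * math.log(1.0 / delta)
    bets = []
    total, sq_dev = prior_mean, prior_var
    for i, loss in enumerate(losses, start=1):
        # Variance estimate built from the prior and losses[:i-1] only.
        var_prev = max(sq_dev / i, 1e-6)
        bets.append(min(1.0, math.sqrt(log_term / (n * var_prev))))
        total += loss
        sq_dev += (loss - total / (i + 1)) ** 2
    # Wealth is nondecreasing in R, so the first crossing on the grid is the bound.
    for k in range(grid + 1):
        risk = k / grid
        wealth = 1.0
        for loss, bet in zip(losses, bets):
            wealth *= 1.0 - bet * (loss - risk)  # factor lies in [0, 2]
            if wealth > 1.0 / delta:
                return risk
    return 1.0

# Usage on a small, error-free calibration set (illustrative data).
target_losses = [0.0] * 30
cold = betting_upper_bound(target_losses, delta=0.10)
# Hypothetical warm-start: source-domain risk of 0.05 with Bernoulli variance.
warm = betting_upper_bound(target_losses, delta=0.10,
                           prior_mean=0.05, prior_var=0.05 * 0.95)
print(cold, warm)
```

In this sketch the prior only changes how aggressively the early bets are placed. The wealth factor stays nonnegative and has expectation one at the true risk for any prior, which is the property that keeps the bound valid when source and target differ.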

Paper Structure (28 sections, 4 theorems, 21 equations, 6 figures, 5 tables)

Introduction
Problem Formulation
Agent Caching as Selective Prediction
RCPS Framework
Bound Families
Hoeffding + Union Bound (Baseline)
Empirical Bernstein + Union Bound
LTT Fixed-Sequence Testing
Wasserstein DRO
CVaR Tail-Risk Bounds
Clopper-Pearson Exact Binomial + LTT
Betting-Based Bounds
PAC-Bayes Cross-Domain Transfer
Formulation
Cross-Domain Transfer for Agent Caching
...and 13 more sections

Key Result

Proposition 1

Let $C(n, \delta)$ be a finite-sample correction satisfying $\Pr[\hat{R}(\tau) + C(n, \delta) \geq R(\tau)] \geq 1 - \delta$ for a single $\tau$, and let $\tau_K > \tau_{K-1} > \cdots > \tau_1$ be tested in decreasing order using a valid multiple-testing correction. Then the threshold $\tau^*$ selected by this procedure satisfies $\Pr[R(\tau^*) \leq \alpha] \geq 1 - \delta$.
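Proposition 1 can be read as a stopping rule: walk the thresholds from the most conservative downward and keep the last one whose corrected empirical risk stays at or below $\alpha$. The Python sketch below illustrates that reading under several assumptions that are not taken from the source: the correction is the one-sided Hoeffding term, the loss is 1 when an accepted prediction is wrong (averaged over all $n$ calibration points), and fixed-sequence testing is used so that each test runs at the full level $\delta$. The function names are hypothetical.

```python
import math

def hoeffding_correction(n, delta):
    """One-sided Hoeffding correction for the mean of n losses in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def fixed_sequence_threshold(conf, wrong, thresholds, alpha, delta):
    """Return the smallest threshold certified before the first failed test.

    conf[i] is the model confidence and wrong[i] is True when prediction i
    is incorrect. Returns None when even the strictest threshold fails.
    """
    n = len(conf)
    correction = hoeffding_correction(n, delta)
    tau_star = None
    for tau in sorted(thresholds, reverse=True):
        # Assumed loss: an accepted prediction (conf >= tau) that is wrong.
        risk_hat = sum(1 for s, w in zip(conf, wrong) if s >= tau and w) / n
        if risk_hat + correction > alpha:
            break  # fixed-sequence testing stops at the first failure
        tau_star = tau
    return tau_star

# Synthetic calibration set: every prediction with confidence below 0.5 is wrong.
conf = [i / 200 for i in range(200)]
wrong = [i < 100 for i in range(200)]
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(fixed_sequence_threshold(conf, wrong, grid, alpha=0.10, delta=0.10))  # 0.5
```

Stopping at the first failure is what allows each test to use the full $\delta$; a union bound over all $K$ thresholds would instead replace $\delta$ with $\delta / K$ inside the correction.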

Figures (6)

Figure 1: Reliability diagrams for SetFit confidence calibration. (a) MASSIVE: bars show pre-calibration binned accuracy; dots show post-temperature-scaling. (b) NyayaBench v2: higher ECE reflects the harder 20-class task.
Figure 2: Test coverage as a function of risk tolerance $\alpha$ ($\delta{=}0.10$). (a) MASSIVE: LTT + Emp. Bernstein (teal) dominates Hoeffding (orange) at all $\alpha$. (b) NyayaBench v2: PAC-Bayes with transfer (red) is the only method achieving meaningful coverage below $\alpha{=}0.15$.
Figure 3: Guaranteed coverage as a function of calibration set size ($\alpha{=}0.10$, $\delta{=}0.10$). (a) MASSIVE: LTT + Hoeffding (teal) reaches 62% coverage at $n{=}150$ and 94% at $n{=}549$; Hoeffding + union (orange) remains infeasible until $n{=}400$. (b) NyayaBench v2: PAC-Bayes transfer (red) is the only method achieving coverage at any $n$, stabilizing at ${\approx}14\%$ from $n{=}50$ onward. Shaded regions show $\pm 1$ standard deviation across 20 subsamples.
Figure 4: Correction term $C(n, \delta{=}0.10)$ as a function of calibration set size. The dashed line marks $\alpha{=}0.10$; a method achieves feasibility when its correction falls below this line (assuming $\hat{R}(\tau) \approx 0$). LTT + Hoeffding crosses at $n \approx 120$; Hoeffding + union bound requires $n \approx 350$.
Figure 5: Per-intent guaranteed coverage ($\alpha{=}0.10$, $\delta{=}0.10$). (a) MASSIVE: only check_calendar ($n_\text{cal}{=}167$, 92.2% accuracy) achieves subgroup feasibility with LTT + Bernstein, at 60.7% coverage. All other intents have insufficient per-class calibration data. (b) NyayaBench v2: no intent achieves per-class feasibility (largest class has $n_\text{cal}{=}16$).
...and 1 more figure
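The feasibility crossings quoted in the Figure 4 caption can be sanity-checked with a short calculation. The Python sketch below assumes the standard one-sided Hoeffding correction $\sqrt{\ln(m/\delta)/(2n)}$ with $\hat{R}(\tau) \approx 0$, where $m$ is the number of hypotheses sharing the error budget. The helper name `min_n_hoeffding` and the value $K = 100$ for the union bound are illustrative assumptions, not figures from the source.

```python
import math

def min_n_hoeffding(alpha, delta, num_tests=1):
    """Smallest n with sqrt(ln(num_tests / delta) / (2 n)) <= alpha."""
    return math.ceil(math.log(num_tests / delta) / (2.0 * alpha ** 2))

# One test at full level delta, as in fixed-sequence (LTT-style) testing.
print(min_n_hoeffding(0.10, 0.10))                 # 116
# Union bound over K = 100 thresholds (K is an assumed value).
print(min_n_hoeffding(0.10, 0.10, num_tests=100))  # 346
```

Under these assumptions the two values land close to the $n \approx 120$ and $n \approx 350$ crossings in the caption, and the gap between them comes entirely from the $\ln(K)$ term.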

Theorems & Definitions (9)

Definition 1: Unsafe cache hit rate
Definition 2: Coverage
Proposition 1: Risk-controlled cache threshold
Definition 3: Transfer-Informed Betting
Theorem 1: Transfer-Informed Betting dominance
proof
Corollary 1: Finite-sample convergence rate
proof
Proposition 2: Optimality of source-informed warm-start
