Cross-Domain Uncertainty Quantification for Selective Prediction: A Comprehensive Bound Ablation with Transfer-Informed Betting

Abhinaba Basu

TL;DR

The paper proves that the Transfer-Informed Betting (TIB) wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence.

Abstract

We present a comprehensive ablation of nine finite-sample bound families for selective prediction with risk control, combining concentration inequalities (Hoeffding, Empirical Bernstein, Clopper-Pearson, Wasserstein DRO, CVaR) with multiple-testing corrections (union bound, Learn Then Test (LTT) fixed-sequence) and betting-based confidence sequences (WSR). Our main theoretical contribution is Transfer-Informed Betting (TIB), which warm-starts the WSR wealth process using a source domain's risk profile, achieving tighter bounds in data-scarce settings with a formal dominance guarantee. We prove that the TIB wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence. The combination of betting-based confidence sequences, LTT monotone testing, and cross-domain transfer is a three-way combination that, to our knowledge, has not appeared in the literature. We evaluate all nine bound families on four benchmarks: MASSIVE (n=1,102), NyayaBench (n=280), CLINC-150 (n=22.5K), and Banking77 (n=13K), across 18 (alpha, delta) configurations. On MASSIVE at alpha=0.10, LTT eliminates the ln(K) union-bound penalty, achieving 94.0% guaranteed coverage versus 73.8% for Hoeffding, a 27% relative improvement. On NyayaBench, where the small calibration set makes Hoeffding-family bounds infeasible below alpha=0.20, Transfer-Informed Betting achieves 18.5% coverage at alpha=0.10, a 5.4x improvement over LTT + Hoeffding. We additionally compare with split-conformal prediction, showing that conformal methods produce prediction sets (avg. 1.67 classes) whereas selective prediction provides single-prediction risk guarantees. We apply these methods to agentic caching systems, formalizing a progressive trust model where the guarantee determines when cached responses can be served autonomously.

Paper Structure (28 sections, 4 theorems, 21 equations, 6 figures, 5 tables)


Key Result

Proposition 1

Let $C(n, \delta)$ be a finite-sample correction satisfying $\Pr[\hat{R}(\tau) + C(n, \delta) \geq R(\tau)] \geq 1 - \delta$ for a single $\tau$, and let $\tau_K > \tau_{K-1} > \cdots > \tau_1$ be tested in decreasing order using a valid multiple-testing correction. Then the threshold $\tau^*$ selected by this procedure satisfies $\Pr[R(\tau^*) \leq \alpha] \geq 1 - \delta$.
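
The selection procedure in Proposition 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it takes $C(n, \delta)$ to be the one-sided Hoeffding correction $\sqrt{\ln(1/\delta)/(2n)}$, defines the empirical risk $\hat{R}(\tau)$ as the fraction of calibration examples that are both accepted (confidence at least $\tau$) and wrong, and uses fixed-sequence testing as the multiple-testing correction. The function names `hoeffding_correction` and `select_threshold` are hypothetical.

```python
import math
from typing import Optional, Sequence


def hoeffding_correction(n: int, delta: float) -> float:
    """One-sided Hoeffding correction for n losses bounded in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def select_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
    alpha: float,
    delta: float,
) -> Optional[float]:
    """Fixed-sequence search for the lowest threshold certified at risk alpha.

    Assumption: the loss of an example at threshold tau is 1 if it is
    accepted (confidence >= tau) and wrong, else 0, so the empirical risk
    can only grow as tau decreases.
    """
    n = len(confidences)
    c = hoeffding_correction(n, delta)
    selected = None
    # Test from the most conservative threshold downward; each test uses the
    # full level delta, and the search stops at the first failure.
    for tau in sorted(thresholds, reverse=True):
        unsafe = sum(
            1 for p, ok in zip(confidences, correct) if p >= tau and not ok
        )
        if unsafe / n + c <= alpha:
            selected = tau
        else:
            break
    return selected
```

If even the most conservative threshold fails its test, the function returns `None`, which corresponds to the "infeasible" cases reported for small calibration sets.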

Figures (6)

  • Figure 1: Reliability diagrams for SetFit confidence calibration. (a) MASSIVE: bars show pre-calibration binned accuracy; dots show post-temperature-scaling. (b) NyayaBench v2: higher ECE reflects the harder 20-class task.
  • Figure 2: Test coverage as a function of risk tolerance $\alpha$ ($\delta{=}0.10$). (a) MASSIVE: LTT + Emp. Bernstein (teal) dominates Hoeffding (orange) at all $\alpha$. (b) NyayaBench v2: PAC-Bayes with transfer (red) is the only method achieving meaningful coverage below $\alpha{=}0.15$.
  • Figure 3: Guaranteed coverage as a function of calibration set size ($\alpha{=}0.10$, $\delta{=}0.10$). (a) MASSIVE: LTT + Hoeffding (teal) reaches 62% coverage at $n{=}150$ and 94% at $n{=}549$; Hoeffding + union (orange) remains infeasible until $n{=}400$. (b) NyayaBench v2: PAC-Bayes transfer (red) is the only method achieving coverage at any $n$, stabilizing at ${\approx}14\%$ from $n{=}50$ onward. Shaded regions show $\pm 1$ standard deviation across 20 subsamples.
  • Figure 4: Correction term $C(n, \delta{=}0.10)$ as a function of calibration set size. The dashed line marks $\alpha{=}0.10$; a method achieves feasibility when its correction falls below this line (assuming $\hat{R}(\tau) \approx 0$). LTT + Hoeffding crosses at $n \approx 120$; Hoeffding + union bound requires $n \approx 350$.
  • Figure 5: Per-intent guaranteed coverage ($\alpha{=}0.10$, $\delta{=}0.10$). (a) MASSIVE: only check_calendar ($n_\text{cal}{=}167$, 92.2% accuracy) achieves subgroup feasibility with LTT + Bernstein, at 60.7% coverage. All other intents have insufficient per-class calibration data. (b) NyayaBench v2: no intent achieves per-class feasibility (largest class has $n_\text{cal}{=}16$).
  • ...and 1 more figure
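
The feasibility crossings described for Figure 4 can be checked with a short calculation. The sketch below assumes the correction is the standard Hoeffding term $\sqrt{\ln(K/\delta)/(2n)}$, where $K$ is the number of thresholds under a union bound and $K{=}1$ when fixed-sequence testing removes the penalty. The value $K{=}100$ is purely illustrative, since the grid size is not stated in this summary; the helper names are hypothetical.

```python
import math


def hoeffding_correction(n: int, delta: float, num_tests: int = 1) -> float:
    """Hoeffding correction; num_tests > 1 applies a union bound over tests."""
    return math.sqrt(math.log(num_tests / delta) / (2.0 * n))


def min_feasible_n(alpha: float, delta: float, num_tests: int = 1) -> int:
    """Smallest n whose correction is at or below alpha (empirical risk ~ 0)."""
    return math.ceil(math.log(num_tests / delta) / (2.0 * alpha ** 2))


# No union-bound penalty (K = 1), alpha = delta = 0.10.
print(min_feasible_n(0.10, 0.10))                 # 116

# Union bound over an assumed, illustrative grid of K = 100 thresholds.
print(min_feasible_n(0.10, 0.10, num_tests=100))  # 346
```

Under these assumptions the single-test requirement of about 116 calibration points is in line with the $n \approx 120$ crossing in the caption, and a grid on the order of 100 thresholds would put the union-bound requirement near the reported $n \approx 350$.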

Theorems & Definitions (9)

  • Definition 1: Unsafe cache hit rate
  • Definition 2: Coverage
  • Proposition 1: Risk-controlled cache threshold
  • Definition 3: Transfer-Informed Betting
  • Theorem 1: Transfer-Informed Betting dominance
  • proof
  • Corollary 1: Finite-sample convergence rate
  • proof
  • Proposition 2: Optimality of source-informed warm-start