No Need for Learning to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction

Tim Bary, Benoît Macq, Louis Petit

Abstract

AI systems often struggle to provide reliable predictions across all inputs, motivating hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training models to selectively defer to human experts. However, these approaches require extensive training data annotated by all experts and are sensitive to changes in expert composition, necessitating costly retraining. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method leverages prediction sets from a conformal predictor to quantify label-specific uncertainty and selects the most suitable expert using a segregativity criterion, which measures how well an expert discriminates among plausible labels. Experiments across three models on CIFAR10-H and HAM10000 demonstrate that our method can reduce the number of training labels per expert by up to 91.3% while maintaining predictive accuracy in low-data regimes. Being training-free, it also reduces training time by two orders of magnitude, offering a scalable alternative to L2D for real-world human-AI collaboration.

Paper Structure

This paper contains 28 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Our proposed deferral framework. Given an input $x$, a conformal predictor based on a pre-trained model produces a prediction set $\Gamma_\alpha(x)$. If $|\Gamma_\alpha(x)|>1$, the decision is deferred to the expert with the highest segregativity $\varsigma_k$. An expert’s segregativity is defined as its accuracy on the sub-matrix of its confusion matrix restricted to the labels within $\Gamma_\alpha(x)$.
  • Figure 2: Influence of the miscoverage rate $\alpha$ on system accuracy and expert workload for our framework on CIFAR10-H and HAM10000, across the considered conformal scoring functions. Results are averaged over five random seeds and three model architectures; per-model trends are consistent. Shaded regions indicate 95% confidence intervals.
  • Figure 3: Accuracy difference between L2D baselines and segregativity-based deferral on CIFAR10-H as a function of the number of expert-labeled samples. Points show per-seed differences, while curves and shaded regions denote the fitted mixed-effects model and its 95% confidence interval. The confidence bounds identify the dataset sizes at which L2D reliably matches or exceeds our framework’s accuracy using 1,500 expert labels.
  • Figure 4: System accuracy of segregativity-based deferral at $\alpha^*$ as a function of expert assessment set size for CIFAR10-H and HAM10000 across conformal scoring functions. Results are averaged over five random seeds and three model architectures; per-model trends are consistent. Shaded regions denote 95% confidence intervals.
  • Figure 5: Difference in system accuracy between optimal hyper-parameters selected in hindsight and those estimated using meta-validation sets of increasing size. Results are averaged over five random seeds and three model architectures; per-model trends are consistent. Shaded regions denote 95% confidence intervals.
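
The deferral rule described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `segregativity` and `choose_expert`, the row/column convention of the confusion matrix (rows as true labels, columns as expert answers), and the handling of an expert with no observations on the plausible labels are all assumptions made for illustration. Prediction sets of size 0 or 1 are simply not deferred here, since the caption only specifies deferral when $|\Gamma_\alpha(x)|>1$.

```python
from typing import Optional, Sequence

Matrix = Sequence[Sequence[float]]


def segregativity(confusion: Matrix, labels: Sequence[int]) -> float:
    """Accuracy on the confusion sub-matrix restricted to `labels`.

    Assumption: confusion[i][j] counts samples with true label i that
    the expert assigned to label j.
    """
    correct = sum(confusion[i][i] for i in labels)
    total = sum(confusion[i][j] for i in labels for j in labels)
    # Assumption: an expert never observed on these labels scores 0.
    return correct / total if total > 0 else 0.0


def choose_expert(
    prediction_set: Sequence[int], confusions: Sequence[Matrix]
) -> Optional[int]:
    """Return the index of the expert to defer to, or None if no deferral."""
    if len(prediction_set) <= 1:
        # Singleton (or empty) set: keep the model's own decision.
        return None
    scores = [segregativity(c, prediction_set) for c in confusions]
    # Ties are broken in favour of the lowest expert index.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical confusion matrices for two experts on three labels.
expert_a = [[8, 2, 0], [3, 7, 0], [0, 0, 10]]
expert_b = [[9, 0, 1], [0, 9, 1], [4, 4, 2]]

print(choose_expert([0, 1], [expert_a, expert_b]))  # expert_b separates labels 0 and 1 better
print(choose_expert([0, 2], [expert_a, expert_b]))  # expert_a separates labels 0 and 2 better
print(choose_expert([2], [expert_a, expert_b]))     # singleton set, no deferral
```

In this toy example the chosen expert changes with the prediction set, which is the point of scoring experts only on the labels that remain plausible rather than on their overall accuracy.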