When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer

Shuqi Liu; Yuzhou Cao; Lei Feng; Bo An; Luke Ong

TL;DR

This work shows that expanding the expert pool in Learning to Defer (L2D) induces inherent underfitting through the expert aggregation term, a phenomenon absent in single-expert L2D. It develops PiCCE (Pick the Confident and Correct Expert), a continuous surrogate that reduces the problem to a single-expert-like learning problem by restricting expert selection to empirically correct experts, using ground-truth labels as evidence. The authors prove PiCCE's continuity, classifier consistency, and L2D-system consistency under standard losses, and demonstrate through extensive experiments on synthetic and real-world data that PiCCE significantly improves system accuracy and coverage as the number of experts grows. The results indicate that PiCCE effectively mitigates multi-expert underfitting and performs robustly in realistic settings with diverse expert pools.
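To make the two reported metrics concrete, the following is a minimal sketch of how system accuracy and coverage are commonly computed for a multi-expert L2D system. It assumes the usual setup in which the model emits K class scores followed by J deferral scores and acts on the arg-max; the function names `l2d_predict` and `system_metrics` are hypothetical and do not come from the paper.

```python
def l2d_predict(scores, expert_preds, num_classes):
    """Return (final_label, deferred) for one instance.

    scores       : list of K class scores followed by J deferral scores
    expert_preds : list of the J experts' predicted labels for this instance
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best < num_classes:
        # The classifier keeps the instance and predicts itself.
        return best, False
    # Otherwise the instance is deferred to the selected expert.
    return expert_preds[best - num_classes], True


def system_metrics(all_scores, all_expert_preds, labels, num_classes):
    """System accuracy (final prediction correct) and coverage (not deferred)."""
    n = len(labels)
    correct = 0
    deferred = 0
    for scores, preds, y in zip(all_scores, all_expert_preds, labels):
        label, was_deferred = l2d_predict(scores, preds, num_classes)
        correct += int(label == y)
        deferred += int(was_deferred)
    return {"system_accuracy": correct / n, "coverage": 1 - deferred / n}
```

Under these assumed definitions, a system can have high accuracy and low coverage at the same time if it defers most instances to strong experts, which is why the two numbers are reported together.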

Abstract

Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier's underfitting becomes inherent, which seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically reveal that this stems from an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem that is absent in the single-expert case and that renders existing underfitting remedies ineffective. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and its ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.
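The contrast between aggregating over all experts and picking a single one can be sketched in a few lines. The first function assumes the softmax form of the multi-expert cross-entropy surrogate (one log-term for the true class plus one per correct expert). The second, `picce_like`, is only a hypothetical illustration of the "pick the confident and correct expert" idea: it keeps a single expert term, the most confident among the experts that agree with the ground truth. It is not the paper's exact Loss_PiCCE, whose definition is not reproduced here.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def multi_expert_ce(scores, y, expert_preds, num_classes):
    """Assumed softmax surrogate: true-class term plus one term per correct expert."""
    logp = log_softmax(scores)
    loss = -logp[y]
    for j, m in enumerate(expert_preds):
        if m == y:
            loss -= logp[num_classes + j]  # aggregation over all correct experts
    return loss


def picce_like(scores, y, expert_preds, num_classes):
    """Hypothetical single-expert-like variant: at most one expert term."""
    logp = log_softmax(scores)
    loss = -logp[y]
    correct = [num_classes + j for j, m in enumerate(expert_preds) if m == y]
    if correct:
        # Keep only the most confident expert among the empirically correct ones.
        loss -= max(logp[i] for i in correct)
    return loss
```

In this sketch, when c experts are correct on an instance, the aggregated loss is minimized by splitting the softmax mass evenly over 1 + c outputs, which pushes the true-class probability toward 1/(1 + c). The single-term variant is minimized with the mass split over two outputs regardless of c, which mirrors the single-expert case. The maximum over correct experts is also a continuous function of the scores.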

Paper Structure (49 sections, 5 theorems, 32 equations, 3 figures, 4 tables)

Introduction
Preliminaries
Problem Formulation and Existing Losses for L2D with Multiple Experts
Data Generating Distribution:
Technical Notations:
Problem Setup:
Existing Loss Frameworks:
Underfitting Issues in L2D
More Experts, Worse Performance: A New Underfitting Challenge
Multi-Expert Can Cause Underfitting
Failure of Merely Using Intermediate Results
PiCCE: Using Both Intermediate and Empirical Results
Regulating Confident Experts with Ground-truth
Underfitting-resistance of PiCCE
Consistency Guarantee
...and 34 more sections

Key Result

Theorem 2

The proposed formulation Loss_PiCCE is continuous if $\bm{\phi}$ is continuous and symmetric w.r.t. its last $J$ inputs, i.e., $P\bm{\phi}(\bm{\theta})=\bm{\phi}(P\bm{\theta})$ for permutation matrices $P\in\mathbb{R}^{(K+J)\times(K+J)}$ such that $P_{i,i}=1$ for $i\in[K]$, and $\bm{\phi}(\bm{\theta})=[\p

Figures (3)

Figure 1: Left: Illustration of underfitting on ImageNet when using the multi-expert cross-entropy (CE) surrogate loss proposed by Verma et al. (2023). We consider a MobileNet-v2 model and progressively introduce "dog experts", where each expert covers a domain consisting of 5 dog species, attaining $85\%$ accuracy on its domain, $75\%$ on the other dog species, and random guessing on the remaining classes. Since the experts have non-overlapping domains, adding more experts strictly increases the aggregate accuracy of the expert set. We report the test accuracy of both the system and the classifier. Right: An illustration of the degraded distribution for a 5-class classification task. We present the predicted class-posterior probabilities for an instance, in descending order, before and after the introduction of three experts.
Figure 2: Classifier accuracy vs. number of experts on synthetic (CIFAR-100, ImageNet) and real-world expert datasets (MiceBone, Chaoyang). Solid lines denote methods derived from the existing formulation \ref{systemloss}, while dashed lines correspond to those derived from the PiCCE formulation \ref{Loss_PiCCE}. Across all datasets, existing methods exhibit a performance drop as the number of experts increases, whereas PiCCE remains stable.
Figure 3: Classifier accuracy on the CIFAR-100 and ImageNet datasets. Solid lines denote methods derived from the existing formulation \ref{systemloss}, while dashed lines denote methods derived from our PiCCE formulation \ref{Loss_PiCCE}.
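The synthetic "dog experts" described in the Figure 1 caption can be sketched as follows. The per-class accuracies (85% on the expert's own domain, 75% on other dog species, uniform guessing elsewhere) come from the caption; everything else is an assumption made for illustration, including the function names, the uniform class distribution, the way a wrong label is drawn, and the reading of "aggregate accuracy" as the accuracy of the best expert per class.

```python
import random


def expert_accuracy(y, domain, dog_classes, num_classes):
    """Per-class accuracy of one synthetic expert, following the Figure 1 caption."""
    if y in domain:
        return 0.85
    if y in dog_classes:
        return 0.75
    return 1.0 / num_classes  # uniform random guessing on the remaining classes


def sample_expert_prediction(y, domain, dog_classes, num_classes, rng):
    """Draw one expert prediction for true label y."""
    if y not in dog_classes:
        return rng.randrange(num_classes)
    if rng.random() < expert_accuracy(y, domain, dog_classes, num_classes):
        return y
    # Assumption: a wrong answer is uniform over the other labels.
    wrong = rng.randrange(num_classes - 1)
    return wrong if wrong < y else wrong + 1


def aggregate_accuracy(domains, dog_classes, num_classes):
    """Assumed reading: mean over classes of the best expert's accuracy."""
    total = 0.0
    for y in range(num_classes):
        total += max(expert_accuracy(y, d, dog_classes, num_classes) for d in domains)
    return total / num_classes


# Hypothetical layout: 1000 classes, the first 100 treated as dog species.
dog_classes = set(range(100))
domains = [set(range(5 * j, 5 * j + 5)) for j in range(4)]  # non-overlapping domains
curve = [aggregate_accuracy(domains[: j + 1], dog_classes, 1000) for j in range(4)]
```

With non-overlapping domains, each added expert raises five more dog classes from 75% to 85% under this reading, so `curve` increases with every expert. Figure 1 contrasts this improving expert pool with the falling classifier accuracy.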

Theorems & Definitions (8)

Definition 1: PiCCE
Theorem 2: Continuity of PiCCE
Lemma 3: Risk of PiCCE
Lemma 4
Lemma 5: Consistency of Optimal Classifiers
Theorem 6: Consistency and Expert Accuracy Estimator
Example 1: Dominant Expert
Example 2: Expert Only Dominant at the Major Class
