Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts

Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

TL;DR

This work extends Learning-to-Defer (L2D) to deferring to multiple experts by introducing Top-$k$ L2D, a framework that allocates queries to the $k$ most cost-effective entities and unifies one-stage and two-stage regimes. It further adds Top-$k(x)$, an adaptive variant that learns the optimal number of consulted experts per input, under a $k$-independent convex surrogate loss with Bayes-, $\mathcal{H}_h$-, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistency guarantees. Theoretical results show optimal top-$k$ selection is achieved by ranking entities by their expected costs, and the surrogates remain consistent for all $k$, thereby generalizing prior Top-1 L2D and selective prediction. Empirical results across both one- and two-stage settings demonstrate that Top-$k$ and Top-$k(x)$ achieve superior accuracy-cost trade-offs and robust performance across CIFAR-10, CIFAR-100, SVHN, and California Housing datasets, highlighting the practical value of multi-expert deferral.

Abstract

Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.

Paper Structure

This paper contains 78 sections, 16 theorems, 96 equations, 9 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Suppose the surrogate $\Phi_{01}^u$ is $\mathcal{H}_h$-calibrated for any distribution $\mathcal{D}$. Then there exists a non-decreasing function $\Gamma_u^{-1} : \mathbb{R}_+ \to \mathbb{R}_+$, depending on $u$, such that for all $h \in \mathcal{H}_h$,

Figures (9)

  • Figure 1: Performance of Top-$k$ and Top-$k(x)$ L2D across varying budgets $\overline{\beta}$. Each plot reports a different metric: (a) minimum RMSE, (b) uniform average RMSE, and (c) weighted average RMSE (\ref{app:exp_metrics}). Our approach outperforms the Top-1 L2D baseline [mao2024regressionmultiexpertdeferral].
  • Figure 2: Inference Step of Top-1 L2D [Narasimhan; mao2023twostage; mao2024regressionmultiexpertdeferral; mozannar2021consistent; mao2024principledapproacheslearningdefer]: Given a query $x$, we process it through the learned policy $\pi$. We select the entity with the highest score, $\hat{\pi}(x)=\mathop{\mathrm{arg\,max}}\limits_{j\in\mathcal{A}}\pi(x,j)$. Then, we query this entity and make the final prediction.
  • Figure 3: Inference Step of Top-$k$ L2D: Given a query $x$, we first process it through the policy learned using Algorithm \ref{alg:l2d_training}. Based on this, we select a fixed number $k$ of entities to query, forming the Top-$k$ Selection Set $\Pi_k(x)$, as defined in Definition \ref{def:top_k_set}. By construction, the expected size satisfies $\mathbb{E}_{X}[|\Pi_k(X)|] = k$. We then aggregate predictions from the selected top-$k$ entities using a decision rule, such as majority vote or weighted voting. The final prediction is produced by this committee according to the chosen rule.
  • Figure 4: Inference Step of Top-$k(x)$ L2D: Given a query $x$, we process it through both the policy $\pi$, trained using Algorithm \ref{alg:l2d_training}, and the cardinality function $k_\theta$, trained using Algorithm \ref{alg:cardinality_training}. Based on these two functions, we construct the Top-$k$ Selection Set. By construction, its expected size satisfies $\mathbb{E}_{X}[|\Pi_{\hat{k}_\theta(x)}(X)|] = \mathbb{E}_{X}[\hat{k}_\theta(X)]$. We then aggregate predictions from the top-$\hat{k}_\theta(x)$ entities using a decision rule (e.g., majority vote, weighted voting). The final prediction is produced by this committee of entities according to the chosen decision rule.
  • Figure 5: Comparison of Top-$k$ and Top-$k(x)$ One-Stage across four accuracy metrics on CIFAR-10. Top-$k(x)$ achieves better budget-accuracy trade-offs across all settings. For clarity, only the first 12 entities are shown. Results are averaged over 4 independent runs. The Top-$1$ L2D baseline corresponds to [mozannar2021consistent; mao2024principledapproacheslearningdefer].
  • ...and 4 more figures
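
The inference steps described in the captions of Figures 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy scores, and the tie-breaking rule in the vote are all assumptions made for the example, and the policy scores $\pi(x,j)$ are taken as given rather than learned.

```python
from collections import Counter

def top_k_selection(scores, k):
    """Indices of the k highest-scoring entities, best first (hypothetical helper)."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of entities")
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def majority_vote(predictions):
    """Most frequent prediction; ties go to the earliest-listed prediction (assumed rule)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:
        if counts[p] == best:
            return p

def predict(scores, entity_predictions, k):
    """Query the top-k entities and aggregate their predictions by majority vote."""
    selected = top_k_selection(scores, k)
    return majority_vote([entity_predictions[j] for j in selected])

# Toy policy scores pi(x, j) for four entities and the label each would output.
scores = [0.1, 0.7, 0.4, 0.9]
entity_predictions = ["cat", "dog", "dog", "cat"]

top1 = predict(scores, entity_predictions, k=1)  # Top-1: only the arg-max entity
top3 = predict(scores, entity_predictions, k=3)  # Top-3: committee of three
```

With $k=1$ the sketch reduces to the arg-max rule of Figure 2. For Top-$k(x)$, the fixed `k` would be replaced by the per-input output of a cardinality function, which is not modeled here.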

Theorems & Definitions (24)

  • Definition 0: One-Stage Deferral Loss
  • Definition 0: Two-Stage Deferral Loss
  • Theorem 1: $\mathcal{H}_h$-consistency bounds
  • Definition 1: Top-$k$ Selection Set
  • Remark 2
  • Lemma 2: Top-$k$ True Deferral Loss
  • Remark 3
  • Lemma 3: Upper Bound on the Top-$k$ Deferral Loss
  • Corollary 3: Surrogates for the Top-$k$ Deferral Loss
  • Lemma 3: Bayes-Optimal Top-$k$ Selection
  • ...and 14 more
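
The TL;DR states that optimal top-$k$ selection is achieved by ranking entities by their expected costs. The sketch below illustrates that idea on toy numbers, under the assumption that the cost of a selection set is the sum of the expected costs of its members. The cost values and function names are hypothetical, and the exhaustive search is included only as a sanity check on the ranking rule.

```python
from itertools import combinations

def lowest_cost_set(expected_costs, k):
    """Pick the k entities with the smallest expected cost (ranking rule)."""
    order = sorted(range(len(expected_costs)), key=lambda j: expected_costs[j])
    return sorted(order[:k])

def brute_force_best_set(expected_costs, k):
    """Check every size-k subset and keep the one with the smallest summed cost."""
    best = min(
        combinations(range(len(expected_costs)), k),
        key=lambda subset: sum(expected_costs[j] for j in subset),
    )
    return sorted(best)

# Toy expected costs for four entities at a single input x (made-up values).
expected_costs = [0.30, 0.05, 0.50, 0.20]

ranked = lowest_cost_set(expected_costs, 2)
exhaustive = brute_force_best_set(expected_costs, 2)
```

Under the additive-cost assumption, sorting once is enough and the same ranking serves every value of $k$, which matches the paper's point that a single learned policy can be deployed across $k$.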