Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts
Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
TL;DR
This work extends Learning-to-Defer (L2D) to deferring to multiple experts by introducing Top-$k$ L2D, a framework that allocates queries to the $k$ most cost-effective entities and unifies one-stage and two-stage regimes. It further adds Top-$k(x)$, an adaptive variant that learns the optimal number of consulted experts per input, under a $k$-independent convex surrogate loss with Bayes-, $\mathcal{H}_h$-, and $(\mathcal{H}_r, \mathcal{H}_g)$-consistency guarantees. Theoretical results show optimal top-$k$ selection is achieved by ranking entities by their expected costs, and the surrogates remain consistent for all $k$, thereby generalizing prior Top-1 L2D and selective prediction. Empirical results across both one- and two-stage settings demonstrate that Top-$k$ and Top-$k(x)$ achieve superior accuracy-cost trade-offs and robust performance across CIFAR-10, CIFAR-100, SVHN, and California Housing datasets, highlighting the practical value of multi-expert deferral.
Abstract
Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.
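To make the selection rule concrete, the following minimal sketch illustrates allocating a query to the $k$ most cost-effective entities by ranking them by expected cost. It is an illustration under stated assumptions, not the authors' implementation: the function name `top_k_entities` and the cost values are hypothetical, and the expected costs are assumed to be given rather than estimated by a learned surrogate.

```python
import numpy as np


def top_k_entities(expected_costs, k):
    """Return the indices of the k entities with the lowest expected cost.

    `expected_costs` is a hypothetical per-query vector with one entry per
    entity (for example, the predictor and each expert). How these costs are
    estimated is outside the scope of this sketch.
    """
    costs = np.asarray(expected_costs, dtype=float)
    if costs.ndim != 1:
        raise ValueError("expected_costs must be a 1-D sequence")
    if not 1 <= k <= costs.size:
        raise ValueError("k must be between 1 and the number of entities")
    # A stable sort keeps the original entity order when costs are tied.
    order = np.argsort(costs, kind="stable")
    return order[:k].tolist()


# Illustrative (made-up) expected costs for four entities on one query.
costs = [0.40, 0.15, 0.55, 0.20]

top1 = top_k_entities(costs, 1)  # k = 1 reduces to picking the cheapest entity
top3 = top_k_entities(costs, 3)  # k = 3 consults the three cheapest entities
```

Because the ranking does not depend on $k$, the same ordering serves every value of $k$: the Top-1 choice is always the first element of the Top-$k$ list.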
