Learning-to-Defer with Expert-Conditioned Advice

Yannis Montreuil, Leina Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

Abstract

Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed at decision time. Many modern systems violate this assumption: after selecting an expert, one may also choose what additional information that expert should receive, such as retrieved documents, tool outputs, or escalation context. We study this problem and call it Learning-to-Defer with advice. We show that a broad family of natural separated surrogates, which learn routing and advice with distinct heads, is inconsistent even in the smallest non-trivial setting. We then introduce an augmented surrogate that operates on the composite expert--advice action space and prove an $\mathcal{H}$-consistency guarantee together with an excess-risk transfer bound, yielding recovery of the Bayes-optimal policy in the limit. Experiments on tabular, LLM, and multimodal tasks show that the resulting method improves over standard Learning-to-Defer while adapting its advice-acquisition behavior to the cost regime.

Paper Structure (86 sections, 21 theorems, 141 equations, 10 figures, 19 tables, 4 algorithms)

Key Result

Lemma 2

For $\mathbb{P}_X$-a.e. $x$, the policy $(r^\star, q^\star)$ that minimizes $\mathbb{E}[\ell_{\mathrm{def\text{-}adv}}(r, q;\, X, A, Y, \mathbf{e})]$ over all measurable policies satisfies:

Figures (10)

  • Figure 1: The deferral-advice protocol. From the input $x$, the router produces expert scores and the query model produces a $J\times(K+1)$ query-score matrix. Taking the argmax along each row yields one advice index per expert, but only the routed row is ever executed: the protocol uses the pair $\bigl(r(x),\,q(x,r(x))\bigr)$, reveals the corresponding masked advice, and executes expert $r(x)$ under that advice. When $q(x,r(x))=0$, the revealed advice is the null masked advice $\widetilde{a}^{(0)}$.
  • Figure 2: Structured parameterization of the composite policy score. The final decision remains a single argmax over executed expert--advice pairs, while the score decomposes into a routing term and an expert-conditional advice adjustment.
  • Figure 3: Synthetic test excess risk as a function of train size. Our method approaches zero excess risk, whereas the exact separated surrogate and L2D remain bounded away from the Bayes optimum.
  • Figure 4: Learned decision maps at the largest train size ($n=5000$). The vertical dashed line marks the transition between the theorem region $R_-$ and the advice-helpful region $R_+$. Our method recovers the Bayes split, the separated surrogate selects the wrong expert on the left region, and L2D cannot realize the queried Bayes action on the right region.
  • Figure 5: Deployed advice distribution of our method at $\lambda=0$ in FEVER. The left panel shows the overall fraction of validation examples assigned to each advice action. The right panel conditions on the routed expert. When advice is free, the learned policy uses queried advice on most examples, but the preferred advice level already varies across experts.
  • ...and 5 more figures
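
The decision rules described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy scores, and the additive form of the composite score are assumptions made for illustration. The captions only state that the router produces expert scores, that the query model produces a $J\times(K+1)$ score matrix whose row-wise argmax gives one advice index per expert, that only the routed row is executed, and that the structured score decomposes into a routing term and an expert-conditional advice adjustment.

```python
def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def deferral_advice_action(router_scores, query_scores):
    """Figure 1 protocol (sketch): return the executed pair (r, q).

    router_scores: J expert scores.
    query_scores:  J rows of K+1 advice scores; column 0 is the null advice.
    """
    if len(router_scores) != len(query_scores):
        raise ValueError("need one query-score row per expert")
    # Row-wise argmax: one advice index per expert.
    advice_per_expert = [argmax(row) for row in query_scores]
    # Only the routed expert's row is ever executed.
    r = argmax(router_scores)
    return r, advice_per_expert[r]


def composite_action(routing_term, advice_adjustment):
    """Figure 2 parameterization (sketch): single argmax over pairs.

    ASSUMPTION: the score of pair (j, k) is routing_term[j] plus
    advice_adjustment[j][k]; the caption does not give the exact form.
    """
    best_pair, best_score = None, float("-inf")
    for j, base in enumerate(routing_term):
        for k, adjustment in enumerate(advice_adjustment[j]):
            score = base + adjustment
            if score > best_score:
                best_pair, best_score = (j, k), score
    return best_pair


# Hypothetical scores for J = 3 experts and K = 2 non-null advice options.
router_scores = [0.2, 1.5, -0.3]
query_scores = [
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 1.1],
    [2.0, 0.3, 0.1],
]

pair = deferral_advice_action(router_scores, query_scores)
composite_pair = composite_action(router_scores, query_scores)
null_pair = deferral_advice_action([0.0, 0.0, 3.0], query_scores)
```

In this toy input, `null_pair` routes to the third expert, whose row-wise argmax is column 0, so the null masked advice $\widetilde{a}^{(0)}$ would be revealed.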

Theorems & Definitions (27)

  • Definition 1: $\mathcal{H}_r$-consistency bound
  • Definition 2: True deferral-advice loss
  • Lemma 2: Bayes-optimal deferral-advice policy
  • Lemma 2: Advice acquisition condition
  • Lemma 2: Deferral with advice dominates standard deferral
  • Theorem 3: Broad Fisher inconsistency of separated router/query surrogates
  • Lemma 3: Augmented deferral-advice surrogate
  • Theorem 4: $\mathcal{H}_\pi$-consistency of the augmented surrogate
  • Corollary 5: Asymptotic Bayes-risk consistency
  • Example 1: Running conditional-cost table
  • ...and 17 more
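
As a rough illustration of the listed results on the Bayes-optimal policy and on advice versus standard deferral, the sketch below selects an expert-advice pair from a conditional-cost table. It does not reproduce the paper's statements: it assumes that entry `costs[j][k]` is the conditional expected cost of executing expert `j` under advice `k` at a fixed input (any advice fee included), that column 0 is the null advice, and that standard deferral corresponds to restricting every expert to column 0. The table values and function names are hypothetical.

```python
def bayes_pair(costs):
    """Return the pair (j, k) with the smallest conditional expected cost."""
    pairs = [(j, k) for j, row in enumerate(costs) for k in range(len(row))]
    return min(pairs, key=lambda jk: costs[jk[0]][jk[1]])


def standard_deferral_expert(costs):
    """Best expert when every expert is restricted to the null advice."""
    return min(range(len(costs)), key=lambda j: costs[j][0])


def advice_is_acquired(costs, j):
    """True if some non-null advice strictly lowers expert j's cost."""
    return min(costs[j][1:], default=float("inf")) < costs[j][0]


# Hypothetical table: 2 experts, null advice (column 0) and one queried advice.
costs = [
    [0.30, 0.25],
    [0.40, 0.10],
]

j_star, k_star = bayes_pair(costs)
j_standard = standard_deferral_expert(costs)
with_advice_cost = costs[j_star][k_star]
standard_cost = costs[j_standard][0]
```

In this toy table, the best pair is expert 1 with queried advice, whereas restricting to the null column selects expert 0. The minimum over all pairs can never exceed the minimum over the null column alone, since the null-advice pairs are among the candidates.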