Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

TL;DR

This work tackles efficient and reliable extractive QA on resource-constrained devices by introducing Learning-to-Defer, a framework that directly allocates input queries across a main model and on-demand experts. By formulating a true deferral loss and a differentiable surrogate deferral loss, the authors establish Bayes-consistency guarantees that ensure the learned rejector approaches the optimal deferral rule, allocating queries to the most confident agent. The approach integrates a lightweight rejector (TinyBERT) with offline expert models and demonstrates improved accuracy with reduced computational cost on SQuADv1, SQuADv2, and TriviaQA, outperforming larger models and ensembles in efficiency. Theoretical guarantees, surrogate loss design, and empirical results collectively support scalable, cost-aware EQA deployment in edge environments, with avenues for extending to broader tasks and dynamic deferral costs.

Abstract

Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.

Paper Structure

This paper contains 40 sections, 6 theorems, 34 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Lemma 0

Given an input $x \in \mathcal{X}$ and any distribution $\mathcal{D}$, the optimal rejection rule that minimizes the risk associated with the true deferral loss is given by: with $j^\ast = \mathop{\mathrm{arg\,min}}\limits_{j \in [J]} \eta_j^i(x)$.
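The argmin-based allocation in this lemma can be illustrated with a minimal sketch. It assumes, hypothetically, that an estimated expected cost is already available for the main model and for each of the $J$ experts. The names `allocate`, `main_cost`, and `expert_costs` are illustrative and do not come from the paper, and breaking ties in favour of the main model is likewise an assumption.

```python
from typing import Sequence


def allocate(main_cost: float, expert_costs: Sequence[float]) -> int:
    """Pick the agent with the lowest estimated expected cost.

    Returns 0 for the main model, or j in 1..J for the j-th expert.
    All inputs are hypothetical cost estimates (assumption, not the
    paper's exact quantities).
    """
    if not expert_costs:
        # No experts available: the main model must answer.
        return 0
    # j* = argmin over experts of the estimated cost (0-based index here).
    j_star = min(range(len(expert_costs)), key=lambda j: expert_costs[j])
    # Assumption: ties are resolved in favour of the main model.
    if main_cost <= expert_costs[j_star]:
        return 0
    return j_star + 1
```

For instance, `allocate(0.6, [0.5, 0.3])` defers to the second expert, while `allocate(0.2, [0.5, 0.3])` keeps the query on the main model.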

Figures (7)

  • Figure 1: Inference Step of Our Approach: The input data is processed through the rejector framework, which predicts both start and end spans. Based on the optimal rule defined in Equation \ref{optimal_rule}, the query is assigned to an agent that subsequently predicts the answer.
  • Figure 2: Comparison between the Exact Match metric and the Expert Allocation: (a) TriviaQA, (b) SQuADv1, (c) SQuADv2.
  • Figure 3: Combined Efficiency Comparison across benchmarks: (a) TriviaQA, (b) SQuADv1, (c) SQuADv2.
  • Figure 4: Combined Allocation Percentage across benchmarks: (a) TriviaQA, (b) SQuADv1, (c) SQuADv2.
  • Figure 5: From left to right: Model Cascades, Query Routing, Learning-To-Defer (Ours). Our approach supports the multi-model nature of Model Cascades while allowing direct inference, as in Query Routing approaches.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Definition 0: True Deferral Loss
  • Lemma 0: Bayes-Rejector
  • Definition 0: Surrogate Deferral Loss
  • Theorem 1: $(\mathcal{R}, \mathcal{G})$-consistency
  • Lemma 1: Optimal Deferral Rule for Single Allocation
  • Lemma 1: Bayes-Rejector
  • Theorem 1: $(\mathcal{R}, \mathcal{G})$-consistency
  • Lemma 1: $\mathcal{R}^i$-consistency bound