Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees
Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
TL;DR
This work addresses efficient and reliable extractive question answering (EQA) on resource-constrained devices with a Learning-to-Defer framework that allocates each input query to either a main model or one of several on-demand experts. By formulating a true deferral loss and a differentiable surrogate for it, the authors establish Bayes-consistency guarantees: minimizing the surrogate yields a rejector that approaches the optimal deferral rule, which routes each query to the most confident agent. The approach pairs a lightweight rejector (TinyBERT) with offline expert models and reports improved accuracy at reduced computational cost on SQuADv1, SQuADv2, and TriviaQA, with better efficiency than larger models and ensembles. Together, the theoretical guarantees, surrogate loss design, and empirical results support scalable, cost-aware EQA deployment in edge environments, and leave room for extensions to broader tasks and dynamic deferral costs.
Abstract
Large language models (LLMs) excel in generative tasks but are inefficient at structured text selection, particularly in extractive question answering (EQA). This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach combines a principled allocation strategy with theoretical guarantees on optimal deferral, balancing performance against cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method improves answer reliability while significantly reducing computational overhead, making it well suited for scalable and efficient EQA deployment.
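To make the allocation idea concrete, the following is a minimal Python sketch of a cost-aware deferral rule, not the authors' implementation. It assumes that each agent (the main model at index 0, experts after it) comes with an estimated probability of answering correctly and a consultation cost; the function name `allocate_query`, the additive form of the expected cost, and the example numbers are all hypothetical choices made for illustration.

```python
from typing import Sequence


def allocate_query(confidences: Sequence[float], costs: Sequence[float]) -> int:
    """Return the index of the agent with the lowest expected cost.

    confidences[j]: assumed probability that agent j answers correctly.
    costs[j]: assumed cost of consulting agent j (0.0 for the main model).
    Expected cost is taken to be (1 - confidence) + cost, an illustrative
    choice rather than the loss defined in the paper.
    """
    if len(confidences) != len(costs) or len(confidences) == 0:
        raise ValueError("confidences and costs must be non-empty and equal in length")
    expected = [(1.0 - p) + c for p, c in zip(confidences, costs)]
    # Ties resolve to the lowest index, i.e. the main model is preferred.
    return min(range(len(expected)), key=expected.__getitem__)


# Hypothetical numbers: the expert is more confident than the main model.
cheap_expert = allocate_query([0.6, 0.9], [0.0, 0.1])   # expert is worth its cost
costly_expert = allocate_query([0.6, 0.9], [0.0, 0.4])  # cost outweighs the gain
```

When all costs are zero, this rule reduces to picking the most confident agent, which matches the optimal deferral rule described in the TL;DR. In the framework itself the routing decision is made by a learned rejector rather than by known confidences, so the sketch only illustrates the target behavior.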
