MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity
Xiaqiang Tang, Qiang Gao, Jian Li, Nan Du, Qi Li, Sihong Xie
TL;DR
MBA-RAG addresses inefficiencies in retrieval-augmented generation by applying a multi-armed bandit to adaptively choose among retrieval strategies based on query complexity, guided by a dynamic reward that balances accuracy and retrieval cost. It encodes queries with DistilBERT, selects actions via an epsilon-greedy policy, and updates a predictive reward model to minimize the squared error $(r_a - f_\theta(x)_a)^2$ with reward $r_a = \mathcal{A}(y, \hat{y}_a) - \lambda C(a)$. Across six datasets (three single-hop and three multi-hop), MBA-RAG achieves strong generation accuracy while reducing retrieval steps by about 20%, outperforming or matching baselines and highlighting the value of partial supervision and cost-aware learning in RAG. The approach demonstrates practical gains for knowledge-intensive NLP tasks and offers a scalable, flexible framework for dynamic retrieval that can adapt to varying query complexities. Overall, MBA-RAG contributes a principled mechanism to balance quality and efficiency in RAG systems, with implications for deploying cost-effective, high-performing knowledge-enabled language models.
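The selection and update loop summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: a linear reward model stands in for the DistilBERT-based predictor, a hand-made two-dimensional feature vector stands in for the query encoding, and the arm costs, the value of $\lambda$, and the toy accuracy function are all hypothetical.

```python
import random

# Hypothetical toy setup (not from the paper): three arms with made-up costs.
ARM_COSTS = [0, 1, 3]   # assumed retrieval steps C(a) for arms 0, 1, 2
LAMBDA = 0.1            # assumed cost weight lambda


def reward(accuracy, cost, lam=LAMBDA):
    """Cost-aware reward r_a = A(y, y_hat_a) - lambda * C(a)."""
    return accuracy - lam * cost


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit with a linear reward model, f_theta(x)_a = w_a . x."""

    def __init__(self, n_arms, dim, epsilon=0.1, lr=0.05, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        self.weights = [[0.0] * dim for _ in range(n_arms)]

    def predict(self, x):
        """Predicted reward for every arm given query features x."""
        return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in self.weights]

    def greedy(self, x):
        """Arm with the highest predicted reward (pure exploitation)."""
        preds = self.predict(x)
        return max(range(self.n_arms), key=preds.__getitem__)

    def select(self, x):
        """Explore a random arm with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        return self.greedy(x)

    def update(self, x, arm, r):
        """One gradient step on (r - f_theta(x)_arm)^2 for the chosen arm only."""
        err = r - self.predict(x)[arm]
        w = self.weights[arm]
        for i, x_i in enumerate(x):
            w[i] += self.lr * 2.0 * err * x_i


def toy_accuracy(x, arm):
    """Made-up environment: simple queries ([1, 0]) are answered by every arm,
    complex queries ([0, 1]) only by the most expensive arm."""
    is_complex = x[1] == 1.0
    return 1.0 if (not is_complex or arm == 2) else 0.0


def run_toy_demo(steps=3000, seed=0):
    """Train the bandit on a random stream of simple and complex toy queries."""
    env_rng = random.Random(seed + 1)
    bandit = EpsilonGreedyBandit(n_arms=3, dim=2, epsilon=0.2, lr=0.05, seed=seed)
    for _ in range(steps):
        x = env_rng.choice([[1.0, 0.0], [0.0, 1.0]])
        arm = bandit.select(x)
        r = reward(toy_accuracy(x, arm), ARM_COSTS[arm])
        bandit.update(x, arm, r)  # partial feedback: only the pulled arm is updated
    return bandit


if __name__ == "__main__":
    trained = run_toy_demo()
    print("simple query  ->", trained.greedy([1.0, 0.0]))
    print("complex query ->", trained.greedy([0.0, 1.0]))
```

In this toy environment the rewards for a simple query are 1.0, 0.9, and 0.7 for arms 0, 1, and 2, so the cost penalty steers the policy toward the cheapest arm even though every arm answers correctly. For a complex query the rewards are 0, -0.1, and 0.7, so the most expensive arm is still preferred when it is the only one that succeeds.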
Abstract
Retrieval-Augmented Generation (RAG) has proven highly effective at boosting the generative performance of language models on knowledge-intensive tasks. However, existing RAG frameworks either perform retrieval indiscriminately or rely on rigid single-class classifiers to select retrieval methods, leading to inefficiencies and suboptimal performance across queries of varying complexity. To address these challenges, we propose a reinforcement learning-based framework that dynamically selects the most suitable retrieval strategy based on query complexity. Our approach leverages a multi-armed bandit algorithm, which treats each retrieval method as a distinct "arm" and adapts the selection process by balancing exploration and exploitation. Additionally, we introduce a dynamic reward function that balances accuracy and efficiency, penalizing methods that require more retrieval steps, even if they lead to a correct result. Our method achieves new state-of-the-art results on multiple single-hop and multi-hop datasets while reducing retrieval costs. Our code is available at https://github.com/FUTUREEEEEE/MBA.
