Unsupervised Query Routing for Retrieval Augmented Generation

Feiteng Mu, Liwen Zhang, Yong Jiang, Wenjie Li, Zhen Zhang, Pengjun Xie, Fei Huang

TL;DR

The paper tackles the challenge of routing queries to the most suitable search engine in retrieval-augmented generation without relying on annotated data. It introduces an unsupervised framework that treats the response generated from multi-source retrieval as an upper bound against which each single-source response is evaluated, enabling automatic label generation from real user queries. Labels are derived from a combination of similarity (BERTScore) and coherence (LLM-based ranking) metrics, and a ListMLE loss guides the training of the routing model. Across five datasets and multiple LLMs, the approach demonstrates strong scalability and generalization, offering a practical path to scalable tool learning in RAG systems.
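
To make the training signal concrete, the following is a minimal sketch of a ListMLE loss computed over per-source quality labels. It is an illustration under stated assumptions, not the authors' implementation: the function names, the three-source example, and the weighted-sum combination with weight `alpha` are hypothetical, since this summary does not specify how the similarity and coherence scores are combined.

```python
import math


def combine_labels(similarity, coherence, alpha=0.5):
    """Hypothetical label combination: a weighted sum of the two metrics.

    The summary only says the labels combine similarity and coherence;
    the weighted sum and alpha are assumptions made for illustration.
    """
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(similarity, coherence)]


def listmle_loss(pred_scores, target_scores):
    """ListMLE: negative Plackett-Luce log-likelihood of the target ordering."""
    # Rank the sources from best to worst according to the target labels.
    order = sorted(range(len(target_scores)),
                   key=lambda i: target_scores[i], reverse=True)
    s = [pred_scores[i] for i in order]

    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in tail))
        loss += log_denom - s[i]
    return loss


# Hypothetical scores for three search sources on one query.
labels = combine_labels(similarity=[0.9, 0.6, 0.3], coherence=[0.8, 0.5, 0.2])
good = listmle_loss([3.0, 2.0, 1.0], labels)  # router agrees with the labels
bad = listmle_loss([1.0, 2.0, 3.0], labels)   # router reverses the ordering
```

In this sketch the loss is smaller when the router's predicted scores rank the sources in the same order as the automatically derived labels, which is the behavior a listwise ranking objective is meant to encourage.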

Abstract

Query routing for retrieval-augmented generation aims to assign an input query to the most suitable search engine. Existing works rely heavily on supervised datasets that require extensive manual annotation, resulting in high costs and limited scalability, as well as poor generalization to out-of-distribution scenarios. To address these challenges, we introduce a novel unsupervised method that constructs the "upper-bound" response to evaluate the quality of retrieval-augmented responses. This evaluation enables the decision of the most suitable search engine for a given query. By eliminating manual annotations, our approach can automatically process large-scale real user queries and create training data. We conduct extensive experiments across five datasets, demonstrating that our method significantly enhances scalability and generalization capabilities.

Paper Structure (40 sections, 10 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: The overall framework of our method, which mainly consists of four steps. We automatically construct supervision information for training our routing model.
  • Figure 2: The data scaling result of our method. For Best1Outof3 and Best2Outof3, we explicitly indicate the specific search strategy in parentheses. The values are obtained by averaging 5 random experiments.
  • Figure 3: The evaluation result of the generalization abilities of different methods. We report the Top1-Predicted result. Values are calculated by averaging 5 randomized experiments. We try three different types of training examples on five test sets. "$A \rightarrow B$" means we train the model on $A$ and evaluate the model on $B$. For example, "CDQA $\rightarrow$ WebQA" means we train the model on CDQA and evaluate the model on WebQA. Therefore, "CDQA $\rightarrow$ CDQA" and "WebQA $\rightarrow$ WebQA" are actually in an i.i.d. setting.
  • Figure 4: The visualization of the high-frequency words. The words are translated from Chinese. We leave the original version in Appendix \ref{app:zh_visual}.
  • Figure 5: The prompt for response generation. The prompt is translated from Chinese.
  • ...and 2 more figures