ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search

Shuai Wang, Shengyao Zhuang, Bevan Koopman, Guido Zuccon

TL;DR

ReSLLM tackles resource selection in federated search for Retrieval-Augmented Generation by using zero-shot prompting to rank resources and a SLAT protocol to generate synthetic relevance labels for fine-tuning. Resources are scored with $\mathrm{score}(q, r_i) = P(\text{yes} \mid q, r_i) - P(\text{no} \mid q, r_i)$ and ranked accordingly, while SLAT aggregates query-snippet judgments into 0–100 resource scores and trains ReSLLM with a contrastive loss. Empirically, zero-shot ReSLLM achieves competitive performance against unsupervised and embedding baselines, and the SLAT-tuned variant reaches performance on par with supervised baselines in several settings. This work demonstrates a practical, label-efficient pathway for effective resource selection in federated search, with meaningful implications for RAG pipelines and conversational AI.
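The scoring rule in the TL;DR can be illustrated with a minimal sketch. It assumes the probabilities of the "yes" and "no" answers have already been obtained from an LLM for each (query, resource) pair; the function names and the probability values below are hypothetical and only serve to show the ranking step.

```python
from typing import Dict, List, Tuple


def resource_score(p_yes: float, p_no: float) -> float:
    """score(q, r_i) = P(yes | q, r_i) - P(no | q, r_i)."""
    return p_yes - p_no


def rank_resources(probs: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Rank resources by descending score.

    `probs` maps a resource name to (P(yes), P(no)) for one fixed query.
    """
    scored = [
        (name, resource_score(p_yes, p_no))
        for name, (p_yes, p_no) in probs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical probabilities for a single medical query (illustrative only).
example_probs = {
    "PubMed": (0.80, 0.10),
    "Google": (0.55, 0.30),
    "arXiv": (0.20, 0.70),
}
ranking = rank_resources(example_probs)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Subtracting the "no" probability, rather than using the "yes" probability alone, penalises resources for which the model is confidently negative, so a resource with an uncertain answer ranks above one with a confident rejection.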

Abstract

Federated search, which involves integrating results from multiple independent search engines, will become increasingly pivotal in the context of Retrieval-Augmented Generation pipelines empowering LLM-based applications such as chatbots. These systems often distribute queries among various search engines, ranging from specialized (e.g., PubMed) to general (e.g., Google), based on the nature of user utterances. A critical aspect of federated search is resource selection: the selection of appropriate resources prior to issuing the query, to ensure high-quality and rapid responses and to contain the costs associated with calling the external search engines. However, current state-of-the-art (SOTA) resource selection methodologies primarily rely on feature-based learning approaches. These methods often involve the labour-intensive and expensive creation of training labels for each resource. In contrast, LLMs have exhibited strong effectiveness as zero-shot methods across NLP and IR tasks. We hypothesise that, in the context of federated search, LLMs can assess the relevance of resources without the need for extensive predefined labels or features. In this paper, we propose ReSLLM, a method that exploits LLMs to drive the selection of resources in federated search in a zero-shot setting. In addition, we devise an unsupervised fine-tuning protocol, Synthetic Label Augmentation Tuning (SLAT), in which the relevance of previously logged queries and snippets from resources is predicted using an off-the-shelf LLM and then used in turn to fine-tune ReSLLM for resource selection. Our empirical evaluation and analysis detail the factors influencing the effectiveness of LLMs in this context. The results showcase the merits of ReSLLM for resource selection: not only competitive effectiveness in the zero-shot setting, but also large gains when fine-tuned using the SLAT protocol.
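The label-generation step of SLAT can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes an off-the-shelf LLM has already returned a binary relevance judgment for each logged (query, snippet) pair, and it assumes a resource's 0–100 score (the range mentioned in the TL;DR) is simply the percentage of its snippets judged relevant. The aggregation actually used in the paper may differ, and all names below are hypothetical.

```python
from typing import Dict, List


def aggregate_resource_score(snippet_judgments: List[bool]) -> float:
    """Map per-snippet relevance judgments to a 0-100 resource score.

    Assumption: the score is the percentage of snippets judged relevant.
    A resource with no logged snippets receives a score of 0.
    """
    if not snippet_judgments:
        return 0.0
    return 100.0 * sum(snippet_judgments) / len(snippet_judgments)


def build_synthetic_labels(judgments: Dict[str, List[bool]]) -> Dict[str, float]:
    """Build synthetic resource labels for one logged query."""
    return {
        resource: aggregate_resource_score(per_snippet)
        for resource, per_snippet in judgments.items()
    }


# Hypothetical LLM judgments for the snippets each resource returned.
logged_judgments = {
    "PubMed": [True, True, False, True],
    "Google": [True, False, False, False],
}
labels = build_synthetic_labels(logged_judgments)
print(labels)
```

Labels of this kind could then serve as the training signal for fine-tuning, for example by treating higher-scored resources as positives and lower-scored ones as negatives for a given query.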

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: The use of ReSLLM for resource selection, including the SLAT protocol for fine-tuning.
  • Figure 2: Comparison of effectiveness with respect to model size, in terms of nDCG@20. Flan-based models are used across all settings. $*$ indicates statistically significant differences against the flan-large model (two-tailed paired t-test, $p<0.05$).
  • Figure 3: Comparison of effectiveness with respect to model architecture. $*$ indicates statistically significant differences against the flan-xl model (two-tailed paired t-test, $p<0.05$).
  • Figure 4: Comparison of effectiveness with respect to the resource representation, in terms of nDCG@20. $N$ indicates Name; $D$ indicates Description; $S$ indicates Similar snippet. flan-xl is used across all settings. $*$ indicates statistically significant differences against $N$ for each setting (two-tailed paired t-test, $p<0.05$).