FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation

Shuai Wang; Ekaterina Khramtsova; Shengyao Zhuang; Guido Zuccon

TL;DR

FeB4RAG addresses the mismatch between traditional federated search benchmarks and Retrieval Augmented Generation by introducing a federated search collection of 790 queries over 16 BEIR sub-collections, searched with dense retrievers and labelled with LLM-based relevance judgments. It provides relevance labels at both the result level and the search-engine level, plus an evaluation framework to compare naive and optimized federation strategies within RAG pipelines. The study reports substantial agreement between LLM judgments and human annotations and shows that resource-aware, high-quality federation improves RAG answer generation. The dataset and accompanying tooling enable systematic development and evaluation of federated search methods for conversational agents and enterprise information access.

Abstract

Federated search systems aggregate results from multiple search engines, selecting appropriate sources to enhance result quality and align with user intent. With the increasing uptake of Retrieval-Augmented Generation (RAG) pipelines, federated search can play a pivotal role in sourcing relevant information across heterogeneous data sources to generate informed responses. However, existing datasets, such as those developed in the past TREC FedWeb tracks, predate the RAG paradigm shift and lack representation of modern information retrieval challenges. To bridge this gap, we present FeB4RAG, a novel dataset specifically designed for federated search within RAG frameworks. This dataset, derived from 16 sub-collections of the widely used BEIR benchmarking collection, includes 790 information requests (akin to conversational queries) tailored for chatbot applications, along with the top results returned by each resource and associated LLM-derived relevance judgements. Additionally, to motivate the need for this collection, we demonstrate the impact on response generation of a high-quality federated search system for RAG compared to a naive approach to federated search, through a qualitative side-by-side comparison of the answers generated by the RAG pipeline. Our collection fosters and supports the development and evaluation of new federated search methods, especially in the context of RAG pipelines.

Paper Structure (24 sections, 1 equation, 8 figures, 5 tables)

Introduction
Limitation of Available Federated Search Collections
Dataset Creation
Search Engine Selection
User Requests Creation
Relevance Labelling
Search Result Preparation
Labelling Search Results
Labelling Search Engines
Analysis of Relevance Labelling
Labelling Statistics
LLM-based Labels vs. Human Annotations
Agreements between LLMs
Importance of each Resource
Importance of Resource Vertical
...and 9 more sections

Figures (8)

Figure 1: Architecture of Federated Search within RAG.
Figure 2: Cohen's Kappa between labels generated by the LLM and labels provided by humans (from the original datasets); red line indicates overall Kappa for all annotations.
Figure 3: Number of queries for which $n$ search engines contain relevant information; $n$ varies from 0 to 16.
Figure 4: Highest graded precision among all resources within a vertical, over 790 user requests.
Figure 5: Cohen's Kappa between two LLM annotators: solar-11b and lgs-13b; red line indicates overall Kappa for all annotations.
...and 3 more figures
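
Figures 2 and 5 report Cohen's Kappa between pairs of annotators (LLM vs. human, and LLM vs. LLM). As a reminder of how that statistic is computed, here is a minimal sketch in Python. It assumes the plain, unweighted form of Kappa over two equal-length label lists; the function name `cohens_kappa` and the example labels are illustrative only and are not taken from the paper, which may use a different variant or label scale.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; Kappa is
        # undefined (0/0), so we return 1.0 by convention in this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up binary relevance labels (1 = relevant, 0 = not relevant).
llm_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
kappa = cohens_kappa(llm_labels, human_labels)
```

Here the annotators agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so Kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.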
