
Is Relevance Propagated from Retriever to Generator in RAG?

Fangzheng Tian, Debasis Ganguly, Craig Macdonald

TL;DR

The paper asks whether the topical relevance of documents retrieved in a RAG setup translates into usable knowledge for the downstream task. It introduces an IR-based evaluation framework that treats the usefulness of a context as a counterfactual utility $U(\theta(q)_k)$, i.e., the relative improvement in downstream performance when the retrieved context is included, compared with the 0-shot setting. Using an IR test collection (MS MARCO and TREC-DL), it compares lexical (BM25) and neural (MonoT5) retrievers across $k$-shot context sizes and correlates retriever relevance, measured by $nDCG@k$, with generated-answer quality, measured by $F_{\text{BERT}}$. The study finds a small positive correlation between relevance and utility that weakens as the context size grows, and more effective retrievers yield higher downstream performance. The proposed framework is demonstrated with publicly available code, which supports reproducible evaluation of RAG systems. The results suggest that relevance is not fully transmitted through the generator and motivate future task-aware retrieval models.
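The two quantities named in the TL;DR can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the linear-gain form of nDCG, the epsilon guard, and the example scores are not taken from the paper, whose exact normalisation may differ.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k graded relevance labels."""
    # Assumption: linear gain with a log2(rank + 1) discount (ranks start at 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], judged_gains: Sequence[float], k: int) -> float:
    """nDCG@k of a ranking, normalised by the ideal ordering of all judged labels."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_utility(score_k_shot: float, score_zero_shot: float, eps: float = 1e-12) -> float:
    """Relative gain of the k-shot downstream score over the 0-shot score.

    Assumption: utility = (P_k - P_0) / P_0, one plausible reading of
    "relative improvement versus 0-shot".
    """
    return (score_k_shot - score_zero_shot) / max(abs(score_zero_shot), eps)


if __name__ == "__main__":
    # Hypothetical labels and scores, for illustration only.
    print(ndcg_at_k([2, 0, 3], [3, 2, 1, 0], k=3))
    print(relative_utility(score_k_shot=0.66, score_zero_shot=0.60))
```

A positive utility means the retrieved context helped relative to answering without it; a negative value means the context hurt.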

Abstract

Retrieval Augmented Generation (RAG) is a framework for incorporating external knowledge, usually in the form of a set of documents retrieved from a collection, as a part of a prompt to a large language model (LLM) to potentially improve the performance of a downstream task, such as question answering. Different from a standard retrieval task's objective of maximising the relevance of a set of top-ranked documents, a RAG system's objective is rather to maximise their total utility, where the utility of a document indicates whether including it as a part of the additional contextual information in an LLM prompt improves a downstream task. Existing studies investigate the role of the relevance of a RAG context for knowledge-intensive language tasks (KILT), where relevance essentially takes the form of answer containment. In contrast, in our work, relevance corresponds to that of topical overlap between a query and a document for an information seeking task. Specifically, we make use of an IR test collection to empirically investigate whether a RAG context comprised of topically relevant documents leads to improved downstream performance. Our experiments lead to the following findings: (a) there is a small positive correlation between relevance and utility; (b) this correlation decreases with increasing context sizes (higher values of k in k-shot); and (c) a more effective retrieval model generally leads to better downstream RAG performance.
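Finding (a) concerns a correlation between per-query relevance and per-query utility. The sketch below shows one way such a correlation could be computed. Pearson's r is an assumption here, since the text above does not say which coefficient the paper reports, and the input lists are made-up toy values, not results from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy per-query values (hypothetical): nDCG@k of the context and its utility.
    ndcg_per_query = [0.0, 0.5, 1.0, 0.5]
    utility_per_query = [0.0, 0.1, 0.0, 0.1]
    print(pearson_r(ndcg_per_query, utility_per_query))
```

Repeating the computation for each value of k would show whether the coefficient shrinks as the context grows, which is the pattern described in finding (b).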

Paper Structure

This paper contains 1 section, 1 figure.

Table of Contents

  1. Introduction

Figures (1)

  • Figure 1: The illustration of the generator (LLM) with $k$ top-retrieved documents comprising a RAG context. Conducting experiments with an IR test collection enables the 'retriever' component to be evaluated with standard IR metrics using available ground truth of known relevance assessments. The 'generator' component is evaluated by computing the similarity of a generated answer's embedding w.r.t. the embeddings of the judged relevant documents (shown as the green triangles). For instance, generated answer $\mathbf{a}$ has a shorter distance to its closest relevant document than $\mathbf{a}'$ does, so $\mathbf{a}$ gets a higher $F_{\text{BERT}}$ than $\mathbf{a}'$.
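The comparison between $\mathbf{a}$ and $\mathbf{a}'$ in the caption can be mimicked with a simplified stand-in: score an answer by its highest cosine similarity to any judged relevant document. This is not the $F_{\text{BERT}}$ computation itself. It assumes one embedding vector per text, and the vectors and function names below are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def closest_relevant_similarity(
    answer_vec: Sequence[float],
    relevant_doc_vecs: Sequence[Sequence[float]],
) -> float:
    """Similarity of the answer to its nearest judged relevant document."""
    if not relevant_doc_vecs:
        raise ValueError("at least one relevant document is required")
    return max(cosine(answer_vec, doc) for doc in relevant_doc_vecs)


if __name__ == "__main__":
    # Hypothetical 2-d embeddings: two relevant documents and two answers.
    relevant_docs = [[1.0, 0.0], [0.8, 0.6]]
    answer_a = [0.9, 0.1]        # points in nearly the same direction as the first document
    answer_a_prime = [0.0, 1.0]  # points away from both documents
    print(closest_relevant_similarity(answer_a, relevant_docs))
    print(closest_relevant_similarity(answer_a_prime, relevant_docs))
```

Under this simplification, the answer closer to a relevant document receives the higher score, mirroring the ordering of $\mathbf{a}$ and $\mathbf{a}'$ in the figure.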