Is Relevance Propagated from Retriever to Generator in RAG?
Fangzheng Tian, Debasis Ganguly, Craig Macdonald
TL;DR
The paper asks whether the topical relevance of retrieved documents in a RAG setup translates into knowledge that is useful for the downstream task. It introduces an IR-based evaluation framework that treats context usefulness as a counterfactual utility $U(\theta(q)_k)$, i.e., the relative improvement in downstream performance when the retrieved context is included in the prompt, compared with the 0-shot setting. Using an IR test collection (MS MARCO and TREC-DL), it compares lexical (BM25) and neural (MonoT5) retrievers across $k$-shot context sizes and correlates retrieval relevance, measured by $nDCG@k$, with generated-answer quality, measured by $F_{\text{BERT}}$. The study finds a small positive correlation between relevance and utility that weakens as context size grows, and more effective retrievers yield higher downstream performance. These results suggest that relevance is not fully transmitted through the generator, which motivates future task-aware retrieval models. The framework is released as publicly available code to support reproducible evaluation of RAG systems.
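The counterfactual utility described above can be illustrated with a minimal Python sketch. The exact normalisation is not spelled out in this summary, so the sketch assumes the relative gain is the score difference divided by the 0-shot score; the function name `relative_utility` and the `eps` guard are hypothetical choices made for illustration only.

```python
def relative_utility(score_with_context: float,
                     score_zero_shot: float,
                     eps: float = 1e-9) -> float:
    """Relative gain of a k-shot answer over the 0-shot answer.

    Both scores are assumed to be answer-quality values (for example an
    F_BERT score) for the same query. The division by the 0-shot score is
    an assumption about how "relative improvement" is normalised.
    """
    # eps guards against division by zero when the 0-shot score is 0.
    return (score_with_context - score_zero_shot) / max(score_zero_shot, eps)


# Toy, made-up scores: context helps in the first case and hurts in the second.
gain = relative_utility(0.66, 0.60)
loss = relative_utility(0.54, 0.60)
```

Under this assumption, a positive value means the retrieved context improved the answer, a negative value means it hurt, and zero means the context made no measurable difference.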
Abstract
Retrieval Augmented Generation (RAG) is a framework for incorporating external knowledge, usually in the form of a set of documents retrieved from a collection, as part of a prompt to a large language model (LLM) to potentially improve the performance of a downstream task, such as question answering. Unlike a standard retrieval task, whose objective is to maximise the relevance of a set of top-ranked documents, a RAG system aims to maximise their total utility, where the utility of a document indicates whether including it as part of the additional contextual information in an LLM prompt improves a downstream task. Existing studies investigate the role of the relevance of a RAG context for knowledge-intensive language tasks (KILT), where relevance essentially takes the form of answer containment. In contrast, in our work, relevance corresponds to topical overlap between a query and a document for an information-seeking task. Specifically, we make use of an IR test collection to empirically investigate whether a RAG context composed of topically relevant documents leads to improved downstream performance. Our experiments lead to the following findings: (a) there is a small positive correlation between relevance and utility; (b) this correlation decreases with increasing context sizes (higher values of k in k-shot); and (c) a more effective retrieval model generally leads to better downstream RAG performance.
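To make the relevance side of this comparison concrete, the following Python sketch computes nDCG@k for a ranked list and a correlation coefficient between two per-query score lists. It rests on assumptions not confirmed by the text above: the linear-gain form of DCG is used, and Pearson's coefficient stands in for whichever correlation measure the study reports. The function names are hypothetical.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # Rank i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(ranked_rels, judged_rels, k):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ranking.

    ranked_rels: relevance labels of the retrieved documents, in rank order.
    judged_rels: relevance labels of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant list; report 0
    return cov / math.sqrt(var_x * var_y)
```

Given one nDCG@k value and one utility value per query, `pearson(ndcg_values, utility_values)` yields a single coefficient for a fixed k; repeating this for several values of k is one way to observe how the relationship changes with context size.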
