On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems

Juraj Vladika, Florian Matthes

TL;DR

Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating external context to improve factuality and timeliness. The paper systematically studies how context size, retriever type (BM25 vs. semantic search), and the choice of base LLM affect long-form QA on BioASQ and QuoteSum, including an open-domain retrieval setting. QA performance improves up to about 15 snippets and then saturates or declines; different general-purpose LLMs perform best in the biomedical domain than in the encyclopedic one; and open-domain retrieval over large corpora remains challenging. These findings offer practical guidelines for deploying RAG systems and highlight the need to balance precision, coverage, and noise in large-scale open-domain retrieval.

Abstract

Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of the provided context and the choice of base LLM and retrieval method. To help guide the development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that different general-purpose LLMs excel in the biomedical domain than in the encyclopedic one, and that open-domain evidence retrieval in large corpora is challenging.
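To make the retrieve-then-generate step and the role of context size concrete, here is a minimal sketch in Python. It is not the authors' implementation: the tokenizer, the BM25 parameters `k1` and `b`, the prompt template, and the function names `bm25_scores`, `top_k_snippets`, and `build_prompt` are all illustrative assumptions. The parameter `k` stands in for the number of context snippets that the abstract varies.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a standard Okapi BM25 variant."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k_snippets(query, docs, k):
    """Return the k highest-scoring snippets; k is the context size."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]


def build_prompt(question, snippets):
    # Hypothetical prompt template; the paper's actual prompt is not given here.
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    corpus = [
        "aspirin reduces fever and pain",
        "the capital of france is paris",
        "ibuprofen treats pain and inflammation",
    ]
    question = "what treats pain"
    for k in (1, 2):
        print(build_prompt(question, top_k_snippets(question, corpus, k)))
        print("---")
```

Raising `k` passes more snippets to the model, which increases coverage but also admits lower-ranked and potentially noisy text. That is the trade-off the reported saturation around 15 snippets reflects.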

Paper Structure

This paper contains 27 sections, 1 figure, 9 tables.

Figures (1)

  • Figure 1: The influence of the number of context snippets passed to the RAG system on the final performance (entailment score) on the biomedical task BioASQ-QA. Performance improves steadily for all models, to differing extents, and then stagnates once it saturates.