More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
TL;DR
The paper probes how the number of retrieved documents affects RAG performance when the input token budget is fixed, separating the difficulty of multi-document processing from that of long contexts. It constructs controlled multi-hop QA datasets from MuSiQue and 2WikiMultiHopQA and creates document-count partitions by expanding or selecting documents so that token length stays constant. Across six instruction-tuned models, increasing the document count generally degrades performance (by up to 20%), while Qwen2.5 remains robust to document quantity. The findings inform practical RAG design by highlighting the need to balance document quantity against relevance and diversity, and they point to future work on multi-document processing techniques that mitigate cross-document conflicts.
Abstract
Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language Model (LLM) responses by leveraging relevant external documents during generation. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for most LLMs, reducing performance by up to 20%. However, Qwen2.5 maintained consistent results across increasing document counts, indicating better multi-document handling capability. Finally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen
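To make the controlled setup concrete, the following is a minimal sketch of how a context with a fixed token budget could be assembled for different document counts. It is not the authors' released code: the `Doc` fields, the `extension` text used for padding, the whitespace tokenizer, and the even split of the budget are all assumptions made for illustration. The sketch also does not keep the position of relevant information fixed, which the study does control for.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str        # the passage as retrieved
    extension: str   # hypothetical extra text from the same source, used for padding
    relevant: bool   # True if the passage supports the answer


def tokens(text):
    # Assumption: whitespace tokens stand in for a real model tokenizer.
    return text.split()


def fit_to_length(doc, target):
    """Truncate or extend one document to `target` tokens (if enough text exists)."""
    toks = tokens(doc.text)
    if len(toks) < target:
        toks += tokens(doc.extension)[: target - len(toks)]
    return " ".join(toks[:target])


def build_partition(docs, n_docs, budget):
    """Keep all relevant docs, fill up with distractors, split the budget evenly."""
    relevant = [d for d in docs if d.relevant]
    distractors = [d for d in docs if not d.relevant]
    if n_docs < len(relevant):
        raise ValueError("n_docs must be at least the number of relevant documents")
    chosen = relevant + distractors[: n_docs - len(relevant)]
    per_doc = budget // len(chosen)
    return [fit_to_length(d, per_doc) for d in chosen]
```

Calling `build_partition(docs, 2, budget)` and `build_partition(docs, 4, budget)` on the same pool yields contexts with the same total length but different document counts, which is the variable the study isolates. A real pipeline would count tokens with the evaluated model's tokenizer and would need to ensure that truncation never removes the answer-bearing span.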
