More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG

Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky

TL;DR

The paper probes how the number of retrieved documents impacts RAG performance when the input token budget is fixed, separating multi-document processing from long-context length. It constructs controlled multi-hop QA datasets from MuSiQue and 2WikiMultiHopQA and creates document-count partitions by expanding or selecting documents to keep token length constant. Across six instruction-tuned models, increasing document count generally degrades performance (up to ~20%), with Qwen-2.5 showing robustness to document quantity. The findings inform practical RAG design by highlighting the need to balance document quantity with relevance and diversity, and they point to future work on multi-document processing techniques to mitigate cross-document conflicts.

Abstract

Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language Model (LLM) responses by leveraging relevant external documents during generation. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for most LLMs, reducing performance by up to 20%. However, Qwen2.5 maintained consistent results across increasing document counts, indicating better multi-document handling capability. Finally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .

Paper Structure

This paper contains 20 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We create various sets containing the same questions but differing in the number of distractor documents. Each set includes a multi-hop question, all of the supporting documents that contain the information needed to answer the question (pink), and a varying number of distractor documents (blue). We begin with the full-document version (left) and then reduce the number of documents while maintaining a fixed context size. When fewer documents are used, the remaining documents are extended so that concatenating them yields the same total length.
  • Figure 2: Increasing the number of retrieved documents can hurt performance. In retrieval setups with fixed context windows, adding more documents could reduce performance by up to 10 percent. Two models (Llama-3.3 and Gemma-2) showed worse performance, while Qwen-2.5 remained unaffected. The smaller versions of the LLMs (7–9B) show a similar trend to their larger counterparts, but the effect is weaker. The hues of the bars represent the number of retrieved documents.
  • Figure 3: The effects of adding unrelated documents. Adding irrelevant documents improves performance across models on MuSiQue, whereas on 2WikiMultiHopQA it significantly degrades performance.
  • Figure 4: Performance of previous model variants with increasing retrieved documents. We tested earlier model versions (Llama-3.1 and Qwen-2) in retrieval settings with fixed context windows while adding more documents. Our findings were consistent with the latest model versions. Llama-3.1 showed performance reductions of up to 10%, similar to Llama-3.3, while Qwen-2 remained unaffected, consistent with Qwen-2.5's behavior.
  • Figure 5: The effects of adding unrelated documents for previous variants. As with the latest variants, adding irrelevant documents improves performance across models on MuSiQue, whereas on 2WikiMultiHopQA it significantly degrades performance.
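
The fixed-length construction described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the `Doc` class, the `extra` field (continuation text assumed to be available for each document), the `build_context` function, the whitespace token count, and the round-robin padding strategy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str         # passage as it appears in the full-document setting
    extra: str        # hypothetical continuation text used to extend the passage
    supporting: bool  # True if the passage is needed to answer the question


def n_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for a real tokenizer.
    return len(text.split())


def build_context(docs: list[Doc], n_distractors: int, budget: int) -> list[str]:
    """Keep all supporting docs plus the first `n_distractors` distractors,
    then extend the kept docs until their total length equals `budget` tokens."""
    distractor_idx = [i for i, d in enumerate(docs) if not d.supporting][:n_distractors]
    keep = {i for i, d in enumerate(docs) if d.supporting} | set(distractor_idx)
    # Preserve the original ordering of the kept documents.
    ordered = [docs[i] for i in sorted(keep)]

    bodies = [d.text.split() for d in ordered]
    spare = [d.extra.split() for d in ordered]
    deficit = budget - sum(len(b) for b in bodies)
    if deficit < 0:
        raise ValueError("kept documents already exceed the token budget")

    # Round-robin padding: hand out one extension token per document per turn.
    i = 0
    while deficit > 0:
        if not any(spare):
            raise ValueError("not enough extension text to reach the budget")
        j = i % len(bodies)
        if spare[j]:
            bodies[j].append(spare[j].pop(0))
            deficit -= 1
        i += 1
    return [" ".join(b) for b in bodies]


# Toy data: 2 supporting and 4 distractor passages, 10 tokens each.
docs = [
    Doc(
        text=" ".join(f"d{k}w{t}" for t in range(10)),
        extra=" ".join(f"d{k}x{t}" for t in range(50)),
        supporting=(k in (1, 4)),
    )
    for k in range(6)
]
budget = sum(n_tokens(d.text) for d in docs)  # length of the full-document setting

for n in (4, 1, 0):
    context = build_context(docs, n_distractors=n, budget=budget)
    print(n, len(context), sum(n_tokens(c) for c in context))
```

Each line printed by the loop reports the number of distractors, the number of documents kept, and the total token count, which stays equal to the budget in every setting. Only the number of documents changes, which is the variable the paper isolates.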