Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning

Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, Xipeng Qiu

TL;DR

The paper identifies a gap in retrieval-augmented generation: current systems struggle with corpus-level reasoning across large document collections. It introduces GlobalQA, a benchmark that evaluates global RAG on four task types (Counting, Extremum, Sorting, Top-$k$), and shows that existing methods perform poorly, with the strongest baseline reaching an F1 of only 1.51. To address this, it proposes GlobalRAG, a three-stage framework combining document-level retrieval, an LLM-driven filter, and task-specific aggregation tools, which reaches an F1 of 6.63 on the Qwen2.5-14B model. The results point to three factors that matter for corpus-wide tasks: preserving document integrity, filtering noise, and applying symbolic computation. The dataset and framework offer practical guidance for building global RAG systems for knowledge-intensive applications.

Abstract

Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
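The four task types can be illustrated as plain symbolic operations over per-document records. The following is a minimal sketch, not the paper's implementation: it assumes each document that survives filtering has already been reduced to a hypothetical `(entity, value)` pair, and every function name and data value is invented for illustration.

```python
from typing import Iterable

# Hypothetical record: (entity name, numeric value) extracted from one document.
Record = tuple[str, float]


def count_matching(records: Iterable[Record], threshold: float) -> int:
    """Counting: number of records whose value is at or above the threshold."""
    return sum(1 for _, value in records if value >= threshold)


def extremum(records: Iterable[Record], largest: bool = True) -> Record:
    """Extremum: the record with the largest (or smallest) value."""
    pick = max if largest else min
    return pick(records, key=lambda r: r[1])


def sort_records(records: Iterable[Record], descending: bool = True) -> list[Record]:
    """Sorting: all records ordered by value (stable for ties)."""
    return sorted(records, key=lambda r: r[1], reverse=descending)


def top_k(records: Iterable[Record], k: int) -> list[Record]:
    """Top-k: the k records with the largest values."""
    return sort_records(records)[:k]


# Toy data, assumed for illustration only.
records = [("paper_a", 120.0), ("paper_b", 45.0), ("paper_c", 300.0), ("paper_d", 45.0)]

print(count_matching(records, 100.0))
print(extremum(records))
print(top_k(records, 2))
```

Computing these answers in code rather than asking the model to count or sort in free text is what the abstract refers to as precise symbolic computation. The answer is then only as good as the records passed in.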

Paper Structure

This paper contains 41 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Why a dense retriever fails on global queries. (a) Local query: the answer can be found in specific documents. The dense retriever ranks all documents and selects the top-k, which contain the relevant information. (b) Global query: the answer requires information from all documents, but the dense retriever returns only the top-k ranked documents and misses critical information scattered across the corpus.
  • Figure 2: GlobalQA benchmark overview: construction pipeline (left), task examples of varying complexity (top-right), and evaluation metrics (bottom-right).
  • Figure 3: Statistical analysis of the GlobalQA dataset. Left: distribution of task types. Middle: distribution of the number of documents per query. Right: keyword distribution.
  • Figure 4: F1/D-F1@20 trends of GlobalRAG and IRCoT (baseline) under different numbers of retrieval steps.
  • Figure 5: F1/D-F1@k trends of GlobalRAG and IRCoT (baseline) under different numbers of retrieved documents (Top-K).
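
The failure mode described in Figure 1 can be reproduced with a toy example. This is a sketch under stated assumptions: the corpus, the `score` field standing in for a dense retriever's relevance score, and the `relevant` flag are all hypothetical and do not come from the paper.

```python
def retrieve_top_k(scored_docs: list[dict], k: int) -> list[dict]:
    """Keep only the k highest-scoring documents, as a top-k retriever would."""
    return sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:k]


# Hypothetical corpus of 50 documents; every second one matches the query condition.
corpus = [
    {"id": i, "score": 1.0 - i * 0.01, "relevant": i % 2 == 0}
    for i in range(50)
]

# Ground truth for a counting query requires looking at the whole corpus.
true_count = sum(d["relevant"] for d in corpus)

# A top-k retriever only ever exposes k documents to the model.
retrieved = retrieve_top_k(corpus, k=10)
observed_count = sum(d["relevant"] for d in retrieved)

print(true_count, observed_count)
```

Even with a perfect ranking, the count computed from the retrieved subset is bounded by k, so a counting query over the full corpus cannot be answered correctly from a fixed top-k cut.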