Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, Xipeng Qiu
TL;DR
The paper identifies a gap in retrieval-augmented generation: current systems struggle with corpus-level reasoning, where the answer depends on aggregating information across an entire document collection. It introduces GlobalQA, a benchmark that evaluates global RAG on four task types (counting, extremum, sorting, and top-k extraction), and shows that existing methods perform poorly, with the strongest baseline reaching an F1 of only 1.51. To address this, it proposes GlobalRAG, a three-stage framework combining document-level retrieval, an LLM-driven filter, and task-specific aggregation tools, which reaches an F1 of 6.63 on Qwen2.5-14B. The results point to the importance of preserving document integrity, filtering out noisy documents, and applying symbolic computation to corpus-wide tasks. The benchmark and framework offer practical guidance for building global RAG systems for knowledge-intensive applications.
Abstract
Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA, the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving an F1 score of only 1.51. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves an F1 score of 6.63, compared with 1.51 for the strongest baseline, validating the effectiveness of our method.
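To make the four task types concrete, the following is a minimal sketch of what symbolic aggregation over an already filtered document set could look like. It is not the authors' implementation: the `Paper` record, the `filter_docs` helper, and the four aggregation functions are hypothetical names introduced for illustration, and a plain Python predicate stands in for the LLM-driven filter described in the abstract. The example query mirrors the "most cited papers in 2023" question above, on made-up data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List
import heapq


@dataclass(frozen=True)
class Paper:
    """Hypothetical structured record extracted from one retrieved document."""
    title: str
    year: int
    citations: int


def filter_docs(docs: Iterable[Paper], is_relevant: Callable[[Paper], bool]) -> List[Paper]:
    """Keep only relevant documents (assumption: stands in for an LLM-based filter)."""
    return [d for d in docs if is_relevant(d)]


def count(docs: List[Paper]) -> int:
    """Counting task: how many documents satisfy the query."""
    return len(docs)


def extremum(docs: List[Paper], key: Callable[[Paper], int], largest: bool = True) -> Paper:
    """Extremum task: the document with the largest (or smallest) key.

    Raises ValueError if docs is empty.
    """
    return max(docs, key=key) if largest else min(docs, key=key)


def sort_docs(docs: List[Paper], key: Callable[[Paper], int], descending: bool = True) -> List[Paper]:
    """Sorting task: order all matching documents by key."""
    return sorted(docs, key=key, reverse=descending)


def top_k(docs: List[Paper], key: Callable[[Paper], int], k: int) -> List[Paper]:
    """Top-k task: the k documents with the largest key, in descending order."""
    return heapq.nlargest(k, docs, key=key)


# Illustrative, made-up corpus.
corpus = [
    Paper("A", 2023, 120),
    Paper("B", 2022, 300),
    Paper("C", 2023, 45),
    Paper("D", 2023, 210),
]

# Filter first, then compute the answer symbolically rather than asking the LLM to count or rank.
papers_2023 = filter_docs(corpus, lambda p: p.year == 2023)
n_2023 = count(papers_2023)
most_cited = extremum(papers_2023, key=lambda p: p.citations)
ranked = sort_docs(papers_2023, key=lambda p: p.citations)
top_two = top_k(papers_2023, key=lambda p: p.citations, k=2)
```

The design point this sketch tries to capture is the separation of concerns: retrieval and filtering decide which documents count, while the final count, ranking, or selection is computed deterministically in code, so a single noisy document that passes the filter changes the answer in a traceable way.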
