Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, Wei Xu
TL;DR
Long$^2$RAG introduces a benchmark grounded in realistic retrieval conditions for evaluating long-context and long-form retrieval-augmented generation. By pairing 280 complex questions with 5 real-world retrieved documents per question and a novel Key Point Recall (KPR) metric, the work shifts evaluation toward how effectively models exploit external knowledge rather than whether they merely generate fluent text. The dataset construction combines automated key-point extraction with human verification, yielding 2,055 ground-truth key points and enabling cross-model and cross-domain analyses. Across 9 LLMs, the study finds consistent advantages for large API models such as GPT-4o, demonstrates the adverse impact of long documents and truncation, and shows how generation length and evaluator choice influence KPR and related metrics, offering practical guidance for designing and assessing long-context RAG systems.
Abstract
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
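
To make the KPR idea concrete, the following is a minimal sketch rather than the benchmark's implementation. The abstract does not state the formula, so two things here are assumptions: that KPR is computed as the fraction of a question's key points covered by the response, averaged over questions, and that coverage is decided by a pluggable judge. The function names `naive_judge` and `key_point_recall` and the toy data are hypothetical; in practice an evaluator model would take the place of the word-overlap judge.

```python
from typing import Callable, Dict, List


def naive_judge(key_point: str, response: str) -> bool:
    """Hypothetical stand-in judge (an assumption, not the paper's evaluator):
    a key point counts as covered if every one of its words occurs in the response."""
    response_words = set(response.lower().split())
    return all(word in response_words for word in key_point.lower().split())


def key_point_recall(
    key_points: Dict[str, List[str]],
    responses: Dict[str, str],
    judge: Callable[[str, str], bool] = naive_judge,
) -> float:
    """Assumed form of KPR: per-question fraction of covered key points,
    averaged over all questions that have at least one key point."""
    per_question = []
    for question_id, points in key_points.items():
        if not points:
            continue  # skip questions without ground-truth key points
        response = responses.get(question_id, "")
        covered = sum(1 for point in points if judge(point, response))
        per_question.append(covered / len(points))
    return sum(per_question) / len(per_question) if per_question else 0.0


# Toy data, invented purely for illustration.
example_key_points = {
    "q1": ["solar power", "wind power"],
    "q2": ["rates rose"],
}
example_responses = {
    "q1": "solar power is growing fast",
    "q2": "interest rates rose sharply",
}

if __name__ == "__main__":
    # q1 covers 1 of 2 key points, q2 covers 1 of 1, so the mean is 0.75.
    print(key_point_recall(example_key_points, example_responses))
```

Keeping the judge as a parameter means the same scoring loop can be reused with different evaluators, which is useful when comparing how evaluator choice changes the score.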
