Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, Wei Xu

TL;DR

Long$^2$RAG introduces a realism-grounded benchmark for evaluating long-context and long-form retrieval-augmented generation. By pairing 280 complex questions with 5 real-world retrieved documents per question and a novel Key Point Recall (KPR) metric, the work shifts evaluation toward how effectively models exploit external knowledge rather than merely generating fluent text. The dataset construction combines automated key-point extraction with human verification, yielding 2,055 ground-truth key points and enabling nuanced cross-model, cross-domain analyses. Across 9 LLMs, the study reveals consistent advantages for large API models like GPT-4o, demonstrates the adverse impact of long document lengths and truncation, and shows how generation length and evaluator choice influence KPR and related metrics, offering practical guidance for designing and assessing long-context RAG systems.

Abstract

Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
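
To make the idea behind KPR concrete, the following is a minimal Python sketch of a recall-style score over key points. It assumes the metric is averaged per question, i.e. for each question the fraction of its key points that the response covers is computed and these fractions are then averaged; the abstract does not spell out this aggregation. The names `toy_entails` and `key_point_recall` and the `key_points` / `response` dictionary layout are hypothetical, and the token-overlap check is only a stand-in for the evaluator, which in the paper may be another LLM.

```python
from typing import Callable, Dict, List


def toy_entails(key_point: str, response: str, threshold: float = 0.6) -> bool:
    """Hypothetical stand-in for the evaluator.

    Treats a key point as covered when at least `threshold` of its
    whitespace-separated tokens appear in the response. A real setup
    would replace this with an LLM or entailment-model judgment.
    """
    kp_tokens = set(key_point.lower().split())
    if not kp_tokens:
        return False
    resp_tokens = set(response.lower().split())
    return len(kp_tokens & resp_tokens) / len(kp_tokens) >= threshold


def key_point_recall(
    examples: List[Dict],
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Average, over questions, of the fraction of key points covered.

    Each example is assumed to hold `key_points` (list of str) and
    `response` (str). Questions without key points are skipped.
    """
    per_question = []
    for ex in examples:
        key_points = ex["key_points"]
        if not key_points:
            continue
        covered = sum(1 for kp in key_points if entails(kp, ex["response"]))
        per_question.append(covered / len(key_points))
    return sum(per_question) / len(per_question) if per_question else 0.0


if __name__ == "__main__":
    demo = [
        {
            "key_points": ["solar panels convert sunlight", "batteries store energy"],
            "response": "solar panels convert sunlight into electricity",
        },
        {"key_points": ["rain forms clouds"], "response": "rain forms clouds"},
    ]
    print(key_point_recall(demo))
```

Passing a different `entails` callable swaps in another evaluator without touching the aggregation, which mirrors the abstract's separation between key-point extraction and the judgment of whether a response incorporates each point.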

Paper Structure

This paper contains 31 sections, 3 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Illustration of the RAG and evaluation pipelines using KPR. We first extract the key points from the retrieved documents and compute the recall of these points in the response of the LLM with the help of an Evaluator (possibly another LLM), thereby enabling the evaluation of the response quality.
  • Figure 2: Overview of our dataset construction pipeline. The process comprises two main stages. In the first stage, we aim to generate uncontaminated questions by employing an LLM to filter questions from ELI5 and construct a seed question pool. New questions are then generated using two evolving techniques. In the second stage, a search engine is used to retrieve documents for the RAG pipeline, from which key points are then extracted automatically. Finally, we employ a human-LLM collaborative verification task that results in our final dataset.
  • Figure 3: Detailed information about our defined question categories, including definitions and examples.
  • Figure 4: KPR of LLMs on different domains.
  • Figure 5: KPR of LLMs on different question categories.
  • ...and 13 more figures