CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

Sara Rosenthal, Avirup Sil, Radu Florian, Salim Roukos

TL;DR

CLAPNQ is a grounded long-form QA benchmark for the full RAG pipeline. It is derived from a subset of Natural Questions chosen so that each question has a long, cohesive answer anchored to a single gold passage. The benchmark provides a retrieval corpus and structured evaluation across retrieval, generation, and end-to-end RAG, and it exposes gaps in the faithfulness and conciseness of current models. The work reports strong retrieval baselines (notably E5-base variants) and a fine-tuned encoder–decoder model (CLAPnq-T5-lg) for grounded generation, while large decoder-only LLMs produce longer, less faithful outputs. Through human evaluation and cross-dataset analysis with ASQA, the paper identifies key challenges and directions for improving grounded LFQA systems in real-world RAG applications.

Abstract

Retrieval Augmented Generation (RAG) has become a popular application for large language models. It is preferable that successful RAG systems provide accurate answers that are supported by being grounded in a passage without any hallucinations. While considerable work is required for building a full RAG pipeline, being able to benchmark performance is also necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full passage, and cohesive, meaning that the answer is composed fluently, often by integrating multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. CLAPNQ is publicly available at https://github.com/primeqa/clapnq
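
The two answer properties named in the abstract, conciseness relative to the passage and cohesion across non-contiguous pieces, can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names, the naive sentence splitter, the whitespace tokenization, and the toy passage are all assumptions made for illustration, and none of the data comes from CLAPNQ.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter (assumption: adequate for this toy example only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def length_ratio(passage: str, answer: str) -> float:
    """Passage length divided by answer length, counted in whitespace tokens."""
    return len(passage.split()) / len(answer.split())


def selected_indices(passage: str, selected: list[str]) -> list[int]:
    """Positions of the annotator-selected sentences within the passage."""
    sentences = split_sentences(passage)
    return [sentences.index(s) for s in selected]


def is_contiguous(indices: list[int]) -> bool:
    """True when the selected sentences form one unbroken run in the passage."""
    ordered = sorted(indices)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


# Hypothetical toy data, invented for illustration (not taken from the dataset).
passage = (
    "The river rises in the northern hills. It was first mapped in 1820. "
    "The river flows south for 300 kilometres. Several towns lie along its banks. "
    "It empties into the sea near the old port."
)
selected = [
    "The river rises in the northern hills.",
    "It empties into the sea near the old port.",
]
answer = (
    "The river rises in the northern hills and empties into the sea "
    "near the old port."
)

ratio = length_ratio(passage, answer)
indices = selected_indices(passage, selected)
contiguous = is_contiguous(indices)
```

Here the answer fuses the first and last sentences of the passage into one fluent sentence, so the selected indices are not contiguous and the answer is markedly shorter than the passage. A real analysis would need a proper tokenizer and the dataset's own sentence-selection annotations.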

Paper Structure (23 sections, 5 figures, 17 tables)

Figures (5)

  • Figure 1: CLAPnq is designed to test all parts of the RAG pipeline: Retrieval, Generation with gold passages, and the full RAG setup with generation on retrieved passages.
  • Figure 2: The Round 1 annotation task for CLAPnq. The annotator had to select the title/sentences needed to answer the question, and then provide a concise answer.
  • Figure 3: The Round 2 annotation task for CLAPnq. The annotator had to verify and update the answer provided in Round 1 if needed. They also had to provide how they edited the answer.
  • Figure 4: The human evaluation task used to compare the model answers in random order. The individual questions per answer are shown here for one model.
  • Figure 5: The human evaluation task used to compare the model answers in random order. The head-to-head comparison for win-rate is shown here.