RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, Vittorio Castelli

TL;DR

This work introduces Long-form RobustQA (LFRQA), a new dataset of human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative. It covers 26K queries and large corpora across seven different domains.

Abstract

Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
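
The preference figure quoted above (41.3% of the strongest LLM's answers preferred to LFRQA's) is a pairwise preference rate. The following is a minimal sketch of how such a rate could be computed from per-query judge verdicts. It is not the authors' evaluation code: the function name `preference_rates`, the label strings `"model"`, `"lfrqa"` and `"tie"`, and the assumption that a judge may declare a tie are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical verdict labels; the paper's actual label scheme may differ.
LABELS = ("model", "lfrqa", "tie")


def preference_rates(verdicts):
    """Return the fraction of pairwise comparisons won by each side.

    `verdicts` is an iterable with one label per query, stating which
    answer the judge preferred: the model's, LFRQA's, or neither.
    """
    counts = Counter(verdicts)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts supplied")
    # Counter returns 0 for labels that never occurred.
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Toy verdicts for four queries (illustrative data, not from the paper).
    print(preference_rates(["model", "lfrqa", "lfrqa", "tie"]))
```

Under these assumptions, the `"model"` entry of the returned dictionary plays the role of the reported preference rate, computed over all compared queries.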

Paper Structure (30 sections, 9 figures, 15 tables)

Figures (9)

  • Figure 1: LFRQA annotation example. There are three documents (some text removed for brevity) relevant to the query. We instruct annotators to combine RobustQA's answers into a coherent long-form answer, adding text if necessary. Citations [1], [2] and [3] indicate the supporting documents of each sentence.
  • Figure 2: LFRQA vs. RobustQA. Citations are removed from LFRQA's answers, and a few answer spans are removed for clarity. Green and orange text represents positive and negative opinions, respectively.
  • Figure 3: Distribution of the number (#) of documents used in LFRQA's answers. All numbers are percentages.
  • Figure 4: RAG-QA Arena framework. Green blocks are the LLM's inputs for generating answers. Orange blocks are the LLM's and LFRQA's answers, which are presented to both human and LLM judges to determine pairwise preferences.
  • Figure 5: Annotation Interface
  • ...and 4 more figures