LFQA-E: Carefully Benchmarking Long-form QA Evaluation

Yuchen Fan; Chen Lin; Xin Zhong; Shuo Zhang; Heng Zhou; Yuchen Zhang; Mingyu Liang; Chengxing Xie; Ermo Hua; Gang Chen; Zhizhou He; Cheng Huang; Ning Ding; Bowen Zhou

TL;DR

LFQA-E introduces a multilingual, reference-based benchmark for long-form QA evaluation, addressing the absence of grounded references and limited topic coverage in prior work. It compiles 1618 questions and 7323 pairwise comparisons across 15 topics in English and Chinese, with expert references and three comparison settings (human-human, human-model, model-model), enabling rigorous benchmarking of automatic metrics. Across 17 methods, no automatic metric matches human judgment, revealing substantial gaps in current evaluation approaches for dense, long-form responses. The paper analyzes failure modes and cross-language generalization, and shows that specialized tuned models and reinforcement-learning-based evaluation strategies can offer gains, guiding future development of LFQA evaluation pipelines. It also provides insights into data contamination, prompt design, and the tradeoffs between cost and accuracy for practical evaluation workflows.

Abstract

Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1618 questions and 7323 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods. The benchmark and code are available at https://github.com/YuchenFan48/LFQA-E.
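To make the benchmarking setup concrete, the sketch below shows one plausible way an automatic metric could be scored against human pairwise judgments: the metric rates both answers against the reference, and its preference is compared with the human label. The `Comparison` schema, the `token_overlap` toy metric, and the example data are all illustrative assumptions; they are not the LFQA-E data format or any of the 17 methods studied in the paper.

```python
from typing import Callable, NamedTuple, Sequence


class Comparison(NamedTuple):
    """Hypothetical record for one pairwise comparison (not the LFQA-E schema)."""
    question: str
    reference: str
    answer_a: str
    answer_b: str
    human_label: str  # "a" or "b": the answer preferred by human annotators


def token_overlap(reference: str, answer: str) -> float:
    """Toy reference-based metric: fraction of reference tokens present in the answer."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & ans_tokens) / len(ref_tokens)


def pairwise_accuracy(
    comparisons: Sequence[Comparison],
    metric: Callable[[str, str], float],
) -> float:
    """Share of comparisons where the metric prefers the same answer as the humans."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    correct = 0
    for c in comparisons:
        score_a = metric(c.reference, c.answer_a)
        score_b = metric(c.reference, c.answer_b)
        predicted = "a" if score_a >= score_b else "b"  # ties go to "a" (arbitrary choice)
        correct += predicted == c.human_label
    return correct / len(comparisons)


# Made-up examples; the third shows a keyword-stuffed answer fooling the toy metric.
EXAMPLES = [
    Comparison("When does water boil?",
               "water boils at 100 degrees celsius at sea level",
               "water boils at 100 degrees celsius", "water is wet", "a"),
    Comparison("What do mitochondria do?",
               "the mitochondria produce atp",
               "cells have parts", "mitochondria produce atp for the cell", "b"),
    Comparison("What is photosynthesis?",
               "photosynthesis converts light into chemical energy",
               "photosynthesis light chemical energy converts into",
               "plants turn sunlight into stored sugar", "b"),
]

if __name__ == "__main__":
    print(f"pairwise accuracy: {pairwise_accuracy(EXAMPLES, token_overlap):.3f}")
```

The third example illustrates, in miniature, why surface-overlap metrics can disagree with human judgment on information-dense answers: a response that merely repeats reference keywords outscores a correct paraphrase.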

Paper Structure (57 sections, 2 equations, 5 figures, 30 tables)


Introduction
Related Work
Development of LFQA
Evaluation of LFQA
Methodology
Overview
Reference-Based Evaluation
Difficult Comparisons
Diverse Benchmark
Data Processing
Data Collection
Human Response Collection
Model Response Generation
Human Annotation
Annotator Decision
...and 42 more sections

Figures (5)

Figure 1: Overview of LFQA-E. The left side shows the categories, sources, and three settings, illustrating the benchmark's diversity. The right side shows an example from LFQA-E.
Figure 2: Performance of different models in the three settings of LFQA-E.
Figure 3: Percentage of error types for LMs on the LFQA-E dataset.
Figure 4: The Cohen's Kappa Correlation Matrix in LFQA-E.
Figure 5: The annotation pipeline.
