An Empirical Study of Evaluating Long-form Question Answering

Ning Xian, Yixing Fan, Ruqing Zhang, Maarten de Rijke, Jiafeng Guo

TL;DR

This work addresses the challenge of evaluating long-form question answering (LFQA) by comparing traditional deterministic metrics with modern LLM-based evaluators across ASQA, ANTIQUE, and WikiEval datasets using seven diverse LLMs. It combines human annotations (correctness and informativeness) with meta-evaluations (correlations, win rates, and agreement) to assess accuracy, robustness, and fairness of metrics, including prompt and hyper-parameter perturbations. The findings show that LLM-based evaluators, especially fine-grained GPT-4o-based judgments, align more closely with human judgments and are more robust across LFQA types, though they suffer from biases such as length effects and self-reinforcement. The study proposes leveraging a mix of metrics and carefully designed prompting strategies to improve reliability, emphasizes dataset-specific considerations, and provides open-source code and data for reproducibility and further research.

Abstract

Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram matching, while the reliability of large language model-based evaluations for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation with the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluations? (ii) What are the limitations of existing evaluation metrics compared to human evaluations? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. We then investigate the performance of automatic evaluation metrics on these answers and analyze the consistency between these metrics and human evaluations. We find that the style and length of the answers, as well as the category of questions, can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue for some metrics. Our findings have important implications for the use of large language models for evaluating long-form question answering. All code and datasets are available at https://github.com/bugtig6351/lfqa_evaluation.
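
To make "consistency between these metrics and human evaluations" concrete, the following is a minimal sketch of two common meta-evaluation quantities: Spearman rank correlation and pairwise ordering agreement. It is an illustration under stated assumptions, not the authors' implementation; the function names and the toy scores are hypothetical, and the scripts in the linked repository may compute these quantities differently.

```python
from itertools import combinations


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(rank(x), rank(y))


def pairwise_agreement(metric, human):
    """Share of answer pairs (excluding human ties) that the metric orders like the humans."""
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        if human[i] == human[j]:
            continue  # no human preference for this pair
        total += 1
        if (metric[i] - metric[j]) * (human[i] - human[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical toy scores for five answers (not data from the paper).
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.1, 0.4, 0.3, 0.8, 0.9]

print(spearman(human_scores, metric_scores))
print(pairwise_agreement(metric_scores, human_scores))
```

On this toy input the metric swaps one adjacent pair of answers, so both quantities come out at about 0.9. Rank-based measures are a natural fit here because human ratings are ordinal, while automatic metrics live on arbitrary scales.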

Paper Structure

This paper contains 21 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Interface used for collecting human annotations.
  • Figure 2: Score distribution of different prompts on ASQA.
  • Figure 3: Relationship between answer length and metrics.
  • Figure 4: Relationship between different metrics and human evaluations across different question types on the ANTIQUE dataset (left: Informativeness, right: Correctness).
  • Figure 5: Relationship between the IDF of answer and different metrics.
  • ...and 1 more figure