Evaluating Open-Domain Question Answering in the Era of Large Language Models

Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei

TL;DR

Open-domain QA evaluation still relies largely on lexical matching, which substantially underestimates true performance as models shift to generative outputs with greater semantic variation. The authors manually re-evaluate a range of models on a subset of NQ-open and compare the human judgments against automated semantic-similarity and prompting-based evaluators, revealing large gaps between lexical metrics and human judgments; under human evaluation, InstructGPT variants match or surpass prior top systems. They show that semantic-equivalence models and regex matching recover many, but not all, valid answer variations, and that automated evaluators break down on long-form and unattributable answers because they struggle to detect hallucinations. The work argues that human evaluation currently has no substitute and uses linguistic analysis of failure modes to inform more reliable evaluation of open-domain QA beyond exact-match metrics.

Abstract

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
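
To make the contrast between exact match (EM) and regex matching concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes the SQuAD-style answer normalization that is common in open-domain QA (lowercasing, removing punctuation and articles); the source does not confirm the exact normalization used. The function names and the toy question data are hypothetical and serve only as illustration.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def exact_match(candidate: str, gold_answers: Iterable[str]) -> bool:
    """True only if the normalized candidate equals some normalized gold answer."""
    cand = normalize(candidate)
    return any(cand == normalize(gold) for gold in gold_answers)


def regex_match(candidate: str, gold_patterns: Iterable[str]) -> bool:
    """True if any gold pattern is found anywhere in the candidate."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in gold_patterns)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from NQ-open or CuratedTREC.
    gold_answers = ["Barack Obama"]
    gold_patterns = [r"Barack\s+(\w+\s+)?Obama"]
    for answer in ["The Barack Obama.", "Barack Hussein Obama II"]:
        print(answer, exact_match(answer, gold_answers), regex_match(answer, gold_patterns))
```

In this toy case the longer, semantically equivalent answer fails EM but passes the regex check, which mirrors the kind of lexical matching failure the abstract describes. Neither check can tell whether a long-form answer contains a hallucinated claim, since both only look for surface strings.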

Paper Structure (31 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Examples of failures in open-domain QA evaluation. Top: Jicheng is a credible answer although it is not present in the list of gold answers. Existing automated evaluation mechanisms fail to identify it as correct. Bottom: A seemingly correct but unattributable answer from InstructGPT, for which automatic evaluation goes astray.
  • Figure 2: Accuracy of 12 open-domain QA models on the NQ-open subset of 301 questions using EM (purple points) and the three evaluation mechanisms (green points). For LLMs, the ranking of models under BEM and InstructGPT-eval is not consistent with human evaluation, while the rest of the models are ranked similarly under the two evaluation methods. InstructGPT (few-shot) outperforms other models only under human assessment.
  • Figure 3: Statistics of exact-match failure modes determined via our linguistic analysis
  • Figure 4: Percentage of high-level failure modes for each evaluation method on NQ-open.
  • Figure 5: Accuracy of several open-domain QA models on CuratedTREC 2002, computed via regex matching, along with the results of three evaluation mechanisms. Purple points represent the EM accuracy, and green points depict accuracy achieved via BEM, InstructGPT-eval, and human judgment. Classic statistical models from TREC QA 2002 are shown as orange stars. InstructGPT (few-shot) outperforms the best of these classic models only under human assessment.
  • ...and 2 more figures