Evaluating Open-Domain Question Answering in the Era of Large Language Models
Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei
TL;DR
Open-domain QA evaluation still relies on lexical matching, which increasingly underestimates true performance as models shift from extractive to generative answers with more semantic variation. The authors manually re-evaluate a range of models on a subset of NQ-open and compare automated semantic similarity and prompting-based judgments against human judgments. The comparison reveals large gaps between lexical metrics and human evaluation: under human evaluation, InstructGPT (zero-shot) is on par with prior top systems and InstructGPT (few-shot) surpasses them. Semantically equivalent answers account for more than half of lexical matching failures. Regex matching ranks models consistently with human judgments but is still unnecessarily strict, and automated evaluators fall short on long-form LLM answers, where they struggle to detect hallucinations. The work concludes that human evaluation currently has no substitute and advocates combining it with linguistic analysis to build more reliable open-domain QA benchmarks that go beyond exact-match metrics.
Abstract
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
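To make the failure mode concrete, the sketch below illustrates how lexical matching can reject a correct but longer answer, and how a regex-style match can be more lenient. It is a minimal illustration, not the authors' evaluation code: the normalization steps follow the commonly used SQuAD-style convention (lowercasing, stripping punctuation and articles), which is an assumption here, and the function names, the gold pattern, and the example question are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed SQuAD-style normalization; the paper's exact steps may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """True only if the normalized candidate equals some normalized gold answer."""
    cand = normalize_answer(candidate)
    return any(cand == normalize_answer(gold) for gold in gold_answers)


def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """True if any gold pattern occurs anywhere in the candidate (case-insensitive)."""
    return any(
        re.search(pattern, candidate, flags=re.IGNORECASE) is not None
        for pattern in gold_patterns
    )


# Hypothetical example: "Who wrote Catch-22?" with one gold answer.
gold = ["Joseph Heller"]
short_answer = "Joseph Heller"
long_answer = "Catch-22 was written by the American author Joseph Heller."

short_em = exact_match(short_answer, gold)   # extractive-style answer matches
long_em = exact_match(long_answer, gold)     # correct long-form answer is rejected
long_rx = regex_match(long_answer, [r"joseph\s+heller"])  # substring-style match accepts it
```

Under these assumptions, the short answer passes exact match while the equally correct long-form answer fails it, which is the kind of underestimation the abstract describes. The regex variant accepts the long answer, but it would also accept a response that mentions the gold string inside an otherwise wrong or hallucinated statement, so it does not replace human judgment.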
