Accurate and Nuanced Open-QA Evaluation Through Textual Entailment
Peiran Yao, Denilson Barbosa
TL;DR
This work tackles shortcomings of Open-QA evaluation, notably question ambiguity and evaluators' limited semantic understanding, by proposing a learning-free textual entailment framework for judging Open-QA answers. It formalizes an Answer Hierarchy that classifies system outputs into sets such as $A_{sup}$, $A_{inf}$, and their union, enabling partial and bonus marks based on the inference gap between system and gold answers. Validated on the EVOUNA-derived splits of NaturalQuestions (NQ) and TriviaQA (TQ), the approach aligns more closely with human judgments (e.g., in AUROC and accuracy) than traditional lexical or prompt-based evaluators, and rivals finetuned evaluators when entailment is used as a feature. The method can outperform prompt-engineering baselines out of the box and offers a structured way to give partial credit for near-correct answers, though its applicability to more complex, multi-passage QA tasks remains future work. Overall, entailment-based evaluation is presented as a robust, generalizable approach to Open-QA benchmarking that could also serve as a training signal for model development.
Abstract
Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for the ambiguity of their questions and the lack of semantic understanding in their evaluators. More complex evaluators, which rely on foundation models or LLMs to judge semantic equivalence, still deviate from human judgments by a large margin. We propose to study the entailment relations between answers in order to identify system answers that are more informative or more general than the gold answer, yielding an evaluation much closer to human judgment on both NaturalQuestions and TriviaQA while remaining learning-free. The proposed entailment-based evaluation allows bonus or partial marks to be assigned by quantifying the inference gap between answers, enabling a nuanced ranking of answer correctness that achieves a higher AUC than current methods.
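The idea of comparing answers by entailment direction can be illustrated with a minimal sketch. It is not the authors' implementation. The names `classify`, `toy_entails`, and the four labels are hypothetical, and `toy_entails` is a crude token-overlap stand-in for a real entailment (NLI) model, used here only so the example runs on its own. The sketch assumes that a system answer which entails the gold answer is at least as informative as it, and that one entailed by the gold answer is more general.

```python
from typing import Callable

# Hypothetical interface: entails(premise, hypothesis) -> bool.
Entails = Callable[[str, str], bool]


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model (assumption, not a real entailment test):
    the premise 'entails' the hypothesis if it contains all of its tokens."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def classify(system: str, gold: str, entails: Entails = toy_entails) -> str:
    """Label a system answer by the direction of entailment with the gold answer."""
    forward = entails(system, gold)   # system answer implies the gold answer
    backward = entails(gold, system)  # gold answer implies the system answer
    if forward and backward:
        return "equivalent"
    if forward:
        return "more_informative"     # candidate for full or bonus marks
    if backward:
        return "more_general"         # candidate for partial marks
    return "unrelated"


if __name__ == "__main__":
    gold = "born in Paris France"
    for answer in ["born in Paris France", "born in Paris France in 1990",
                   "born in France", "born in London"]:
        print(f"{answer!r}: {classify(answer, gold)}")
```

In practice the `entails` argument would wrap an actual entailment model, and the labels would be mapped to scores that reflect the size of the inference gap rather than to four discrete classes.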
