Accurate and Nuanced Open-QA Evaluation Through Textual Entailment

Peiran Yao; Denilson Barbosa

TL;DR

This work addresses two shortcomings of Open-QA evaluation, question ambiguity and evaluators' limited semantic understanding, by proposing a learning-free textual entailment framework for judging Open-QA answers. It formalizes an Answer Hierarchy that classifies system answers into sets such as $A_{sup}$, $A_{inf}$, and their union, enabling partial and bonus marks based on the inference gap between system and gold answers. Validated on the EVOUNA-derived splits of NaturalQuestions (NQ) and TriviaQA (TQ), the approach aligns more closely with human judgments (e.g., in AUROC and accuracy) than traditional lexical or prompt-based evaluators, and rivals finetuned evaluators when entailment is used as a feature. The method also outperforms prompt-engineering baselines out of the box and offers a structured way to give partial credit for near-correct answers, though its applicability to more complex, multi-passage QA tasks remains future work. Overall, entailment-based evaluation offers a robust, generalizable approach to Open-QA benchmarking, and its scores could potentially serve as a training signal.

Abstract

Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for the ambiguity in questions and the lack of semantic understanding in evaluators. Complex evaluators, powered by foundation models or LLMs and pertaining to semantic equivalence, still deviate from human judgments by a large margin. We propose to study the entailment relations of answers to identify more informative and more general system answers, offering a much closer evaluation to human judgment on both NaturalQuestions and TriviaQA while being learning-free. The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers, enabling a nuanced ranking of answer correctness that has higher AUC than current methods.
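The AUC comparison in the abstract measures how well an evaluator's scores rank answers that humans accept above answers that humans reject. The following is a minimal sketch of that metric, assuming binary human labels and numeric evaluator scores; the function name `auroc` and the example scores are illustrative assumptions, not values from the paper.

```python
def auroc(scores, labels):
    """Probability that a random human-accepted answer is scored above a
    random human-rejected one; tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: four system answers judged by humans (1 = correct).
human_labels = [1, 1, 0, 0]
# An all-or-nothing evaluator versus one that can award partial marks.
binary_scores = [1, 0, 0, 0]
graded_scores = [1.0, 0.5, 0.5, 0.0]

print(auroc(binary_scores, human_labels))
print(auroc(graded_scores, human_labels))
```

In this toy case the graded evaluator separates one more accepted/rejected pair than the binary one, which is the kind of ranking gain that partial marks are meant to make possible.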

Paper Structure (23 sections, 1 figure, 17 tables)

This paper contains 23 sections, 1 figure, 17 tables.

Introduction
Related Work
The Answer Hierarchy
The answer hierarchy is a superior automated evaluator.
Although learning-free, entailment is comparable to finetuned evaluators.
Out-of-the-box entailment outperforms prompt engineering.
Towards Partial Marks
Conclusion
Entailment Test Implementation
Detailed Settings
Assessment of Reliability
Reliability of question-answer to statement conversion.
Reliability of textual entailment test.
Reliability of hierarchy construction.
Reliability of QA evaluation.
...and 8 more sections

Figures (1)

Figure 1: QA systems may generate a variety of correct answers that are neither exact matches nor semantic equivalents of the gold answer. Judging by the amount of information relevant to the gold answer that the system answers provide, we obtain a partial order of system answers with respect to the gold answer using textual entailment, and group answers into a hierarchy of subsets.
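The grouping described in the caption can be sketched by running an entailment test in both directions between the system answer and the gold answer. This is a simplified illustration, not the authors' implementation: the group names, the `classify` and `build_hierarchy` helpers, and the lookup-table entailment test `toy_entails` are all assumptions made for the example. In practice the entailment test would be a textual entailment model applied to statements derived from each question-answer pair.

```python
from typing import Callable, Dict, List

# An entailment test takes (premise, hypothesis) and returns True or False.
Entails = Callable[[str, str], bool]


def classify(system: str, gold: str, entails: Entails) -> str:
    """Place a system statement relative to the gold statement."""
    forward = entails(system, gold)   # system statement implies gold
    backward = entails(gold, system)  # gold statement implies system
    if forward and backward:
        return "equivalent"
    if forward:
        return "more_informative"
    if backward:
        return "more_general"
    return "unrelated"


def build_hierarchy(systems: List[str], gold: str,
                    entails: Entails) -> Dict[str, List[str]]:
    """Group system statements into subsets by their relation to gold."""
    groups: Dict[str, List[str]] = {
        "equivalent": [], "more_informative": [],
        "more_general": [], "unrelated": [],
    }
    for statement in systems:
        groups[classify(statement, gold, entails)].append(statement)
    return groups


# Hypothetical example; a hand-written table stands in for an entailment model.
GOLD = "The university is in Edmonton."
SPECIFIC = "The university is in Edmonton, Alberta."
GENERAL = "The university is in Canada."
WRONG = "The university is in Calgary."
KNOWN = {(SPECIFIC, GOLD), (GOLD, GENERAL), (SPECIFIC, GENERAL)}


def toy_entails(premise: str, hypothesis: str) -> bool:
    return premise == hypothesis or (premise, hypothesis) in KNOWN


hierarchy = build_hierarchy([GOLD, SPECIFIC, GENERAL, WRONG], GOLD, toy_entails)
for group, members in hierarchy.items():
    print(group, members)
```

How each group maps to full, bonus, or partial marks is deliberately left out of the sketch, since that depends on how the inference gap is quantified.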
