LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs

Sihui Yang; Keping Bi; Wanqing Cui; Jiafeng Guo; Xueqi Cheng

TL;DR

This work proposes a novel listwise NFQA evaluation approach that uses LLMs to rank each candidate answer within a list of reference answers sorted by descending quality. The approach correlates significantly better with human annotations than automatic scores and common pointwise and pairwise approaches.

Abstract

Non-Factoid (NF) Question Answering (QA) is challenging to evaluate because potential answers are diverse and there is no objective criterion. Commonly used automatic evaluation metrics such as ROUGE and BERTScore cannot accurately measure semantic similarity or assess answers written from different perspectives. Recently, Large Language Models (LLMs) have been adopted for NFQA evaluation because of their compelling performance on various NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparison between answers. Inspired by the evolution from pointwise to pairwise to listwise methods in learning-to-rank, we propose a novel listwise NFQA evaluation approach that uses LLMs to rank candidate answers within a list of reference answers sorted by descending quality. Moreover, for NF questions that have no multi-grade golden answers, or no golden answers at all, we leverage LLMs to generate a reference answer list of varied quality to enable the listwise evaluation. Extensive experimental results on three NFQA datasets, i.e., ANTIQUE, TREC-DL-NF, and WebGLM, show that our method has significantly higher correlations with human annotations than automatic scores and common pointwise and pairwise approaches.
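The listwise idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names `build_listwise_prompt` and `listwise_rank`, and the `judge` callable (a stand-in for whatever LLM call is used) are all assumptions made for illustration. The sketch only shows the general shape: present the references best-first, ask for the position the candidate would take, and treat that position as the evaluation result.

```python
from typing import Callable, Sequence


def build_listwise_prompt(question: str, references: Sequence[str], candidate: str) -> str:
    """Assemble a listwise prompt (hypothetical wording, not the paper's instruction)."""
    lines = [f"Question: {question}", "Reference answers, best first:"]
    # References are assumed to be pre-sorted by descending quality.
    lines += [f"{i}. {ref}" for i, ref in enumerate(references, start=1)]
    lines.append(f"Candidate answer: {candidate}")
    lines.append(
        f"Reply with the rank (1 to {len(references) + 1}) "
        "the candidate would take in this list."
    )
    return "\n".join(lines)


def listwise_rank(
    question: str,
    references: Sequence[str],
    candidate: str,
    judge: Callable[[str], str],
) -> int:
    """Return the candidate's rank among the references (1 = best).

    `judge` is a hypothetical callable that sends a prompt to an LLM
    and returns its text reply; here it is expected to reply with an integer.
    """
    reply = judge(build_listwise_prompt(question, references, candidate))
    rank = int(reply.strip())
    # With n references there are n + 1 possible positions for the candidate.
    if not 1 <= rank <= len(references) + 1:
        raise ValueError(f"rank {rank} is outside 1..{len(references) + 1}")
    return rank
```

A lower rank means the candidate was placed closer to the best reference. Passing the judge as a callable keeps the sketch independent of any particular LLM client, so it can be exercised with a stub before a real model is connected.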

Paper Structure (30 sections, 5 equations, 11 figures, 10 tables)

Introduction
Related Work
Non-factoid Question Answering (QA)
Non-factoid QA Evaluation
Method
Preliminary
Listwise Ranking Evaluation (LINKAGE)
Reference List Construction
Multi-grade Ground Truth
Single-grade Ground Truth
Absence of Ground Truth
Experimental Settings
Datasets
Methods for Comparison
Baselines
...and 15 more sections

Figures (11)

Figure 1: Pointwise scoring evaluation, pairwise comparison evaluation and our LINKAGE evaluation approaches.
Figure 2: Comparison of Spearman Correlation for Mistral and ChatGPT on ANTIQUE and TREC-DL-NF. The error bars denote the standard deviation, illustrating the variability in the results.
Figure 3: An example of our LINKAGE compared with pointwise and pairwise approaches. We standardized the score range of all methods to $[0, 10]$ for easy comparison and understanding.
Figure 4: Instruction for pointwise scoring without references.
Figure 5: Instruction for pointwise scoring with references.
...and 6 more figures
