PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

TL;DR

This work provides rubrics and datasets, adopted from the trivia community, for evaluating machine QA, and proposes an efficient and interpretable QA evaluation method that is more stable than exact match and neural methods.

Abstract

Question answering (QA) can only make progress if we know whether an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). Current short-form QA evaluations face two challenges: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with human judgments, but this expensive approach has only been tested on a limited set of QA datasets. We rectify these issues by providing rubrics and datasets, adopted from the trivia community, for evaluating machine QA. We also propose an efficient and interpretable QA evaluation method that is more stable than exact match and neural methods (BERTScore).
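
To make the difficulty with verbose answers concrete, the sketch below implements two widely used lexical metrics, exact match and token $F_1$, in the common SQuAD-style formulation. The normalization steps and the function names (`normalize`, `exact_match`, `token_f1`) are assumptions for illustration, not the paper's implementation.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate, reference):
    """True only if the normalized strings are identical."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate, reference):
    """Harmonic mean of token-level precision and recall."""
    cand = normalize(candidate).split()
    ref = normalize(reference).split()
    if not cand or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(cand == ref)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Eiffel Tower"
verbose = "It is the Eiffel Tower in Paris."
print(exact_match(verbose, reference))  # False, although the answer is right
print(token_f1(verbose, reference))     # 0.5, penalized for the extra words
```

A correct but wordy response fails exact match outright and loses half of its token $F_1$ score, which is the kind of behavior the abstract describes.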

Paper Structure

This paper contains 61 sections, 4 equations, 12 figures, 8 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Different evaluation methods require different amounts of computation on short-form and factoid QA datasets. Their pairwise ranking accuracies are based on our annotated data (see the pairwise ranking subsection).
  • Figure 2: The size of each circle shows a metric's human agreement accuracy, and the color shows its macro $F_1$ score. We put PEDANTS first for ease of visualization. EM, token $F_1$, and BERTScore have unstable human agreement across QA datasets, as the horizontal variation in circle size shows: agreement ranges from $25\%$ to $90\%$. RoBERTa, PEDANTS, and GPT4-Eval are more robust and stable across QA datasets. Although Prometheus 2 is fine-tuned for evaluation, it fails on short-form QA. PEDANTS is less costly than GPT-4, Prometheus 2, RoBERTa, and BERTScore, and has more stable human agreement across seven evaluation datasets than EM. Prometheus assigns scores from 1 to 5; we treat a score of 4 or higher as correct.
  • Figure 3: The pairwise ranking accuracy for datasets that have multiple model responses. TOTAL is the pairwise ranking accuracy across all six datasets. PEDANTS and GPT-Eval rank more models correctly than other methods on most datasets.
  • Figure 4: We list more AC pairs with judgments under our revised rules. Some candidate responses are manually written, some are from Jeopardy!, and some are generated by QA models such as Flan-T5.
  • Figure 5: Rule distribution over all annotated examples. Rule 3 (fewer details provided), Rule 4 (more details provided), and Rule 6 (irrelevant information) are the most common rules in our test datasets. There are fewer than 10 examples for Rule 5, so we use 0.01 to signify that some examples for Rule 5 still exist.
  • ...and 7 more figures
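
The Figure 2 caption mentions three quantities: a score threshold for Prometheus, human agreement accuracy, and macro $F_1$. The sketch below shows one plausible way to compute them for binary correctness judgments. It assumes that agreement accuracy is the fraction of examples where the metric's judgment matches the human label; the function names and the toy data are hypothetical and do not come from the paper.

```python
def binarize(scores, threshold=4):
    """Treat a 1-to-5 score at or above the threshold as 'correct'."""
    return [score >= threshold for score in scores]


def agreement_accuracy(predicted, human):
    """Fraction of examples where the metric matches the human label."""
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


def macro_f1(predicted, human):
    """Unweighted mean of the F1 scores of the two classes."""
    f1_scores = []
    for label in (True, False):
        pairs = list(zip(predicted, human))
        tp = sum(p == label and h == label for p, h in pairs)
        fp = sum(p == label and h != label for p, h in pairs)
        fn = sum(p != label and h == label for p, h in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data (hypothetical): five 1-to-5 scores and five human labels.
scores = [5, 4, 3, 1, 4]
human = [True, True, True, False, False]

predicted = binarize(scores)
print(agreement_accuracy(predicted, human))  # 0.6
print(round(macro_f1(predicted, human), 4))  # 0.5833
```

Macro $F_1$ weights the "correct" and "incorrect" classes equally, so it complements plain agreement accuracy when one class dominates a dataset.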