Return of EM: Entity-driven Answer Set Expansion for QA Evaluation

Dongryeol Lee, Minwoo Lee, Kyungmin Min, Joonsuk Park, Kyomin Jung

TL;DR

This paper tackles the challenge of evaluating QA models with a method that is as reliable as LLM-based evaluators but more interpretable and cost-effective. It introduces Soft EM with entity-driven answer set expansion, which expands the gold answers by exploiting entity-type-specific surface-form patterns through few-shot in-context learning. The approach is more reliable than lexical metrics and competitive with model-based evaluators, while reducing inference cost and environmental impact; it also improves interpretability, because each judgment can be traced to an explicit match against a gold surface form. The work suggests a scalable, transparent QA evaluation paradigm, especially where compute budgets or environmental impact are a concern.

Abstract

Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.
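The idea in the abstract can be illustrated with a minimal sketch. It assumes that soft EM counts a prediction as correct when any normalized gold answer appears as a contiguous word sequence inside the normalized prediction, and that normalization follows the common SQuAD-style recipe; the paper may define both differently. The function names `normalize` and `soft_em` and the example answers are hypothetical, and the expanded set is written by hand here rather than generated by an LLM as in the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def soft_em(prediction: str, gold_answers: list[str]) -> bool:
    """Return True if any gold answer occurs as a whole-word span in the prediction."""
    # Pad with spaces so that matches respect word boundaries.
    padded_prediction = f" {normalize(prediction)} "
    for gold in gold_answers:
        normalized_gold = normalize(gold)
        if normalized_gold and f" {normalized_gold} " in padded_prediction:
            return True
    return False


# Hypothetical example: a date entity with several plausible surface forms.
prediction = "It was signed on 4 July 1776."
original_answers = ["July 4, 1776"]
# Hand-written stand-in for the expanded answer set (assumption, not paper output).
expanded_answers = original_answers + ["4 July 1776", "July 4th, 1776"]

missed_by_original = not soft_em(prediction, original_answers)
matched_by_expanded = soft_em(prediction, expanded_answers)
```

With the original answer set the prediction is rejected because the day and month appear in a different order; the expanded set contains the matching surface form, and the matched string itself serves as the explanation for the judgment.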

Paper Structure (23 sections, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Illustration of our method. We expand the original answer set based on the entity type to include plausible surface forms for that type. When Soft EM is applied with the expanded gold answer set, the QA model's prediction is correctly evaluated as right.
  • Figure 2: Comparison of model-based evaluation and ours in terms of inference calls. As the number of experiments increases, the inference calls for Insteval grow linearly, whereas our method maintains a constant number of inference calls.
  • Figure 3: Average accuracy against human labels across five QA models, using different answer set expansion methods. We separately report the accuracy based on entity types: Numeric, Non-numeric, and N/A.
  • Figure 4: Average accuracy against human labels across five QA models, using the original answer set and our methods. We separately report the accuracy based on the rarity of the entity, which is measured by the number of its relevant documents in DBpedia (Kandpal et al., 2023).
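
The scaling behaviour described in the Figure 2 caption can be sketched with simple arithmetic. The sketch assumes that a model-based evaluator makes one LLM call per question in every experiment, and that answer set expansion makes one LLM call per question once, after which the expanded sets are reused. The function names and the dataset size are hypothetical and are not taken from the paper.

```python
def model_based_calls(n_questions: int, n_experiments: int) -> int:
    """LLM calls if every prediction in every experiment is judged by an LLM."""
    return n_questions * n_experiments


def expansion_calls(n_questions: int, n_experiments: int) -> int:
    """LLM calls if gold answers are expanded once and reused in all experiments."""
    # n_experiments is accepted for symmetry but does not affect the count.
    return n_questions


n_questions = 1000  # hypothetical dataset size
for n_experiments in (1, 5, 10):
    print(
        n_experiments,
        model_based_calls(n_questions, n_experiments),
        expansion_calls(n_questions, n_experiments),
    )
```

Under these assumptions the first count grows linearly with the number of experiments while the second stays fixed at the dataset size.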