Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models

Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Heyan Huang, Chen Xu, Xiaoyan Gao

TL;DR

The paper tackles the bias of token-level exact-match evaluation in event extraction by introducing RAEE, a semantic-level evaluation framework that uses LLMs as evaluation agents with an adaptive prompting mechanism. RAEE computes semantic precision and recall in context (P_c and G_r) to yield a semantic F1 score, aligning assessments with human judgments. Across 14 models and 10 datasets, RAEE reveals that EM substantially underestimates performance, especially for generative models and LLMs, and it often changes model rankings. The work also analyzes misjudgment and failure modes, assesses LLM judgments versus human judgments, and provides an evaluation toolkit, while outlining future work toward open-domain evaluation and prompt sensitivity considerations.

Abstract

Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous cases that are correct at the semantic level. This reliance leads to a significant discrepancy between the performance of models as measured under exact match criteria and their real performance. To address this problem, we propose a reliable and semantic evaluation framework for event extraction, named RAEE, which accurately assesses extraction results at the semantic level instead of the token level. Specifically, RAEE leverages large language models (LLMs) as evaluation agents, incorporating an adaptive mechanism to evaluate the precision and recall of triggers and arguments. Extensive experiments demonstrate that: (1) RAEE achieves a very strong correlation with human judgments; (2) after reassessing 14 models, including advanced LLMs, on 10 datasets, there is a significant performance gap between exact match and RAEE. The exact match evaluation significantly underestimates the performance of existing event extraction models, and in particular underestimates the capabilities of LLMs; (3) fine-grained analysis under RAEE evaluation reveals insightful phenomena worth further exploration. The evaluation toolkit of our proposed RAEE is publicly released.
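
To make the semantic precision, recall, and F1 concrete, here is a minimal Python sketch. It assumes that precision is the share of predictions judged semantically correct and that recall is the share of gold annotations judged semantically recalled; the paper's own equations are not reproduced in this summary, so these definitions are an assumption. The function name `semantic_f1` and its parameters are hypothetical and are not part of the released RAEE toolkit. In practice the two judgment lists would come from the LLM evaluation agent.

```python
from typing import List, Tuple


def semantic_f1(pred_judgments: List[bool],
                gold_judgments: List[bool]) -> Tuple[float, float, float]:
    """Combine per-item semantic judgments into precision, recall, and F1.

    pred_judgments[i] is True if the i-th prediction was judged semantically
    correct; gold_judgments[j] is True if the j-th gold annotation was judged
    semantically recalled. Both lists are assumed inputs (e.g. from an LLM judge).
    """
    # Precision: fraction of predictions judged correct (0.0 if no predictions).
    precision = sum(pred_judgments) / len(pred_judgments) if pred_judgments else 0.0
    # Recall: fraction of gold annotations judged recalled (0.0 if no golds).
    recall = sum(gold_judgments) / len(gold_judgments) if gold_judgments else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative, made-up judgments: 3 of 4 predictions correct, 4 of 5 golds recalled.
p, r, f1 = semantic_f1([True, True, True, False], [True, True, True, True, False])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Scoring precision over predictions and recall over gold annotations separately means a prediction that paraphrases a gold span can count toward both, which token-level exact match would reject.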

Paper Structure

This paper contains 32 sections, 2 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: One case of EAE is misjudged by exact match (EM) evaluation but reassessed as correct by our proposed RAEE evaluation. With a given trigger (walk), EAE aims to extract its arguments with semantic roles.
  • Figure 2: Evaluation process of RAEE, using the precision of EAE as an example (consistent with ED).
  • Figure 3: Distribution of the reasons for misjudgments by the EM evaluation method on the ED and EAE tasks.
  • Figure 4: Two examples of Unannotated Correct cases.
  • Figure 5: Distribution of failure modes under our proposed RAEE evaluation framework on ED and EAE.