Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Heyan Huang, Chen Xu, Xiaoyan Gao
TL;DR
The paper tackles the bias of token-level exact-match (EM) evaluation in event extraction by introducing RAEE, a semantic-level evaluation framework that uses LLMs as evaluation agents with an adaptive prompting mechanism. RAEE computes semantic precision and recall in context (P_c and G_r) and combines them into a semantic F1 score that aligns with human judgments. Across 14 models and 10 datasets, RAEE shows that EM substantially underestimates performance, especially for generative models and LLMs, and that switching from EM to RAEE often changes model rankings. The work also analyzes misjudgment and failure modes, compares LLM judgments with human judgments, and provides an evaluation toolkit, while outlining future work on open-domain evaluation and prompt sensitivity.
Abstract
Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous semantically correct cases. This reliance leads to a significant discrepancy between the performance of models as measured under exact match and their real performance. To address this problem, we propose a reliable and semantic evaluation framework for event extraction, named RAEE, which accurately assesses extraction results at the semantic level rather than the token level. Specifically, RAEE leverages large language models (LLMs) as evaluation agents, incorporating an adaptive mechanism to evaluate the precision and recall of triggers and arguments. Extensive experiments demonstrate that: (1) RAEE achieves a very strong correlation with human judgments; (2) after reassessing 14 models, including advanced LLMs, on 10 datasets, there is a significant performance gap between exact match and RAEE. The exact match evaluation significantly underestimates the performance of existing event extraction models, and in particular underestimates the capabilities of LLMs; (3) fine-grained analysis under RAEE evaluation reveals insightful phenomena worth further exploration. The evaluation toolkit of our proposed RAEE is publicly released.
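To make the gap between exact match and semantic-level scoring concrete, here is a minimal sketch, not the RAEE implementation. It computes precision, recall, and F1 over predicted and gold argument spans with a pluggable judge. The names `prf1`, `exact_match`, and `toy_semantic_match` are hypothetical, and `toy_semantic_match` is a crude token-containment rule standing in for the LLM evaluation agent, which is assumed here to return a yes/no equivalence decision.

```python
from typing import Callable, List, Tuple

# A judge decides whether a predicted span and a gold span count as a match.
Judge = Callable[[str, str], bool]


def exact_match(pred: str, gold: str) -> bool:
    """Token-level exact match: the strings must be identical."""
    return pred == gold


def toy_semantic_match(pred: str, gold: str) -> bool:
    """Hypothetical stand-in for an LLM judge (an assumption, not RAEE's
    prompt): two spans match when one span's lowercased tokens are a
    subset of the other's."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return p <= g or g <= p


def prf1(preds: List[str], golds: List[str], judge: Judge) -> Tuple[float, float, float]:
    """Precision: share of predictions matched by some gold span.
    Recall: share of gold spans matched by some prediction."""
    matched_preds = sum(any(judge(p, g) for g in golds) for p in preds)
    matched_golds = sum(any(judge(p, g) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    golds = ["the president", "Baghdad"]
    preds = ["president", "Baghdad", "Tuesday"]
    print("exact match:", prf1(preds, golds, exact_match))
    print("toy semantic:", prf1(preds, golds, toy_semantic_match))
```

In this toy case the prediction "president" is rejected under exact match but accepted by the lenient judge, so F1 rises from about 0.4 to about 0.8. The containment rule is only an illustration. A real semantic judge would also need to reject spans that overlap lexically but differ in meaning.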
