TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction

Kuan-Hao Huang, I-Hung Hsu, Tanmay Parekh, Zhiyu Xie, Zixuan Zhang, Premkumar Natarajan, Kai-Wei Chang, Nanyun Peng, Heng Ji

TL;DR

TextEE addresses pervasive evaluation shortcomings in event extraction (EE) by delivering the first standardized, fair, and reproducible benchmark. It consolidates 16 datasets from eight domains with five standardized splits, re-implements 14 recent EE methods, and benchmarks five LLMs to expose gaps between reported and actual performance. The work introduces the AI+ and AC+ metrics to better capture argument attachment quality and provides a comprehensive reevaluation across end-to-end EE, event detection, and event argument extraction. It also discusses the role of event extraction in the current NLP era, outlining challenges in generalization, domain expansion, and efficiency for future research.

Abstract

Event extraction has gained considerable interest due to its wide-ranging applications. However, recent studies draw attention to evaluation issues, suggesting that reported scores may not accurately reflect the true performance. In this work, we identify and address evaluation challenges, including inconsistency due to varying data assumptions or preprocessing steps, the insufficiency of current evaluation frameworks that may introduce dataset or data split bias, and the low reproducibility of some previous approaches. To address these challenges, we present TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains and includes 14 recent methodologies, conducting a comprehensive benchmark reevaluation. We also evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance. Inspired by our reevaluation results and findings, we discuss the role of event extraction in the current NLP era, as well as future challenges and insights derived from TextEE. We believe TextEE, the first standardized comprehensive benchmarking tool, will significantly facilitate future event extraction research.


Paper Structure

This paper contains 45 sections, 1 figure, and 13 tables.

Figures (1)

  • Figure 1: An example of a Justice-Execution event. One trigger span (execution) and two argument roles, Indonesia (Agent) and convicts (Person), are identified.
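
To make the structure in Figure 1 concrete, the following is a minimal sketch of how such an event could be represented in code. The class names (`Event`, `Argument`) and the `argument_tuples` helper are hypothetical and chosen only for illustration; they are not TextEE's actual data format. Token offsets are omitted because the source sentence is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Argument:
    """One event argument: a text mention plus the role it plays."""
    text: str
    role: str


@dataclass
class Event:
    """Hypothetical container for one event (not the TextEE schema)."""
    event_type: str
    trigger: str
    arguments: List[Argument] = field(default_factory=list)

    def argument_tuples(self) -> List[Tuple[str, str, str, str]]:
        # Pair every argument with its event type and trigger, so that
        # each tuple records which trigger the argument is attached to.
        return [
            (self.event_type, self.trigger, arg.text, arg.role)
            for arg in self.arguments
        ]


# The event described in the Figure 1 caption.
figure1_event = Event(
    event_type="Justice-Execution",
    trigger="execution",
    arguments=[
        Argument(text="Indonesia", role="Agent"),
        Argument(text="convicts", role="Person"),
    ],
)

for item in figure1_event.argument_tuples():
    print(item)
```

Keeping the trigger inside each argument tuple is one plausible way to make argument-to-trigger attachment explicit, which is the kind of information the AI+ and AC+ metrics are said to capture.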