ULTRA: Unleash LLMs' Potential for Event Argument Extraction through Hierarchical Modeling and Pair-wise Self-Refinement
Xinliang Frederick Zhang, Carter Blum, Temma Choji, Shalin Shah, Alakananda Vempala
TL;DR
This work tackles document-level event argument extraction (DocEAE) with ULTRA, a hierarchical framework that first extracts candidate arguments from local text chunks and then filters them through self-refinement and boundary correction. A module called LEAFER corrects argument boundaries, and an optional ensemble, ULTRA+, adds a document-level extractor to capture reasoning over the full article. ULTRA achieves state-of-the-art EM and HM scores on DocEE at lower monetary cost than strong baselines, aided by calibrated pairwise ranking and an inverted-pyramid pruning strategy. It generalizes well with limited annotations, and its window size can be tuned to trade recall against precision, which makes it practical for cost-constrained deployments. The work advances the use of open-source LLMs for DocEAE and offers a blueprint for scalable, boundary-aware, document-wide information extraction.
Abstract
Structural extraction of events within discourse is critical because it enables a deeper understanding of communication patterns and behavior trends. Event argument extraction (EAE), at the core of event-centric understanding, is the task of identifying role-specific text spans (i.e., arguments) for a given event. Document-level EAE (DocEAE) focuses on arguments that are scattered across an entire document. In this work, we explore open-source Large Language Models (LLMs) for DocEAE and propose ULTRA, a hierarchical framework that extracts event arguments more cost-effectively and alleviates the positional bias intrinsic to LLMs. ULTRA sequentially reads the text chunks of a document to generate a candidate argument set, from which non-pertinent candidates are then dropped through self-refinement. We also introduce LEAFER to address the difficulty LLMs have in locating the exact boundary of an argument. ULTRA outperforms strong baselines, including supervised models and ChatGPT, by 9.8% when evaluated by Exact Match (EM).
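The chunk-then-refine flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `chunk_sentences`, `extract_candidates`, `pairwise_refine`, `window`, and `keep` are hypothetical, and the two callables that would be LLM calls in practice (`extract_fn` for local extraction, `prefer_fn` for pairwise comparison) are replaced here by toy string heuristics. Ranking by raw win counts is likewise an assumption that stands in for whatever scoring the actual system uses.

```python
from itertools import combinations
from typing import Callable, List

def chunk_sentences(sentences: List[str], window: int) -> List[str]:
    """Group consecutive sentences into non-overlapping chunks of `window` sentences."""
    return [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), window)]

def extract_candidates(sentences: List[str], window: int,
                       extract_fn: Callable[[str], List[str]]) -> List[str]:
    """Run the local extractor on every chunk; return unique candidates in first-seen order."""
    candidates: List[str] = []
    for chunk in chunk_sentences(sentences, window):
        for cand in extract_fn(chunk):
            if cand not in candidates:
                candidates.append(cand)
    return candidates

def pairwise_refine(candidates: List[str],
                    prefer_fn: Callable[[str, str], str], keep: int) -> List[str]:
    """Compare every pair of candidates and keep the `keep` with the most wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[prefer_fn(a, b)] += 1
    # sorted() is stable, so ties retain first-seen order.
    return sorted(candidates, key=lambda c: -wins[c])[:keep]

# Toy stand-ins for the two LLM calls (assumptions, for illustration only).
KNOWN_LOCATIONS = ["Lima, Peru", "Lima", "Peru"]

def toy_extract(chunk: str) -> List[str]:
    """Pretend extractor: return the known location strings that occur in the chunk."""
    return [loc for loc in KNOWN_LOCATIONS if loc in chunk]

def toy_prefer(a: str, b: str) -> str:
    """Pretend judge: prefer the longer (more specific) span; on a tie, the first."""
    return a if len(a) >= len(b) else b

SENTENCES = [
    "A quake hit Lima on Monday.",
    "Officials in Peru said 12 died.",
    "Lima, Peru was hit hardest.",
]

candidates = extract_candidates(SENTENCES, window=1, extract_fn=toy_extract)
refined = pairwise_refine(candidates, toy_prefer, keep=1)
print(candidates, refined)
```

In this sketch a smaller `window` means more chunks and more extractor calls, which tends to surface more candidates and leaves more work for the refinement step. This mirrors the recall and precision trade-off mentioned in the TL;DR.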
