ACE-2005-PT: Corpus for Event Extraction in Portuguese
Luís Filipe Cunha, Purificação Silvano, Ricardo Campos, Alípio Jorge
TL;DR
This work extends the ACE-2005 event extraction corpus to Portuguese by automatically translating the English source into European and Brazilian Portuguese and transferring the annotations through a multi-technique alignment pipeline. The pipeline combines lemmatization, fuzzy matching, synonym matching, multiple translations, and a BERT-based word aligner to recover annotation spans in the translated sentences. Evaluated against a subset of annotations manually aligned by an expert linguist, the pipeline achieves an exact match score of 70.55% and a relaxed match score of 87.55%. The approach is designed to be adaptable to other languages and corpora. The resulting ACE-2005-PT corpus has been accepted for publication by the LDC and enables event extraction research on Portuguese.
Abstract
Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the creation of ACE-2005-PT, we rely on automatic translators. This, however, makes it challenging to automatically identify the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To address this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by an expert linguist. This subset was then compared against our pipeline results, which achieved exact and relaxed match scores of 70.55% and 87.55%, respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by the LDC.
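
To make the fuzzy matching step and the two match scores more concrete, the following is a minimal Python sketch. It is not the pipeline used to build ACE-2005-PT. It assumes that a translated (or lemmatized) annotation string is searched for in the tokenized translated sentence, that similarity is measured with the standard library's difflib.SequenceMatcher, and that a relaxed match means the predicted and gold spans share at least one token. The function names, the 0.6 threshold, and the relaxed-match definition are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_align(annotation, target_tokens, threshold=0.6):
    """Return the (start, end) token span most similar to `annotation`, or None.

    Hypothetical helper: scans every contiguous token window of the
    translated sentence and keeps the one with the highest similarity ratio.
    """
    best = None
    query = annotation.lower()
    for start in range(len(target_tokens)):
        for end in range(start + 1, len(target_tokens) + 1):
            window = " ".join(target_tokens[start:end]).lower()
            score = SequenceMatcher(None, query, window).ratio()
            if best is None or score > best[1]:
                best = ((start, end), score)
    # Reject the best window if it is still too dissimilar (assumed threshold).
    if best is None or best[1] < threshold:
        return None
    return best[0]


def spans_overlap(pred, gold):
    """Assumed relaxed criterion: the two half-open spans share a token."""
    return pred is not None and pred[0] < gold[1] and gold[0] < pred[1]


def match_scores(predicted, gold):
    """Return (exact, relaxed) match rates over paired span lists."""
    if not gold:
        raise ValueError("gold must contain at least one span")
    exact = sum(p == g for p, g in zip(predicted, gold))
    relaxed = sum(spans_overlap(p, g) for p, g in zip(predicted, gold))
    return exact / len(gold), relaxed / len(gold)


# Usage: locate the lemma "atacar" in an inflected Portuguese sentence.
tokens = "Os soldados atacaram a aldeia ontem".split()
span = fuzzy_align("atacar", tokens)
print(span, tokens[span[0]:span[1]])

# Score three predicted spans against manually aligned gold spans.
print(match_scores([(2, 3), (0, 2), None], [(2, 3), (1, 2), (4, 5)]))
```

The exhaustive window scan is quadratic in sentence length, which is acceptable for single sentences. A pipeline of the kind described above would presumably fall back to other techniques, such as a word aligner, when no window passes the threshold.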
