
Event Extraction for Portuguese: A QA-driven Approach using ACE-2005

Luís Filipe Cunha, Ricardo Campos, Alípio Jorge

TL;DR

The paper addresses the lack of Portuguese resources for event extraction by translating the ACE-2005 corpus into Portuguese and developing a two-stage, QA-driven framework. Trigger extraction is modeled as token classification, while argument extraction uses extractive QA with question templates conditioned on the trigger; the Portuguese translation of ACE-2005 enables supervised learning for both tasks. The approach fine-tunes BERTimbau on the translated ACE-2005 data and leverages SQuAD-derived QA data (including a version with impossible answers). It achieves a trigger F1 of 64.4 and an argument F1 of 46.7 on the Portuguese test set, establishing a new baseline for Portuguese event extraction. The work also contributes a data-processing pipeline for translating annotated datasets, an open-source event extraction framework on Hugging Face, and a discussion of translation and alignment challenges that informs future improvements in cross-lingual event extraction.

Abstract

Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event's arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separate BERT-based models are fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. First, we use a token classification model to detect event triggers. Second, to extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event-annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese using an automatic translation pipeline that we developed, producing a new corpus for Portuguese event extraction. Our framework obtains F1 scores of 64.4 for trigger classification and 46.7 for argument classification, setting a new state-of-the-art reference for these tasks in Portuguese.
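
The second stage described above, where each detected trigger is queried once per argument role, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Trigger` class, the `EVENT_ROLES` table, the English question wording, and the `answer_fn` callable standing in for a fine-tuned QA model are hypothetical and are not taken from the paper, whose actual templates are not shown in this section. An empty answer plays the part of an "impossible answer", meaning the role is absent from the sentence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    """A trigger found by the first-stage token classifier (hypothetical type)."""
    text: str
    event_type: str


# Illustrative role inventory for one event type; an assumption, not the paper's table.
EVENT_ROLES: Dict[str, List[str]] = {
    "Conflict.Attack": ["Attacker", "Target", "Place"],
}


def build_questions(trigger: Trigger) -> List[Dict[str, str]]:
    """Create one question per argument role, conditioned on the trigger word."""
    roles = EVENT_ROLES.get(trigger.event_type, [])
    return [
        # Hypothetical template wording; the real templates may differ.
        {"role": role, "question": f"What is the {role} in {trigger.text}?"}
        for role in roles
    ]


def extract_arguments(
    sentence: str,
    trigger: Trigger,
    answer_fn: Callable[[str, str], str],
) -> Dict[str, str]:
    """Ask every role question against the sentence and keep non-empty answers.

    answer_fn(question, context) stands in for an extractive QA model and
    returns an empty string when the question has no answer in the context.
    """
    arguments: Dict[str, str] = {}
    for item in build_questions(trigger):
        answer = answer_fn(item["question"], sentence)
        if answer:
            arguments[item["role"]] = answer
    return arguments
```

In practice `answer_fn` would wrap whatever QA model is available; the sketch only fixes the control flow of one question per role and leaves the model itself out.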

Paper Structure (15 sections, 1 equation, 2 tables)