CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature
Julien Delaunay, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Georgeta Bordea, Mathilde Ducos, Nicolas Sidere, Antoine Doucet, Senja Pollak, Olivier De Viron
TL;DR
CoastTerm addresses the need for a multidisciplinary corpus for extracting and classifying terms in coastal scientific literature. The authors adapt the ARDI framework to annotate Actors, Resources, Processes, Quality, and Location, and create two gold-annotated datasets (KB-recommended and human-recommended) for Automatic Term Extraction (ATE) and Automatic Term Classification (ATC), evaluated with monolingual and multilingual transformers (RoBERTa, XLM-R). The models reach an F1 of around 80% on ATE and around 70% on ATC, a promising first step towards automatic term extraction and labeling in coastal science and towards a coastal knowledge base. The corpus is intended to support cross-disciplinary understanding of terms and the development of knowledge graphs that can inform coast-related policy and research.
Abstract
The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and F1 of 70% for extracting terms and their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.
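To illustrate why the score for extracting terms together with their labels can be lower than the score for extraction alone, the following minimal sketch computes both F1 values for one toy sentence. It assumes BIO-encoded token tags and exact-match span scoring, which may differ from the evaluation protocol actually used for CoastTerm; the tag names (`Actor`, `Location`, `Resource`) and the function names are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, label) spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def f1(gold: set, pred: set) -> float:
    """Exact-match F1 between two sets of items."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def ate_and_atc_f1(gold_tags: List[str], pred_tags: List[str]) -> Tuple[float, float]:
    """Return (extraction F1, extraction-plus-label F1) for one sentence."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    # Extraction only: compare term boundaries and ignore the labels.
    ate = f1({(s, e) for s, e, _ in gold}, {(s, e) for s, e, _ in pred})
    # Extraction plus classification: boundaries and label must both match.
    atc = f1(gold, pred)
    return ate, atc


# Toy example (assumed tags): both terms are found, one receives the wrong label.
gold = ["B-Actor", "I-Actor", "O", "B-Location", "O"]
pred = ["B-Actor", "I-Actor", "O", "B-Resource", "O"]
ate, atc = ate_and_atc_f1(gold, pred)
print(f"ATE F1 = {ate:.2f}, ATC F1 = {atc:.2f}")
```

Under these assumptions, a prediction with correct boundaries but a wrong label still counts as a hit for extraction and as a miss for classification, so the second score can never exceed the first.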
