Software Mention Recognition with a Three-Stage Framework Based on BERTology Models at SOMD 2024

Thuy Nguyen Thi, Anh Nguyen Viet, Thin Dang Van, Ngan Nguyen Luu Thuy

TL;DR

This paper tackles Software Mention Recognition in scholarly publications (SOMD 2024 Sub-task I) by comparing three transformer-based strategies using BERT, SciBERT, and XLM-R. It introduces a three-stage framework (Approach 3) that first detects if a sentence contains any software mention, then extracts the mentions, and finally classifies them into 13 types, achieving the best results among the approaches. The XLM-R backbone with the three-stage pipeline achieves the highest F1 score (about 0.678) and secures 3rd place in the private test, highlighting the value of sentence-level gating and multi-stage refinement for domain-specific NER. The work demonstrates how staged architectures can mitigate data sparsity and improve robustness in specialized entity recognition tasks on the SoMeSci dataset, with implications for automated information extraction from scientific literature.

Abstract

This paper describes our systems for Sub-task I of the Software Mention Detection in Scholarly Publications (SOMD) shared task. We propose three approaches leveraging different pre-trained language models (BERT, SciBERT, and XLM-R) to tackle this challenge. Our best-performing system addresses the named entity recognition (NER) problem through a three-stage framework: (1) Entity Sentence Classification, which identifies sentences containing potential software mentions; (2) Entity Extraction, which detects mentions within the selected sentences; and (3) Entity Type Classification, which categorizes detected mentions into specific software types. Experiments on the official dataset demonstrate that our three-stage framework achieves competitive performance, surpassing both other participating teams and our alternative approaches. As a result, our XLM-R-based framework achieves a weighted F1-score of 67.80%, placing our team 3rd in Sub-task I for the Software Mention Recognition task.
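The control flow of the three stages can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the function `three_stage_ner`, the `Mention` type, and the toy lexicon-based stand-ins are hypothetical names introduced here, and in the paper each stage would be a fine-tuned BERT, SciBERT, or XLM-R model rather than a lookup. The labels `E_1` and `E_2` are placeholders, not actual software types.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # [start, end) token indices


@dataclass
class Mention:
    text: str
    span: Span
    label: str


def three_stage_ner(
    tokens: List[str],
    has_mention: Callable[[List[str]], bool],
    extract_spans: Callable[[List[str]], List[Span]],
    classify_type: Callable[[List[str], Span], str],
) -> List[Mention]:
    """Run the three stages in sequence on one tokenized sentence."""
    # Stage 1: sentence-level gate; sentences judged mention-free are skipped.
    if not has_mention(tokens):
        return []
    # Stage 2: locate untyped mention spans in the sentence.
    spans = extract_spans(tokens)
    # Stage 3: assign a type to every extracted span.
    return [
        Mention(" ".join(tokens[s:e]), (s, e), classify_type(tokens, (s, e)))
        for s, e in spans
    ]


# Hypothetical stand-ins for the three models (assumption: a tiny lexicon).
TOY_LEXICON = {("Celeste",): "E_1", ("C", "#"): "E_2"}


def toy_extract(tokens: List[str]) -> List[Span]:
    spans, i = [], 0
    while i < len(tokens):
        for phrase in TOY_LEXICON:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                spans.append((i, i + n))
                i += n
                break
        else:  # no lexicon phrase starts at position i
            i += 1
    return spans


def toy_gate(tokens: List[str]) -> bool:
    return bool(toy_extract(tokens))


def toy_classify(tokens: List[str], span: Span) -> str:
    return TOY_LEXICON[tuple(tokens[span[0]:span[1]])]


sentence = "Celeste was written in C #".split()
for mention in three_stage_ner(sentence, toy_gate, toy_extract, toy_classify):
    print(mention)
```

Passing the stages in as callables keeps the gating logic independent of the backbone, so swapping one pre-trained model for another would not change the pipeline code in this sketch.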

Paper Structure (11 sections, 1 figure, 7 tables)


Figures (1)

  • Figure 1: Overview of the three approaches. The sample input is "Celeste was written in C #", which contains two entities, E_1 and E_2, representing two entity types in this example.
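
To make the sample input in the caption concrete, the sketch below shows one common way token-level NER output is turned back into typed mentions. It assumes a BIO tagging scheme and assumes that "Celeste" is the E_1 entity and the two tokens "C #" form the E_2 entity; neither assumption is confirmed by the caption, and `decode_bio` is a hypothetical helper, not code from the paper.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (mention text, entity type) pairs."""
    mentions: List[Tuple[str, str]] = []
    current: List[str] = []
    label = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention starts; close any mention still open.
            if current:
                mentions.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the open mention.
            current.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close the open mention.
            if current:
                mentions.append((" ".join(current), label))
            current, label = [], None
    if current:
        mentions.append((" ".join(current), label))
    return mentions


tokens = ["Celeste", "was", "written", "in", "C", "#"]
tags = ["B-E_1", "O", "O", "O", "B-E_2", "I-E_2"]  # assumed tagging
print(decode_bio(tokens, tags))  # [('Celeste', 'E_1'), ('C #', 'E_2')]
```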