Nested Named Entity Recognition in Plasma Physics Research Articles
Muhammad Haris, Hans Höft, Markus M. Becker, Markus Stocker
TL;DR
This work tackles nested NER in plasma physics articles, where domain-specific terminology and overlapping entity spans hinder standard approaches. It proposes a lightweight BERT-CRF model with per-entity-type specialization, trained on a newly annotated 16-class plasma physics dataset, and further enhanced by Bayesian Optimization to tune hyperparameters (e.g., $learning\_rate$, $batch\_size$, $weight\_decay$) for improved $F_1$. The authors evaluate on the plasma dataset and cross-dataset benchmarks (GENIA and the Chilean Waiting List), showing competitive $F_1$ scores and notably high recall, while maintaining lower architectural complexity than many state-of-the-art nested NER models. The dataset and code are released to enable reproducible, domain-specific information extraction that supports improved literature search and knowledge discovery in plasma physics.
Abstract
Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on transformer encoders and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.
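To illustrate why training one model per entity type can yield nested entities, the following minimal Python sketch merges the outputs of several single-type taggers. It is not the authors' implementation: the BIO tagging scheme, the merge-by-union step, the function names `bio_to_spans` and `merge_predictions`, and the entity type names `Device` and `Property` are all assumptions made for illustration. Each tagger emits a flat tag sequence, and because the sequences are decoded independently, spans of different types may overlap or contain one another.

```python
from typing import Dict, List, Tuple

# (start_token, end_token_exclusive, entity_type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str], entity_type: str) -> List[Span]:
    """Decode one single-type BIO sequence ("B", "I", "O") into spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new entity begins; close any entity that is still open.
            if start is not None:
                spans.append((start, i, entity_type))
            start = i
        elif tag == "I":
            # Tolerate an "I" without a preceding "B" by opening a span here.
            if start is None:
                start = i
        else:
            # "O" closes any open entity.
            if start is not None:
                spans.append((start, i, entity_type))
                start = None
    if start is not None:
        spans.append((start, len(tags), entity_type))
    return spans


def merge_predictions(per_type_tags: Dict[str, List[str]]) -> List[Span]:
    """Union the spans of all single-type taggers (assumed merge strategy).

    Spans are sorted by start position, with longer (outer) spans first.
    """
    spans: List[Span] = []
    for entity_type, tags in per_type_tags.items():
        spans.extend(bio_to_spans(tags, entity_type))
    return sorted(spans, key=lambda s: (s[0], -s[1], s[2]))


# Hypothetical example: two taggers label the same four tokens.
tokens = ["atmospheric", "pressure", "plasma", "jet"]
per_type_tags = {
    "Device": ["B", "I", "I", "I"],    # hypothetical entity type
    "Property": ["B", "I", "O", "O"],  # hypothetical entity type
}
nested = merge_predictions(per_type_tags)
for start, end, entity_type in nested:
    print(entity_type, " ".join(tokens[start:end]))
```

In this sketch the `Property` span ("atmospheric pressure") lies inside the `Device` span ("atmospheric pressure plasma jet"), which a single flat tagger could not represent with one tag per token.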
