BioMNER: A Dataset for Biomedical Method Entity Recognition
Chen Tang, Bohao Yang, Kun Zhao, Bo Lv, Chenghao Xiao, Frank Guerin, Chenghua Lin
TL;DR
This work tackles Biomedical Method NER (BioMethod NER) amid rapidly expanding domain terminology and limited resources by introducing a high-quality annotated dataset built with an auxiliary annotation pipeline that leverages rule-based cues, ChatGPT, and information retrieval for candidate nomination and validation. It comprehensively benchmarks conventional sequence-labeling models and large-language models, finding that very large LLMs struggle to learn domain-specific extraction patterns, while a compact ALBERT (11MB) paired with CRF achieves state-of-the-art performance. The results highlight the value of explicit sequence-pattern modeling (CRF) within domain-specific NER and suggest that resource-efficient models can outperform bulky LLMs in specialized biomedical tasks. The dataset and annotation methodology offer practical guidance for reducing annotation effort and improving inter-annotator agreement, with broad implications for advancing BioMethod NER research and deployment.
Abstract
Named entity recognition (NER) is a fundamental task in Natural Language Processing. Biomedical Method NER is particularly challenging because scholarly literature continually introduces new domain-specific terminology. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily because methodological concepts are intricate and require deep understanding to delineate precisely. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including cutting-edge large language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective learning of entity extraction patterns for biomedical methods. Remarkably, an approach that pairs the modestly sized ALBERT model (only 11MB) with conditional random fields (CRF) achieves state-of-the-art (SOTA) performance.
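To illustrate why a CRF layer can help with sequence patterns, the following is a minimal sketch of Viterbi decoding over BIO tags. It is not the implementation used in this work: the tag set (`O`, `B-Method`, `I-Method`), the example sentence, the emission scores, and the single hard transition constraint are all assumptions chosen for illustration. In a trained model the emission scores would come from an encoder such as ALBERT, and the transition scores would be learned rather than fixed.

```python
from typing import Dict, List

# Hypothetical BIO tag set for method entities (an assumption, not from the paper).
TAGS = ["O", "B-Method", "I-Method"]
NEG_INF = float("-inf")


def transition_score(prev: str, curr: str) -> float:
    """Illustrative transition scores: forbid O -> I-Method, allow everything else."""
    if prev == "O" and curr == "I-Method":
        return NEG_INF
    return 0.0


def greedy(emissions: List[Dict[str, float]]) -> List[str]:
    """Pick the best tag for each token independently (no sequence modeling)."""
    return [max(TAGS, key=lambda t: em[t]) for em in emissions]


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence under emission + transition scores."""
    if not emissions:
        return []
    # A sequence may not start inside an entity.
    best = {t: (NEG_INF if t == "I-Method" else emissions[0][t]) for t in TAGS}
    backpointers: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in TAGS:
            # Best previous tag for reaching `curr` at this position.
            prev = max(TAGS, key=lambda p: best[p] + transition_score(p, curr))
            new_best[curr] = best[prev] + transition_score(prev, curr) + em[curr]
            pointers[curr] = prev
        backpointers.append(pointers)
        best = new_best
    # Follow the backpointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up emission scores for: "We used polymerase chain reaction"
EMISSIONS = [
    {"O": 2.0, "B-Method": 0.0, "I-Method": 0.0},  # We
    {"O": 2.0, "B-Method": 0.0, "I-Method": 0.0},  # used
    {"O": 1.0, "B-Method": 0.8, "I-Method": 0.2},  # polymerase
    {"O": 0.0, "B-Method": 0.1, "I-Method": 2.0},  # chain
    {"O": 0.0, "B-Method": 0.0, "I-Method": 2.0},  # reaction
]

print(greedy(EMISSIONS))   # ['O', 'O', 'O', 'I-Method', 'I-Method']
print(viterbi(EMISSIONS))  # ['O', 'O', 'B-Method', 'I-Method', 'I-Method']
```

In this toy case, per-token prediction labels "polymerase" as `O` and then continues with `I-Method`, which is not a valid BIO sequence. Decoding over the whole sequence with the transition constraint recovers a well-formed entity span starting at "polymerase".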
