NanoNER: Named Entity Recognition for nanobiology using experts' knowledge and distant supervision

Martin Lentschat, Cyril Labbé, Ran Cheng

TL;DR

NanoNER tackles the scarcity of annotated data for nanobiology NER by coupling ontology-driven vocabulary curation with distant supervision to automatically annotate a large corpus (728 full-text articles, >120k entity occurrences). Using five domain labels and 1,438 terms derived from the Nanoparticle Ontology (NPO) and eNanoMapper (ENM), the system achieves high recognition performance on known entities (F1 ≈ 0.98) and reasonable precision (0.77–0.81) on newly discovered entities. Ablation studies reveal a strong dependency on vocabulary coverage, with reorganizing term frequency drastically affecting precision and recall, and demonstrate the model’s capacity to rediscover up to about 30% of ablated terms. The work highlights a scalable, low-manpower path to domain-specific NER that can be generalized to other specialized fields with appropriate ontologies.

Abstract

Here we present the training and evaluation of NanoNER, a Named Entity Recognition (NER) model for nanobiology. NER consists of identifying specific entities in spans of unstructured text and is often a primary task in Natural Language Processing (NLP) and Information Extraction. The aim of our model is to recognise entities previously identified by domain experts as constituting the essential knowledge of the domain. Relying on ontologies, which provide us with a domain vocabulary and taxonomy, we implemented an iterative process enabling experts to determine the entities relevant to the domain at hand. We then explore the potential of distant supervision learning in NER, showing how this method can increase the quantity of annotated data with minimal additional manpower. On our full corpus of 728 full-text nanobiology articles, containing more than 120k entity occurrences, NanoNER obtained an F1-score of 0.98 on the recognition of previously known entities. Our model also demonstrated its ability to discover new entities in the text, with precision scores ranging from 0.77 to 0.81. Ablation experiments further confirmed this ability: they highlighted the dependency of our approach on the external resources, while also showing that the model can rediscover up to 30% of the ablated terms. This paper details the methodology employed, the experimental design, and the key findings, providing insights and directions for future research on NER in specialized domains. Furthermore, since our approach requires minimal manpower, we believe that it can be generalized to other specialized fields.
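To make the idea of distant supervision concrete, the following is a minimal sketch of dictionary-based annotation, in which a curated vocabulary is projected onto raw text to produce BIO tags without manual labelling. It is an illustration under stated assumptions, not the NanoNER pipeline: the vocabulary entries, the label names, the tokenizer, and the function names `tokenize` and `distant_annotate` are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical toy vocabulary (term -> label). The actual NanoNER terms and
# its five label names are not reproduced here.
VOCAB: Dict[str, str] = {
    "gold nanoparticle": "NANOPARTICLE",
    "zeta potential": "PROPERTY",
    "cytotoxicity": "EFFECT",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def distant_annotate(tokens: List[str], vocab: Dict[str, str]) -> List[str]:
    """Tag tokens with BIO labels by exact matching against the vocabulary.

    Longer terms are tried first so that a multi-word term wins over any
    shorter term starting at the same position.
    """
    entries = sorted(
        ((tuple(tokenize(term)), label) for term, label in vocab.items()
         if tokenize(term)),  # skip terms that tokenize to nothing
        key=lambda entry: -len(entry[0]),
    )
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for term, label in entries:
            n = len(term)
            if tuple(tokens[i:i + n]) == term:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n  # jump past the matched span
                break
        else:
            i += 1  # no vocabulary term starts at this token
    return tags


sentence = "Gold nanoparticle cytotoxicity depends on zeta potential."
tokens = tokenize(sentence)
for token, tag in zip(tokens, distant_annotate(tokens, VOCAB)):
    print(f"{token}\t{tag}")
```

Tags produced this way can serve as training data for a sequence labeller. Because exact matching only finds terms already in the vocabulary, the quality of the resulting annotations depends directly on vocabulary coverage, which is the dependency the ablation experiments examine.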

Paper Structure (22 sections, 10 tables)