ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar

TL;DR

The paper addresses the problem of robust biomedical NLP under domain shift by introducing scispaCy, a spaCy-based toolkit that retrains POS tagging, dependency parsing, and NER on biomedical data and improves tokenization. It presents two end-to-end pipelines (en_core_sci_sm and en_core_sci_md) that trade off vocabulary size, word vectors, and speed, and reports competitive accuracy across multiple datasets together with substantial speed advantages. The evaluation covers POS tagging, dependency parsing, and NER across diverse corpora, and also assesses a candidate generation component for biomedical entity linking against UMLS that handles aliases and abbreviations. The work also converts GENIA data to Universal Dependencies and makes the converted resource available, which eases integration into Python-based biomedical information extraction workflows. Overall, scispaCy provides fast, robust, and adaptable tools for core NLP tasks in biomedicine that can support scalable downstream applications.

Abstract

Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new tool for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/

Paper Structure

This paper contains 21 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Growth of the annual number of cited references from 1650 to 2012 in the medical and health sciences (citing publications from 1980 to 2012). Figure from DBLP:journals/corr/BornmannM14.
  • Figure 2: Unlabeled attachment score (UAS) performance for a model trained with increasing amounts of web data incorporated. Values are the mean over 3 random seeds.
  • Figure 3: Gold Candidate Generation Recall for different values of K. Note that K refers to the number of nearest neighbour queries, and not the number of considered candidates. Murty2018HierarchicalLA do not report this distinction, but for a given K the same amount of work is done (retrieving K neighbours from the index), so results are comparable. For all K, the actual number of candidates is considerably lower on average.
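
The Figure 3 caption distinguishes K nearest-neighbour queries from the number of distinct candidates. A minimal sketch can illustrate the difference. This is not scispaCy's implementation: character-trigram Jaccard similarity stands in for the nearest-neighbour index, and the alias list, the concept IDs (C1 to C3), and all function names are hypothetical. Because several aliases can map to the same concept, K retrieved aliases can yield fewer than K candidates.

```python
from typing import List, Set, Tuple

# Hypothetical toy alias index: (alias string, concept ID). Not real UMLS data.
ALIAS_INDEX: List[Tuple[str, str]] = [
    ("myocardial infarction", "C1"),
    ("heart attack", "C1"),
    ("infarction of myocardium", "C1"),
    ("cardiac arrest", "C2"),
    ("heart failure", "C3"),
]

# Hypothetical evaluation set: (mention text, gold concept ID).
MENTIONS: List[Tuple[str, str]] = [
    ("myocardial infarct", "C1"),
    ("heart failure", "C3"),
]


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def generate_candidates(mention: str, index: List[Tuple[str, str]], k: int) -> Set[str]:
    """Retrieve the k most similar aliases, then collapse them to distinct concepts."""
    grams = char_ngrams(mention)
    ranked = sorted(index, key=lambda entry: jaccard(grams, char_ngrams(entry[0])), reverse=True)
    return {concept_id for _, concept_id in ranked[:k]}


def gold_recall_at_k(mentions: List[Tuple[str, str]], index: List[Tuple[str, str]], k: int) -> float:
    """Fraction of mentions whose gold concept appears among the candidates for k neighbours."""
    hits = sum(1 for text, gold in mentions if gold in generate_candidates(text, index, k))
    return hits / len(mentions)


if __name__ == "__main__":
    for k in (1, 3, 5):
        candidates = generate_candidates("myocardial infarct", ALIAS_INDEX, k)
        print(k, len(candidates), gold_recall_at_k(MENTIONS, ALIAS_INDEX, k))
```

In this toy setup, three neighbour queries for "myocardial infarct" return two aliases of the same concept, so the candidate set is smaller than K. This is the effect the caption describes when it notes that the actual number of candidates is lower on average.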