DeviceBERT: Applied Transfer Learning With Targeted Annotations and Vocabulary Enrichment to Identify Medical Device and Component Terminology in FDA Recall Summaries

Miriam Farrington

TL;DR

DeviceBERT tackles the challenge of identifying medical device terminology in FDA recall summaries by building a vocabulary-enriched extension of BioBERT and coupling it with a targeted annotation pipeline. The approach expands the tokenizer with a large domain-specific lexicon, applies BIO tagging for device terms, and uses regularization and cross-validation to prevent overfitting on limited data. Experimental results on OpenFDA recalls show substantial gains, with the Reg+Vocab configuration reaching an F1 of 83.56 and outperforming BioBERT by a notable margin, demonstrating practical potential for accurate device-entity extraction in recall analysis. The work highlights a scalable path to domain adaptation for sub-domains with scarce labeled data and points toward future extensions into device named entity linking and broader recall analytics.
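
To make the BIO tagging step concrete, the following is a minimal sketch of how a known device term could be labeled inside a tokenized recall sentence. It is not the paper's annotation code: the `bio_tag` helper, the `DEVICE` label name, and the example product name are all hypothetical, and whitespace tokenization stands in for whatever tokenizer the pipeline actually uses.

```python
from typing import List, Sequence


def bio_tag(tokens: Sequence[str], term: Sequence[str], label: str = "DEVICE") -> List[str]:
    """Label every occurrence of `term` in `tokens` using the BIO scheme.

    The first token of a match gets B-<label>, the remaining tokens of the
    match get I-<label>, and everything else stays O. Matching is
    case-insensitive. The label name is an assumption for illustration.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in term]
    n = len(target)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if lowered[i:i + n] == target:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"
            i += n  # skip past the match so spans never overlap
        else:
            i += 1
    return tags


# Hypothetical recall sentence and trade name, whitespace-tokenized.
tokens = "Recall of the Acme InfuPump 3000 due to battery failure".split()
tags = bio_tag(tokens, ["Acme", "InfuPump", "3000"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this sketch the three tokens of the trade name receive `B-DEVICE`, `I-DEVICE`, `I-DEVICE`, and all other tokens receive `O`.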

Abstract

FDA medical device recalls are critical, time-sensitive events that require swift identification of the impacted devices to inform the public and ensure patient safety. The OpenFDA device recall dataset contains valuable information about ongoing device recall actions, but manually extracting the relevant device information from the recall action summaries is time-consuming. Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that involves identifying and categorizing named entities in unstructured text. Existing NER models, including domain-specific models such as BioBERT, struggle to correctly identify medical device trade names, part numbers, and component terms within these summaries. To address this, we propose DeviceBERT, a medical device annotation, pre-processing, and enrichment pipeline that builds on BioBERT to identify and label medical device terminology in the device recall summaries with improved accuracy. Furthermore, we demonstrate that our approach can be applied effectively to entity recognition tasks where training data is limited or sparse.
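
As an illustration of how NER output of this kind is commonly scored, the sketch below decodes BIO tag sequences into entity spans and computes an exact-match, entity-level F1. This is an assumption about the evaluation style, not a description of the paper's metric implementation; the helper names `bio_to_spans` and `entity_f1` and the `DEVICE` label are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode a BIO tag sequence into a set of labeled spans.

    A stray I- tag with no open span of the same label is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if start is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 between gold and predicted tag sequences."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)  # spans that match on both boundaries and label
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction finds one of the two gold entities.
gold = ["O", "B-DEVICE", "I-DEVICE", "O", "B-DEVICE"]
pred = ["O", "B-DEVICE", "I-DEVICE", "O", "O"]
f1 = entity_f1(gold, pred)
print(f"{f1:.4f}")
```

Exact-match scoring is strict: a prediction that covers only part of a multi-token trade name counts as both a false positive and a false negative.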

Paper Structure (14 sections, 6 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: BioBERT Architecture (Lee et al., 2019)
  • Figure 2: DeviceBERT Process Overview
  • Figure 3: Partial text example of an annotated recall action with BIO tagging applied to a device trade name
  • Figure 4: Comparative score of all models on device entity recognition task
  • Figure 5: DeviceBERT Process Flow Diagram