
Vocabulary Transfer for Biomedical Texts: Add Tokens if You Can Not Add Data

Priyanka Singh, Vladislav D. Mosin, Ivan P. Yamshchikov

TL;DR

This study investigates the potential of vocabulary transfer to enhance model performance in biomedical NLP tasks by focusing on vocabulary extension, a technique that involves expanding the target vocabulary to incorporate domain-specific biomedical terms.

Abstract

Working within specific NLP subdomains presents significant challenges, primarily due to a persistent deficit of data. Stringent privacy concerns and limited data accessibility often drive this shortage. Additionally, the medical domain demands high accuracy, where even marginal improvements in model performance can have profound impacts. In this study, we investigate the potential of vocabulary transfer to enhance model performance in biomedical NLP tasks. Specifically, we focus on vocabulary extension, a technique that involves expanding the target vocabulary to incorporate domain-specific biomedical terms. Our findings demonstrate that vocabulary extension leads to measurable improvements in both downstream model performance and inference time.
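
To make the idea of vocabulary extension concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy greedy longest-match tokenizer and assumes that each new token's embedding is initialized as the mean of the embeddings of the pieces it replaces; the abstract does not specify the initialization scheme, and all names here (tokenize, extend_vocabulary, the toy vocabulary and vectors) are hypothetical.

```python
from typing import Dict, List, Tuple


def tokenize(word: str, vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match segmentation of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"cannot tokenize {word!r} at position {start}")
        pieces.append(word[start:end])
        start = end
    return pieces


def extend_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_terms: List[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Add domain terms as whole tokens and give each one an embedding row.

    Assumption: a new row is the mean of the rows of the pieces that the
    ORIGINAL vocabulary would have used for that term.
    """
    new_vocab = dict(vocab)
    new_embeddings = [row[:] for row in embeddings]
    for term in new_terms:
        if term in new_vocab:
            continue
        pieces = tokenize(term, vocab)
        rows = [embeddings[vocab[p]] for p in pieces]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        new_vocab[term] = len(new_embeddings)
        new_embeddings.append(mean)
    return new_vocab, new_embeddings


# Toy general-domain vocabulary with 2-dimensional embeddings (made-up values).
base_vocab = {"hyper": 0, "tension": 1, "cardio": 2, "myo": 3, "pathy": 4}
base_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]

vocab, embeddings = extend_vocabulary(
    base_vocab, base_embeddings, ["hypertension", "cardiomyopathy"]
)

before = tokenize("cardiomyopathy", base_vocab)  # three pieces
after = tokenize("cardiomyopathy", vocab)        # one piece
print(before, after)
```

In this sketch the extended vocabulary encodes a biomedical term as a single token instead of several pieces. Shorter token sequences are one plausible reason an extended vocabulary could reduce inference time, although the abstract does not state the mechanism.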

Paper Structure (9 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Performance on Ohsumed data, vocabulary size is 32 000
  • Figure 2: Performance on Kaggle Medical dataset, vocabulary size is 32 000
  • Figure 3: Change in classifier accuracy and inference time on the Kaggle Medical dataset with respect to vocabulary size, VIPI only
  • Figure 4: Relative change in accuracy and inference time of downstream classifiers on the Kaggle Medical dataset with respect to vocabulary size, after MLM and VIPI