Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

TL;DR

This work identifies a fundamental ill-tokenization issue in BPE-based vocabulary adaptation when domain-specific terms are appended to a PLM's vocabulary. It proposes AdaptBPE, which initializes tokenization by longest-substring matching against the added domain vocabulary before character-level splitting, improving domain-term tokenization without depending on how the domain vocabulary is constructed. Across eight datasets spanning classification and medical summarization, AdaptBPE yields notable improvements in accuracy and Rouge-L, along with substantial reductions in fragment scores, and is complemented by a positive human evaluation in the medical domain. The study highlights the importance of tokenization strategy in domain adaptation and provides an open-source implementation to facilitate adoption.

Abstract

In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary at the end of the PLM vocabulary. The appended tokens therefore receive a lower priority score, which causes sub-optimal tokenization in BPE, since BPE tokenizes a given text by iteratively applying merge rules. To mitigate this issue, we propose AdaptBPE, in which the BPE tokenization initialization phase is modified to first perform longest string matching on the added (target) vocabulary before tokenizing at the character level. We perform an extensive evaluation of AdaptBPE versus standard BPE over various classification and summarization tasks; AdaptBPE improves by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L), respectively. AdaptBPE for MEDVOC works particularly well when reference summaries have a high out-of-vocabulary (OOV) concentration or are longer. We also conduct a human evaluation, revealing that AdaptBPE generates more relevant and more faithful summaries than MEDVOC. We make our codebase publicly available at https://github.com/gb-kgp/adaptbpe.
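
The initialization change described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The toy merge list, the word "anemia", the function names, and the greedy left-to-right matching strategy are all assumptions made for illustration. Real BPE tokenizers also handle pre-tokenization and byte-level symbols, which are omitted here.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

# Hypothetical toy setup: two "pretrained" merges, followed by merges
# appended for the added domain token "anemia" (lower priority = later rank).
PRETRAINED_MERGES: List[Pair] = [("a", "n"), ("m", "i")]
APPENDED_MERGES: List[Pair] = [("e", "m"), ("an", "em"), ("i", "a"), ("anem", "ia")]
RANKS: Dict[Pair, int] = {p: r for r, p in enumerate(PRETRAINED_MERGES + APPENDED_MERGES)}
ADDED_VOCAB: Set[str] = {"anemia"}


def init_chars(word: str) -> List[str]:
    """Standard BPE initialization: split the word into characters."""
    return list(word)


def init_longest_match(word: str, added_vocab: Set[str]) -> List[str]:
    """Assumed AdaptBPE-style initialization: greedily take the longest
    substring found in the added vocabulary, else fall back to one character."""
    symbols: List[str] = []
    i = 0
    while i < len(word):
        match = None
        for j in range(len(word), i + 1, -1):  # longest candidate first, length >= 2
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        if match is None:
            match = word[i]
        symbols.append(match)
        i += len(match)
    return symbols


def apply_merges(symbols: List[str], ranks: Dict[Pair, int]) -> List[str]:
    """Repeatedly merge the adjacent pair with the best (lowest) rank."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[k], symbols[k + 1]) for k in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule remains
        merged: List[str] = []
        k = 0
        while k < len(symbols):
            if k < len(symbols) - 1 and (symbols[k], symbols[k + 1]) == best:
                merged.append(symbols[k] + symbols[k + 1])
                k += 2
            else:
                merged.append(symbols[k])
                k += 1
        symbols = merged
    return symbols


def standard_bpe(word: str) -> List[str]:
    return apply_merges(init_chars(word), RANKS)


def adapt_bpe(word: str) -> List[str]:
    return apply_merges(init_longest_match(word, ADDED_VOCAB), RANKS)


print(standard_bpe("anemia"))  # ['an', 'e', 'mi', 'a']
print(adapt_bpe("anemia"))     # ['anemia']
```

In this toy setup, the higher-priority pretrained merge ("m", "i") fires before the appended merge ("e", "m"), so standard BPE can never assemble the added token. Matching against the added vocabulary first avoids the problem, and words that contain no added token are tokenized exactly as before.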

Paper Structure (26 sections, 3 figures, 8 tables, 1 algorithm)

Figures (3)

  • Figure 1: AdaptBPE modifies the initialization step of standard BPE by merging the characters that match the extended vocabulary (V_DOMAIN). The incorrect merge step of BPE for tokenizing the word "hypercholesterolemia" is highlighted by a red dashed box.
  • Figure 2: Comparison of human evaluation scores over 40 randomly selected test data points. In the evaluation with medical experts, AdaptBPE produces more relevant, coherent, and faithful summaries.
  • Figure 3: Instruction window as seen by an annotator participating in the study.