Fast Vocabulary Transfer for Language Model Compression

Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni

TL;DR

The paper tackles the high cost of deploying large pre-trained language models (LMs) by introducing Fast Vocabulary Transfer (FVT), a lightweight method for adapting a general-domain LM to a smaller, in-domain tokenizer. FVT initializes the in-domain token embeddings from those of the general LM, then fine-tunes the model with masked language modeling and on the downstream task. This reduces model size and speeds up inference while largely preserving performance, particularly in specialized domains such as medicine and law. The study shows that vocabulary transfer (VT) is complementary to knowledge distillation (KD), enabling further compression, up to approximately 2.75x, without substantial accuracy loss. Overall, VT, and FVT in particular, offers a practical avenue for model deployment across vertical domains that is orthogonal to other compression techniques, with potential for deeper integration with KD in future work.
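
The embedding initialization step can be illustrated with a minimal sketch. It assumes that a token present in both vocabularies keeps its pre-trained embedding, and that a new in-domain token receives the mean of the general-LM embeddings of the pieces the general tokenizer splits it into. The helper names (`split_into_pieces`, `fvt_init`), the greedy longest-match splitter, and the toy 2-dimensional embeddings are all hypothetical stand-ins, not the authors' implementation.

```python
from statistics import fmean


def split_into_pieces(token, general_vocab):
    """Greedy longest-match split of a whole-word token (toy general tokenizer)."""
    pieces, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            # Continuation pieces carry the WordPiece-style "##" prefix.
            piece = token[start:end] if start == 0 else "##" + token[start:end]
            if piece in general_vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No piece matched at this position: fall back to the unknown token.
            return ["[UNK]"]
    return pieces


def fvt_init(in_domain_vocab, general_emb):
    """Build in-domain embeddings from general-LM embeddings (assumed scheme)."""
    new_emb = {}
    for token in in_domain_vocab:
        if token in general_emb:
            # Shared token: copy the pre-trained vector unchanged.
            new_emb[token] = list(general_emb[token])
        else:
            # New token: average the vectors of its general-tokenizer pieces.
            vectors = [general_emb[p] for p in split_into_pieces(token, general_emb)]
            new_emb[token] = [fmean(dim) for dim in zip(*vectors)]
    return new_emb


# Toy general-LM embedding table (values chosen only for illustration).
general_emb = {
    "[UNK]": [0.0, 0.0],
    "the": [1.0, 0.0],
    "hyper": [0.0, 2.0],
    "##tension": [1.0, 4.0],
}
new_emb = fvt_init(["the", "hypertension"], general_emb)
print(new_emb)
```

Here "hypertension" is absent from the general vocabulary, so its vector is the average of the vectors for "hyper" and "##tension". In a real model the resulting table would replace the embedding matrix before the fine-tuning stages described above.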

Abstract

Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Sketch of the VT procedure. First, the vocabulary is constructed on the in-domain data, then an embedding is assigned to each token, transferring information from the pre-trained representations of the general-purpose language model.
  • Figure 2: Example of different tokenizations using a pre-trained or an adapted tokenizer. In the latter case, domain-specific words are not broken down into multiple word pieces.
  • Figure 3: Sequence length distribution of each tokenizer on ADE, LEDGAR and CoNLL03 (left to right).
  • Figure 4: F1-score vs. model size of VT with or without KD on ADE. VT and KD together can further compress an LM's size in exchange for a limited performance drop. FVT is better than PVT. A smaller vocabulary size does not always imply lower performance.
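
The effect described in Figures 2 and 3 can be reproduced on a toy scale. The sketch below tokenizes one sentence with a general vocabulary and with an adapted one and compares the resulting sequence lengths. Both vocabularies, the sentence, and the greedy longest-match tokenizer (`wordpiece`, `tokenize`) are hypothetical and chosen only for illustration; real tokenizers are trained on corpora and differ in detail.

```python
def wordpiece(word, vocab):
    """Greedy longest-match split of one word into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            # Continuation pieces carry the WordPiece-style "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches here: emit a single unknown token.
            return ["[UNK]"]
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [p for word in text.lower().split() for p in wordpiece(word, vocab)]


# Hypothetical vocabularies: the general one lacks the domain-specific words.
general_vocab = {
    "patient", "developed", "after",
    "hyper", "##tension", "ib", "##up", "##ro", "##fen",
}
in_domain_vocab = {"patient", "developed", "after", "hypertension", "ibuprofen"}

sentence = "Patient developed hypertension after ibuprofen"
general_tokens = tokenize(sentence, general_vocab)
adapted_tokens = tokenize(sentence, in_domain_vocab)

print(len(general_tokens), general_tokens)
print(len(adapted_tokens), adapted_tokens)
```

With these toy vocabularies the adapted tokenizer keeps "hypertension" and "ibuprofen" whole, so the sequence is shorter. Shorter sequences are one reason an adapted tokenizer can speed up inference.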