Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian
Aleksei Dorkin, Taido Purason, Kairit Sirts
TL;DR
This work investigates vocabulary optimization for Estonian in a multilingual encoder (mDeBERTa v3) by comparing two strategies: retraining the tokenizer and pruning unused tokens. Although a language-specific vocabulary yields substantial gains in tokenization and parameter efficiency, retraining the tokenizer consistently degrades NER performance even after subsequent LoRA-based continued training, suggesting that longer embedding tuning may be needed. In contrast, vocabulary pruning reduces model size without harming NER accuracy, indicating a practical path to efficiency for monolingual use of multilingual models. Overall, pruning offers tangible efficiency gains with preserved downstream performance, while tokenizer replacement requires more extensive re-tuning to be beneficial for NER.
Abstract
Adapting multilingual language models to specific languages can enhance both their efficiency and performance. In this study, we explore how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance on the Named Entity Recognition (NER) task. The motivations for adjusting the vocabulary are twofold: practical benefits affecting the computational cost, such as reducing the input sequence length and the model size, and performance enhancements by tailoring the vocabulary to the particular language. We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens -- and assess their impact on the model's performance, particularly after continual training. While retraining the tokenizer degraded performance on the NER task, suggesting that longer embedding tuning might be needed, we observed no negative effects from pruning.
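
To make the pruning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy vocabulary stored as a token-to-id dictionary and an embedding matrix stored as a list of rows; the function name `prune_vocabulary` and all toy data are hypothetical. Tokens that never occur in the target-language corpus are dropped, the remaining ids are renumbered, and the retained embedding rows are copied unchanged.

```python
def prune_vocabulary(vocab, embeddings, corpus_ids, special_tokens):
    """Keep only tokens that occur in the corpus, plus special tokens.

    vocab:          dict mapping token string -> integer id
    embeddings:     list of rows, embeddings[i] is the vector for id i
    corpus_ids:     iterable of token ids observed in the target corpus
    special_tokens: token strings that must survive pruning
    """
    used = set(corpus_ids)
    used.update(vocab[token] for token in special_tokens)

    # Sort so that kept tokens preserve their original relative order.
    kept_old_ids = sorted(used)
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}

    id_to_token = {idx: token for token, idx in vocab.items()}
    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}

    # Retained rows are copied as-is; no embedding values are modified.
    new_embeddings = [embeddings[old] for old in kept_old_ids]
    return new_vocab, new_embeddings, old_to_new


# Hypothetical toy data: a 7-token "multilingual" vocabulary.
vocab = {"[PAD]": 0, "[UNK]": 1, "Tallinn": 2, "Paris": 3,
         "on": 4, "linn": 5, "ville": 6}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Ids observed when tokenizing a (toy) Estonian corpus.
corpus_ids = [2, 4, 5, 2]

new_vocab, new_embeddings, old_to_new = prune_vocabulary(
    vocab, embeddings, corpus_ids, special_tokens=["[PAD]", "[UNK]"]
)

# Previously tokenized text can be remapped to the pruned id space.
remapped = [old_to_new[i] for i in corpus_ids]
```

Because the surviving rows are untouched in this sketch, the model sees the same vectors for every token that still appears in the target-language text; only the size of the embedding matrix changes. A retrained tokenizer, by contrast, introduces tokens with no pretrained embedding, which is consistent with the abstract's suggestion that it may need longer embedding tuning.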
