How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras
TL;DR
The paper tackles the problem of expanding LLM vocabularies for extremely low-resource languages in order to reduce non-English inference costs. It systematically compares initialization strategies for the new target-language parameters (Mean, Merge, Align, random, FOCUS) and training regimes (LoRA-based 2x2 LS, 2-stage tuning, CLM vs. MTP objectives, shorter sequences) using only 30K sentences per language. The findings show that simple heuristic initializations (Mean/Align), combined with focused fine-tuning and shorter sequences, can yield competitive generation performance with substantial inference speedups, while the CPT-only baseline often remains strong on generation tasks. The work also introduces ElChat as a post-hoc, training-free method to recover source-language capabilities after vocabulary expansion, highlighting practical remedies for real-world deployment in low-resource scenarios.
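To make the idea of a heuristic initialization concrete, here is a minimal sketch of one common reading of mean initialization: each new token's embedding is the average of the embeddings of the source-vocabulary pieces it splits into. This is an illustration under stated assumptions, not the paper's implementation. The names `mean_init`, `toy_tokenize`, `SOURCE_VOCAB`, and `SOURCE_EMB` are hypothetical, and the fallback to the global mean for tokens with no source pieces is an assumption added for completeness.

```python
from typing import Callable, Dict, List, Sequence


def mean_init(
    new_tokens: Sequence[str],
    source_tokenize: Callable[[str], List[int]],
    source_embeddings: List[List[float]],
) -> Dict[str, List[float]]:
    """Initialize each new token as the mean of its source sub-token embeddings."""
    dim = len(source_embeddings[0])
    n_rows = len(source_embeddings)
    # Assumed fallback: the mean of all source embeddings, used when the
    # source tokenizer returns no pieces for a new token.
    global_mean = [
        sum(row[d] for row in source_embeddings) / n_rows for d in range(dim)
    ]

    result: Dict[str, List[float]] = {}
    for token in new_tokens:
        ids = source_tokenize(token)
        if not ids:
            result[token] = list(global_mean)
            continue
        # Average the source embeddings dimension by dimension.
        result[token] = [
            sum(source_embeddings[i][d] for i in ids) / len(ids)
            for d in range(dim)
        ]
    return result


# Toy source vocabulary and 2-dimensional embeddings (hypothetical values).
SOURCE_VOCAB = {"ka": 0, "ta": 1, "na": 2}
SOURCE_EMB = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer: split into 2-character pieces, drop unknown ones."""
    pieces = [text[i:i + 2] for i in range(0, len(text), 2)]
    return [SOURCE_VOCAB[p] for p in pieces if p in SOURCE_VOCAB]


new_emb = mean_init(["kata", "katana", "zz"], toy_tokenize, SOURCE_EMB)
print(new_emb["kata"])  # average of the "ka" and "ta" rows
```

In a real setting the source tokenizer and embedding matrix would come from the pretrained model, and the same averaging would typically be applied to the output (LM head) rows as well as the input embeddings.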
Abstract
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs for non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in speeding up inference, previous work on vocabulary expansion has focused on high-resource settings, assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. Vocabulary expansion in low-resource settings, however, has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference while striving to maintain downstream performance competitive with baselines. This is achieved with only 30K sentences ($\sim$0.01GB text data) from the target language.
