How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

TL;DR

The paper tackles vocabulary expansion for LLMs in extremely low-resource languages, with the goal of reducing non-English inference costs. It systematically compares target parameter initialization strategies (Mean, Merge, Align, Random, FOCUS) and training regimes (LoRA-based 2x2 LS, 2-stage tuning, causal language modeling (CLM) vs. multi-token prediction (MTP) objectives, shorter sequences) using only 30K sentences per language. The findings show that simple heuristic initializations (Mean/Align), combined with focused fine-tuning and shorter sequences, can yield competitive generation performance with substantial inference speedups, while CPT-only (continual pre-training alone) often remains strong on generation tasks. The work also introduces ElChat, a post-hoc, training-free method to recover source-language capabilities after vocabulary expansion, as a practical remedy for real-world deployment in low-resource scenarios.
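To make the Mean heuristic concrete, here is a minimal sketch in plain Python. It assumes that Mean initializes each new token's embedding as the average of the source-model embeddings of the pieces into which the source tokenizer splits that token; this summary does not define the method, so treat that reading, along with the hypothetical names `mean_init`, `new_token_pieces`, and `source_embeddings`, as illustrative rather than as the authors' implementation.

```python
def mean_init(new_token_pieces, source_embeddings):
    """Average the source embeddings of each new token's pieces.

    new_token_pieces: dict mapping a new token string to the list of
        source-vocabulary ids it decomposes into (hypothetical input).
    source_embeddings: list of embedding vectors indexed by source id.
    """
    new_embeddings = {}
    for token, piece_ids in new_token_pieces.items():
        if not piece_ids:
            raise ValueError(f"no source pieces for token {token!r}")
        vectors = [source_embeddings[i] for i in piece_ids]
        dim = len(vectors[0])
        # Element-wise mean over all piece vectors.
        new_embeddings[token] = [
            sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)
        ]
    return new_embeddings


# Toy example: a 3-token source vocabulary with 2-dimensional embeddings.
toy_source = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
toy_new = mean_init({"new_tok": [0, 2]}, toy_source)
print(toy_new["new_tok"])  # [1.5, 1.0]
```

In a real model the same averaging would typically be applied to both the input embedding matrix and the output (LM head) matrix, which is again an assumption here rather than something the summary states.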

Abstract

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs for non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings, assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain downstream performance competitive with baselines. This is achieved with only 30K sentences ($\sim$0.01GB text data) from the target language.

Paper Structure (61 sections, 5 figures, 15 tables)

Figures (5)

  • Figure 1: We address the challenge of effectively expanding vocabulary for LLMs in low-resource settings. This is crucial for reducing inference steps when generating non-English text, as LLMs often rely on English-centric tokenizers and vocabulary. Our approach explores various adaptation strategies to achieve inference speedups while aiming to retain competitive performance. Our recommended strategy combines heuristic-based parameter initialization for new tokens with fine-tuning the top and bottom two layers of the model, using a short input sequence length and a multi-token prediction objective (Gloeckle et al., 2024).
  • Figure 2: Average number of tokens on the FLORES-200 dev set across languages and models.
  • Figure 3: Downstream performance and inference speedup in (a) machine translation (MT) and (b) summarization (SUM) across different $|\mathcal{V}_\text{new}|$. Red and gray dotted lines denote Source and CPT-only.
  • Figure 4: Average target token ratio in input ($\times$) and output ($\bullet$) with respect to $|\mathcal{V}_\text{new}|$ across models, languages, and tasks.
  • Figure 5: Downstream performance and inference speedup in (a) multiple-choice (MC) and (b) Global MMLU (GMMLU) tasks across different $|\mathcal{V}_\text{new}|$. Red and gray dotted lines denote Source and CPT-only.
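
The layer selection named in the Figure 1 caption (fine-tuning only the top and bottom two layers) can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names `select_trainable_layers` and `is_trainable` are hypothetical, and the parameter naming pattern `layers.<index>.` is an assumption modeled on common decoder-only checkpoints. How the embeddings and output head are treated is not stated in this summary, so the sketch covers transformer layers only.

```python
def select_trainable_layers(num_layers, k=2):
    """Return the indices of the bottom-k and top-k transformer layers."""
    if num_layers < 2 * k:
        raise ValueError("model has too few layers for a disjoint top/bottom split")
    bottom = list(range(k))
    top = list(range(num_layers - k, num_layers))
    return bottom + top


def is_trainable(param_name, layer_ids):
    """Check whether a parameter belongs to one of the selected layers.

    Assumes names of the form '...layers.<index>....' (an assumption);
    parameters outside any numbered layer are reported as not trainable.
    """
    parts = param_name.split(".")
    for i, part in enumerate(parts[:-1]):
        if part == "layers" and parts[i + 1].isdigit():
            return int(parts[i + 1]) in layer_ids
    return False


layer_ids = select_trainable_layers(32)
print(layer_ids)  # [0, 1, 30, 31]
print(is_trainable("model.layers.31.self_attn.q_proj.weight", layer_ids))  # True
print(is_trainable("model.layers.15.mlp.up_proj.weight", layer_ids))  # False
```

With a predicate like this, one would freeze every parameter for which `is_trainable` returns false before fine-tuning, or restrict adapter modules to the selected layers.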