How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

Atsuki Yamaguchi; Aline Villavicencio; Nikolaos Aletras

TL;DR

The paper tackles vocabulary expansion for LLMs in extremely low-resource settings, with the goal of reducing non-English inference costs. It systematically compares target parameter initialization strategies (Mean, Merge, Align, random, FOCUS) and training regimes (LoRA-based 2x2 LS, 2-stage tuning, causal language modeling (CLM) vs. multi-token prediction (MTP) objectives, shorter sequences) using only 30K sentences per language. The findings show that simple heuristic initializations (Mean/Align), combined with focused fine-tuning and shorter sequences, can yield competitive generation performance with substantial inference speedups, although the CPT-only baseline often remains strong on generation tasks. The work also introduces ElChat as a post-hoc, training-free method to recover source-language capabilities after vocabulary expansion, highlighting practical remedies for real-world deployment in low-resource scenarios.
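To make the Mean heuristic concrete, here is a minimal sketch. It assumes that Mean sets each new token's embedding to the average of the source-model embeddings of the subtokens that the source tokenizer splits it into; this page only names the method, so that reading is an assumption. The names `mean_init`, `toy_vocab`, `toy_embeddings`, and `toy_tokenize` are hypothetical, and the character-level tokenizer is a stand-in for a real subword tokenizer.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean_init(
    new_tokens: Sequence[str],
    source_tokenize: Callable[[str], List[int]],
    source_embeddings: Sequence[Vector],
) -> Dict[str, Vector]:
    """Return one embedding per new token: the mean of its source subtoken embeddings."""
    dim = len(source_embeddings[0])
    initialized: Dict[str, Vector] = {}
    for token in new_tokens:
        ids = source_tokenize(token)
        if not ids:
            raise ValueError(f"source tokenizer produced no ids for {token!r}")
        # Average each embedding dimension over the source subtokens.
        initialized[token] = [
            sum(source_embeddings[i][d] for i in ids) / len(ids)
            for d in range(dim)
        ]
    return initialized


# Toy setup (assumption): a character-level "source tokenizer" with 2-d embeddings.
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def toy_tokenize(text: str) -> List[int]:
    return [toy_vocab[ch] for ch in text]


new_embeddings = mean_init(["ab", "abc"], toy_tokenize, toy_embeddings)
```

In a real model the same averaging would be applied to the rows of the input embedding matrix (and, where the output head is untied, plausibly to the output projection as well) after the matrix has been resized for the new tokens.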

Abstract

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain competitive downstream performance to baselines. This is achieved with only 30K sentences ($\sim$0.01GB text data) from the target language.
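The cost argument in the abstract can be illustrated with a small sketch: if a tokenizer splits target-language text into more tokens, generation needs more decoding steps, so the ratio of average token counts before and after expansion is a rough proxy for the attainable reduction in steps. Everything below is illustrative: `char_tokenize` stands in for an overfragmenting source tokenizer, `make_greedy_tokenizer` is a hypothetical longest-match tokenizer over an expanded vocabulary, and the sentences are toy data rather than anything from the paper.

```python
from typing import Callable, List, Sequence, Set


def char_tokenize(text: str) -> List[str]:
    """Stand-in for a source tokenizer that overfragments the target language."""
    return list(text)


def make_greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Build a greedy longest-match tokenizer; unknown spans fall back to characters."""
    max_len = max((len(t) for t in vocab), default=1)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first; a single character always matches.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return tokenize


def average_tokens(sentences: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per sentence under a given tokenizer."""
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


sentences = ["abcabc", "abc"]
expanded_tokenize = make_greedy_tokenizer({"abc"})

source_avg = average_tokens(sentences, char_tokenize)
expanded_avg = average_tokens(sentences, expanded_tokenize)
step_ratio = source_avg / expanded_avg  # fewer tokens means fewer decoding steps
```

The ratio only bounds the speedup from shorter sequences; measured wall-clock gains also depend on the larger embedding and output matrices that come with an expanded vocabulary.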

Paper Structure (61 sections, 5 figures, 15 tables)

Introduction
Background
Text Overfragmentation
Cross-lingual Vocabulary Adaptation
Problem Statement
Target Parameter Initialization
Random Initialization
Initialization Based on Auxiliary Models
Heuristic-based Initialization
Mean
Merge
Align
Training Strategy
Training Procedure
Objective Function
...and 46 more sections

Figures (5)

Figure 1: We address the challenge of effectively expanding vocabulary for LLMs in low-resource settings. This is crucial for reducing inference steps when generating non-English text, as LLMs often rely on English-centric tokenizers and vocabulary. Our approach explores various adaptation strategies to achieve inference speedups while aiming to retain competitive performance. Our recommended strategy combines heuristic-based parameter initialization for new tokens with fine-tuning the top and bottom two layers of the model, using a short input sequence length and a multi-token prediction objective (Gloeckle et al., 2024).
Figure 2: Average number of tokens on the FLORES-200 dev set across languages and models.
Figure 3: Downstream performance and inference speedup in (a) MT and (b) SUM across different $|\mathcal{V}_\text{new}|$. Red and gray dotted lines denote Source and CPT-only.
Figure 4: Average target token ratio in input ($\times$) and output ($\bullet$) with respect to $|\mathcal{V}_\text{new}|$ across models, languages, and tasks.
Figure 5: Downstream performance and inference speedup in (a) MC and (b) GMMLU across different $|\mathcal{V}_\text{new}|$. Red and gray dotted lines denote Source and CPT-only.
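The layer selection described in the Figure 1 caption (fine-tuning the top and bottom two layers) can be sketched as a simple index filter. This is an illustration under stated assumptions, not the authors' implementation: `select_trainable_layers` and `is_trainable` are hypothetical helpers, and the `model.layers.<index>.` parameter naming is assumed to follow the pattern used by Llama-style checkpoints.

```python
from typing import List, Sequence


def select_trainable_layers(num_layers: int, k: int = 2) -> List[int]:
    """Return indices of the bottom k and top k transformer layers."""
    if num_layers <= 0 or k < 0:
        raise ValueError("num_layers must be positive and k non-negative")
    if 2 * k >= num_layers:
        # Small model: the two ranges would overlap, so select every layer once.
        return list(range(num_layers))
    return list(range(k)) + list(range(num_layers - k, num_layers))


def is_trainable(
    param_name: str,
    trainable_layers: Sequence[int],
    layer_prefix: str = "model.layers.",  # assumed naming convention
) -> bool:
    """Check whether a parameter belongs to one of the selected layers."""
    if not param_name.startswith(layer_prefix):
        return False
    index_text = param_name[len(layer_prefix):].split(".", 1)[0]
    # Parse the index explicitly so that layer 3 is not confused with layer 30.
    return index_text.isdigit() and int(index_text) in trainable_layers


layers = select_trainable_layers(num_layers=32, k=2)
```

With a LoRA setup, the same index list could be used to restrict which layers receive adapters; how embeddings and the output head are handled is a separate choice that this sketch does not cover.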
