How Vocabulary Sharing Facilitates Multilingualism in LLaMA?

Fei Yuan, Shuai Yuan, Zhiyong Wu, Lei Li

TL;DR

This work probes why English-centric LLMs can tackle many non-English languages by focusing on vocabulary sharing via embedding-only fine-tuning (Embed FT) in LLaMA. By fine-tuning on 10k en→x bilingual data across 101 languages and evaluating bilingual vs. multilingual performance on Flores-101, the authors uncover four stable quadrants—Reciprocal, Altruistic, Stagnant, and Selfish—each with distinct tuning strategies and outcomes. They demonstrate that vocabulary design and tokenization critically shape cross-lingual transfer, with practical quadrant-specific guidelines (e.g., Embed FT for reciprocal languages, small Full FT for altruistic, and subword-shortening for stagnant languages) and even a post-tokenization fix that yields significant gains. The findings offer data-efficient, language-aware routes to amplify multilingual capabilities in LLMs, informing deployment and further research on cross-language generalization across diverse linguistic families.

Abstract

Large Language Models (LLMs) often show strong performance on English tasks while exhibiting limitations on other languages. What is an LLM's multilingual capability when it is trained only on certain languages? The underlying mechanism remains unclear. This study examines the multilingual capability of LLMs from the vocabulary sharing perspective through an exhaustive analysis across 101 languages. By investigating the performance gap before and after embedding fine-tuning, we discovered four distinct quadrants. By delving into each quadrant, we provide actionable and efficient guidelines for tuning these languages. Extensive experiments reveal that existing LLMs possess multilingual capabilities that surpass our expectations, and that we can significantly improve the multilingual performance of LLMs based on the attributes of each quadrant. Code: https://github.com/CONE-MT/Vocabulary-Sharing-Facilitates-Multilingualism

Paper Structure

This paper contains 42 sections, 2 equations, 6 figures, and 16 tables.

Figures (6)

  • Figure 1: Multilingual capability quadrant. This graph, based on the TED dataset, plots the performance of models fine-tuned with bilingual instructions. Each point represents a model’s performance gain over the original LLaMA. The horizontal axis measures the improvement in bilingual performance, while the vertical axis indicates the enhancement in multilingual performance.
  • Figure 2: We evaluated the multilingual capabilities of various models on the Flores-101 dataset. The bar graph represents the direct inference results from the original LLaMA, while the line graph illustrates the multilingual performance of models trained on bilingual instruction data from en→ro, en→ms, en→no, and en→luo.
  • Figure 3: Comparing the Embed FT and Full FT strategies. For bilingual performance, both strategies are equally effective. For multilingual performance, however, Embed FT adapts well across languages, while Full FT tends to over-specialize the model to a single language. The numerical results for each language pair can be found in the appendix.
  • Figure 4: Analyzing linguistics in altruistic languages. A significant overlap in tokenized results with English may enhance performance in Indo-European languages.
  • Figure 5: Hyper-parameter setting. "Threshold" is the cutoff for a significant change between before and after tuning, where the change is calculated by dividing the performance after tuning by the performance before tuning. "# Reciprocal" denotes the number of languages in the Reciprocal quadrant. The experimental results show that a substantial increase in the threshold value could lead to all languages being classified into the Stagnant quadrant.
  • ...and 1 more figure
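
The captions of Figures 1 and 5 suggest a simple decision rule: compare the after/before performance ratio on each axis against a threshold. The following is a minimal sketch of that rule, not the authors' implementation. The mapping of quadrant names to axes is an assumption inferred from the names (Reciprocal: both improve; Altruistic: only multilingual improves; Selfish: only bilingual improves; Stagnant: neither). The `Scores` class, the `classify` and `count_reciprocal` helpers, the language keys, and all numbers are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical scores for one en->x model, before and after tuning."""
    bilingual_before: float
    bilingual_after: float
    multilingual_before: float
    multilingual_after: float


def classify(s: Scores, threshold: float) -> str:
    """Assign a quadrant from the two after/before ratios (assumed rule)."""
    if s.bilingual_before <= 0 or s.multilingual_before <= 0:
        raise ValueError("scores before tuning must be positive")
    # Ratio of performance after tuning to performance before tuning,
    # following the description in the Figure 5 caption.
    bilingual_improved = s.bilingual_after / s.bilingual_before > threshold
    multilingual_improved = s.multilingual_after / s.multilingual_before > threshold
    if bilingual_improved and multilingual_improved:
        return "Reciprocal"
    if multilingual_improved:
        return "Altruistic"
    if bilingual_improved:
        return "Selfish"
    return "Stagnant"


def count_reciprocal(scores: dict, threshold: float) -> int:
    """Count languages in the Reciprocal quadrant ("# Reciprocal")."""
    return sum(classify(s, threshold) == "Reciprocal" for s in scores.values())


# Made-up numbers for illustration only; they are not taken from the paper.
toy_scores = {
    "lang_a": Scores(10.0, 20.0, 5.0, 8.0),
    "lang_b": Scores(10.0, 10.5, 5.0, 7.0),
    "lang_c": Scores(10.0, 18.0, 5.0, 5.1),
    "lang_d": Scores(10.0, 10.2, 5.0, 5.1),
}

if __name__ == "__main__":
    for threshold in (1.2, 5.0):
        quadrants = {name: classify(s, threshold) for name, s in toy_scores.items()}
        print(threshold, quadrants, count_reciprocal(toy_scores, threshold))
```

With these toy numbers, a moderate threshold separates the four languages into four different quadrants, while a very large threshold places every language in Stagnant, which matches the behaviour the Figure 5 caption describes.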