Multilingual Instruction Tuning With Just a Pinch of Multilinguality

Uri Shaham; Jonathan Herzig; Roee Aharoni; Idan Szpektor; Reut Tsarfaty; Matan Eyal

TL;DR

This work addresses the challenge of enabling instruction-following across many languages with multilingual LLMs. It demonstrates that instruction tuning in a single language yields cross-language transfer, and that introducing a surprisingly small amount of multilingual data (as few as ~40 multilingual examples) substantially boosts multilingual instruction-following, including for languages not seen during tuning. Increasing the number of languages in the tuning set further improves cross-lingual generalization, with benefits saturating after a few languages and even bilingual tuning aiding transfer beyond the trained pair. The study also examines potential predictors of transfer, finding no strong link to language similarity or pre-training data share, suggesting a robust, data-efficient path to multilingual instruction-following with minimal multilingual supervision. Collectively, the results offer practical guidelines for building scalable multilingual instruction-tuned LLMs that preserve English performance while generalizing to new languages.

Abstract

As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages from the pre-training corpus. We first show that, even with monolingual tuning, many languages transfer some instruction-following capabilities to other languages. Furthermore, we find that only 40 multilingual examples integrated in an English tuning set substantially improve multilingual instruction-following, in languages both seen and unseen during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-responses.
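The mixtures referred to in the abstract (and in the captions of Figures 3 and 4, where $P\%$ of the tuning set is spread uniformly across a set of languages and the rest is English) could be assembled roughly as in the sketch below. This is a hedged illustration, not the authors' code: the function name `build_mixture`, the `pools` layout, the language codes, and the tuning-set size of 1000 are all assumptions made for the example.

```python
import random


def build_mixture(pools, total, p, english="en", seed=0):
    """Sample a tuning set where p% is spread uniformly over all languages
    in `pools` and the remaining (100 - p)% is English only.

    pools: dict mapping a language code to a list of examples (hypothetical layout).
    """
    rng = random.Random(seed)
    langs = sorted(pools)
    if english not in pools:
        raise ValueError("pools must contain the English pool")

    # Number of examples in the uniformly distributed multilingual share.
    n_multi = round(total * p / 100)
    base, extra = divmod(n_multi, len(langs))

    # Spread any remainder over randomly chosen languages.
    order = langs[:]
    rng.shuffle(order)
    counts = {lang: base + (1 if i < extra else 0) for i, lang in enumerate(order)}

    # The rest of the tuning set is English only.
    counts[english] += total - n_multi

    mixture = []
    for lang, n in counts.items():
        mixture.extend((lang, ex) for ex in rng.sample(pools[lang], n))
    rng.shuffle(mixture)
    return mixture


# Illustrative usage with placeholder data (not the paper's dataset).
pools = {lang: [f"{lang}-{i}" for i in range(1000)] for lang in ["en", "es", "fi", "he"]}
mixture = build_mixture(pools, total=1000, p=4)
```

With `p=4` and `total=1000`, the multilingual share is 40 examples, which matches the order of magnitude the abstract describes. Whether English belongs to the uniformly distributed share is decided here simply by which languages appear in `pools`.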

Paper Structure (41 sections, 1 equation, 10 figures, 4 tables)

Introduction
Measuring Multilingual Instruction-Following
Data
Evaluation
Instruction-Following Score Per Language
Model
Human Validation
How Much Multilinguality Is Needed For Multilingual Instruction Tuning?
Monolingual Instruction Tuning Yields Multilingual Abilities
Setup
Results
A Few Dozen Examples Improve Multilingual Instruction-following
Setup
Results
A Few Dozen Examples Improve Cross-lingual Generalization
...and 26 more sections

Figures (10)

Figure 1: Per-language instruction-following scores of models instruction-tuned on monolingual data. Each row represents a model tuned using a different language, and each column is an individual heatmap of the scores of all models on the same evaluation language. Scores are the discounted-ties weighted average of the side-by-side scores against the model tuned on the evaluation language. The scores along the diagonal are 50, as they result from comparing generations to themselves, and are excluded from the heatmap coloring.
Figure 2: Distributions of human annotators' ratings of model responses across languages. Each row describes the evaluation, in the corresponding language, of the model tuned monolingually on that language. Numbers in the first row are those reported by Zhou et al. (2023).
Figure 3: Instruction-following scores of models tuned when $P\%$ of the training set is distributed uniformly across 12 languages and the remaining $(100-P)\%$ is English only. Each x-axis tick represents a tuning mixture. Scores for individual non-English languages are in blue, their averages are in red, and English scores are in orange.
Figure 4: Instruction-following scores of models tuned when $P\%$ of the training set is distributed uniformly across 6 languages and the remaining $(100-P)\%$ is English only. Each x-axis tick represents one such tuning set. Scores for individual non-English languages are in blue and English scores are in orange. Average scores of the 5 non-English languages in the tuning set are in red, and average scores of the 6 languages not seen during tuning are in green.
Figure 5: Instruction-following scores in Czech, Estonian, Hebrew, Hindi, Spanish, and Chinese of models instruction-tuned using various subsets of Arabic, English, Finnish, Italian, Russian, and Swahili. Blue markers are the average scores per evaluation language across models tuned with the same number of languages. The averages of those individual language scores are in green.
...and 5 more figures
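
The Figure 1 caption describes scores as a discounted-ties weighted average of side-by-side comparisons, with a value of 50 when a model is compared against itself. A minimal sketch of such a score follows. It assumes that a tie is counted as half a win and that judgments arrive as the strings "win", "tie", and "loss"; the tie weight of 0.5, the label names, and the function name are assumptions chosen to be consistent with the caption, not the paper's stated definition.

```python
from collections import Counter


def discounted_ties_score(judgments, tie_weight=0.5):
    """Return a 0-100 score from side-by-side judgments.

    judgments: iterable of "win", "tie", or "loss" labels, each describing
    how the evaluated model fared against the reference model on one prompt.
    tie_weight: credit given to a tie (assumed to be 0.5 here).
    """
    counts = Counter(judgments)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no judgments given")
    return 100.0 * (counts["win"] + tie_weight * counts["tie"]) / n


# A model compared against itself produces only ties, giving the diagonal value.
self_comparison = discounted_ties_score(["tie"] * 10)
mixed = discounted_ties_score(["win", "win", "tie", "loss"])
```

Under these assumptions, an all-ties comparison yields 50, which agrees with the diagonal described in the caption, and scores above 50 indicate that the evaluated model is preferred over the reference more often than not.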
