Multilingual Instruction Tuning With Just a Pinch of Multilinguality
Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, Matan Eyal
TL;DR
This work addresses the challenge of enabling instruction-following across many languages with multilingual LLMs. It demonstrates that instruction tuning in a single language yields cross-language transfer, and that introducing a surprisingly small amount of multilingual data (as few as 40 multilingual examples) substantially boosts multilingual instruction-following, including for languages not seen during tuning. Increasing the number of languages in the tuning set further improves cross-lingual generalization, with benefits saturating after a few languages; even bilingual tuning aids transfer beyond the trained pair. The study also examines potential predictors of transfer and finds no strong link to language similarity or pre-training data share. Together, the results suggest a data-efficient path to multilingual instruction-following with minimal multilingual supervision, and offer practical guidelines for building multilingual instruction-tuned LLMs that preserve English performance while generalizing to new languages.
Abstract
As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages from the pre-training corpus. We first show that even monolingual tuning in many languages transfers some instruction-following capabilities to other languages. Furthermore, we find that only 40 multilingual examples integrated in an English tuning set substantially improve multilingual instruction-following, in languages both seen and unseen during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-response pairs.
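To make the idea of integrating a small number of multilingual examples into an English tuning set concrete, here is a minimal sketch of one way such a mixture could be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_mixture`, the example schema (`lang`, `instruction`, `response`), the choice to keep the total set size fixed by replacing English examples, and the even split across languages are all hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical example schema: {"lang": ..., "instruction": ..., "response": ...}
Example = Dict[str, str]


def build_mixture(
    english: List[Example],
    multilingual: Dict[str, List[Example]],
    n_multilingual: int = 40,
    seed: int = 0,
) -> List[Example]:
    """Replace n_multilingual English examples with non-English ones.

    Assumptions (not taken from the paper): the total size stays equal to
    len(english), and the non-English budget is split as evenly as possible
    across the languages in `multilingual`.
    """
    if n_multilingual > len(english):
        raise ValueError("n_multilingual exceeds the size of the English set")
    if n_multilingual > 0 and not multilingual:
        raise ValueError("no non-English languages provided")

    rng = random.Random(seed)
    langs = sorted(multilingual)  # fixed order so the split is reproducible
    picked: List[Example] = []
    if n_multilingual > 0:
        base, extra = divmod(n_multilingual, len(langs))
        for i, lang in enumerate(langs):
            # The first `extra` languages receive one additional example.
            quota = base + (1 if i < extra else 0)
            if quota > len(multilingual[lang]):
                raise ValueError(f"not enough examples for language {lang!r}")
            picked.extend(rng.sample(multilingual[lang], quota))

    # Keep a random subset of English examples so the total size is unchanged.
    kept = rng.sample(english, len(english) - n_multilingual)
    mixture = kept + picked
    rng.shuffle(mixture)
    return mixture
```

With, say, 1,000 English examples and four other languages, `build_mixture(english, multilingual, 40)` would return 1,000 examples, of which 960 are English and 10 come from each of the four languages. Whether to replace English examples or append to them is a design choice; replacement keeps the tuning budget constant, so any change in behavior can be attributed to the language mix rather than to extra data.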
