How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM
Shaoxiong Ji, Pinzhen Chen
TL;DR
The paper investigates how the number of languages used in multilingual instruction tuning (mIT) affects downstream performance, using BLOOM-7B1 fine-tuned on a 52-language Bactrian-X dataset. The authors incrementally add languages, control data size with parallel translations, and evaluate on XCOPA, XStoryCloze, and XWinograd, analyzing the roles of language count, exposure, and similarity through seven similarity measures and lang2vec features. Key findings show that expanding language coverage generally improves outcomes, test languages included in IT often see accuracy gains, and genetic-based similarity better predicts transfer than language count, though patterns are benchmark- and language-dependent with notable outliers. The study emphasizes the need for systematic, controlled comparisons across base models, data, and evaluation protocols to guide multilingual IT in practice. It also highlights limitations to high-resource languages and calls for further work on regularization and broader coverage.
Abstract
Instruction tuning a large language model with multiple languages can prepare it for multilingual downstream tasks. Nonetheless, it is yet to be determined whether having a handful of languages is sufficient, or whether the benefits increase with the inclusion of more. By fine-tuning large multilingual models on 1 to 52 languages, we present a case study on BLOOM to understand three pertinent factors affecting performance: the number of languages, language exposure, and similarity between training and test languages. Overall, we found that 1) expanding language coverage in multilingual instruction tuning proves to be beneficial; 2) accuracy often improves significantly if the test language appears in the instruction mixture; 3) languages' genetic features correlate with cross-lingual transfer more strongly than the mere number of languages, but different languages benefit to varying degrees.
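
The incremental design described above (fine-tuning on 1, then 2, and eventually all 52 languages) can be illustrated with a minimal sketch. It assumes a multiway-parallel dataset in which every language holds translations of the same instructions, so successive mixtures differ in language coverage rather than in content. The helper names, the toy data, and the order in which languages are added are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def incremental_mixtures(languages: List[str]) -> List[List[str]]:
    """Return nested language sets: the first 1 language, then 2, up to all."""
    return [languages[:k] for k in range(1, len(languages) + 1)]


def build_mixture(parallel: Dict[str, List[str]], langs: List[str]) -> List[str]:
    """Concatenate the parallel examples of the selected languages."""
    return [example for lang in langs for example in parallel[lang]]


# Toy parallel data (hypothetical): the same 2 instructions in 3 languages.
parallel = {
    "en": ["en-0", "en-1"],
    "zh": ["zh-0", "zh-1"],
    "fi": ["fi-0", "fi-1"],
}

# The order in which languages are added is an assumption of this sketch.
order = ["en", "zh", "fi"]

mixtures = incremental_mixtures(order)
sizes = [len(build_mixture(parallel, m)) for m in mixtures]

for langs, size in zip(mixtures, sizes):
    print(f"{len(langs)} language(s): {langs} -> {size} training examples")
```

Each mixture would then be used to fine-tune a separate copy of the base model, and the resulting models would be compared on the same benchmarks. In this sketch the mixture size grows with the number of languages; other ways of budgeting data across languages are possible.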
