How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Shaoxiong Ji, Pinzhen Chen

TL;DR

The paper investigates how the number of languages used in multilingual instruction tuning (mIT) affects downstream performance, using BLOOM-7B1 fine-tuned on the 52-language Bactrian-X dataset. The authors incrementally add languages, control data size with parallel translations, and evaluate on XCOPA, XStoryCloze, and XWinograd, analyzing the roles of language count, language exposure, and language similarity through seven similarity measures and lang2vec features. The key findings are that expanding language coverage generally improves results, that test languages included in the mIT data often see accuracy gains, and that genetic similarity predicts transfer better than language count does, although the patterns depend on the benchmark and the language and show notable outliers. The study emphasizes the need for systematic, controlled comparisons across base models, data, and evaluation protocols to guide multilingual instruction tuning in practice. It also notes limitations relating to high-resource languages and calls for further work on regularization and broader coverage.

Abstract

Instruction tuning a large language model with multiple languages can prepare it for multilingual downstream tasks. Nonetheless, it is yet to be determined whether having a handful of languages is sufficient, or whether the benefits increase with the inclusion of more. By fine-tuning large multilingual models on 1 to 52 languages, we present a case study on BLOOM to understand three pertinent factors affecting performance: the number of languages, language exposure, and similarity between training and test languages. Overall we found that 1) expanding language coverage in multilingual instruction tuning proves to be beneficial; 2) accuracy is often significantly boosted if the test language appears in the instruction mixture; 3) languages' genetic features correlate with cross-lingual transfer more than merely the number of languages, but different languages benefit to various degrees.
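
The setup of fine-tuning on 1 to 52 languages can be pictured as a series of nested training mixtures, each extending the previous one by a single language. The following is a minimal sketch of that idea only, not the authors' code: the function names, the toy data, and the language order are all hypothetical, and it assumes a parallel dataset in which every language holds translations of the same examples. Whether the total data size grows or is held fixed as languages are added is not specified here; the sketch simply concatenates.

```python
from typing import Dict, List


def cumulative_language_mixtures(languages: List[str]) -> List[List[str]]:
    """Return nested language sets: the first 1, 2, ..., len(languages) languages."""
    return [languages[: k + 1] for k in range(len(languages))]


def build_mixture(parallel_data: Dict[str, List[str]], languages: List[str]) -> List[str]:
    """Concatenate the instruction examples of the selected languages."""
    mixture: List[str] = []
    for lang in languages:
        mixture.extend(parallel_data[lang])
    return mixture


# Toy stand-in for a parallel instruction dataset (hypothetical contents):
# every language has the same number of examples because each example is
# assumed to be a translation of the same source instruction.
toy_data = {
    "en": ["en-0", "en-1"],
    "fr": ["fr-0", "fr-1"],
    "zh": ["zh-0", "zh-1"],
}
order = ["en", "fr", "zh"]  # hypothetical order, not taken from the paper

for langs in cumulative_language_mixtures(order):
    # One fine-tuning run would be launched per mixture.
    print(len(langs), langs, len(build_mixture(toy_data, langs)))
```

Each mixture would correspond to one fine-tuned model, so comparing accuracy across mixtures shows the effect of adding one more language at a time.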

Paper Structure

This paper contains 22 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Test performance across all languages; x-axis: number of languages in mIT; y-axis: average accuracy.
  • Figure 2: Accuracy for English and Chinese on XStoryCloze, XWinograd, and XCOPA.
  • Figure 3: Accuracy for Quechuan, unseen both by the base model and during IT.
  • Figure 4: Accuracy on XCOPA for various languages, unseen by the base model but seen during IT. $\bigstar$ marks the point at which the test language is first included in the mIT data. In most cases (et, it, th), performance benefits from the test language appearing in the mIT data, although there are outliers (tr).
  • Figure 5: Accuracy for Haitian on XCOPA and Basque on XStoryCloze, seen by the base model but unseen during IT.
  • ...and 1 more figure