Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

Gürkan Soykan, Gözde Gül Şahin

TL;DR

The paper proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks, and shows that this careful selection generally leads to better outcomes than choosing languages at random.

Abstract

Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the "curse of multilinguality", i.e., the per-language performance drop that occurs after many languages are added. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to select languages for instruction tuning in a linguistically informed way, aiming to boost model performance across languages and tasks. We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions. Our results show that this careful selection generally leads to better outcomes than choosing languages at random. We suggest a new and simple way of enhancing multilingual models by selecting diverse languages based on linguistic features, which could help develop better multilingual systems and guide dataset creation efforts. All resources, including the code for language selection and multilingual instruction tuning, are made available in our official repository at https://github.com/GGLAB-KU/ling-informed-mit, enabling reproducibility and further research in this area.
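The abstract mentions "a simple algorithm to choose diverse languages" based on linguistic features but does not describe it here. As an illustration only, the sketch below shows one generic way such a selection could work: greedy farthest-point selection over per-language feature vectors, using cosine distance. This is an assumption and not necessarily the authors' algorithm (see their repository for that). The function names, the placeholder language labels, and the toy feature vectors are all hypothetical and carry no real typological data.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; treat a zero vector as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_diverse_languages(features, k, seed_language):
    """Greedily pick k languages so each new pick is far from those already chosen.

    features: dict mapping a language label to its feature vector (hypothetical input).
    seed_language: the language to start from (an assumption of this sketch).
    """
    if seed_language not in features:
        raise KeyError(f"unknown seed language: {seed_language}")
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of languages")

    selected = [seed_language]
    while len(selected) < k:
        # Candidates are visited in sorted order so ties resolve deterministically.
        candidates = [lang for lang in sorted(features) if lang not in selected]
        # Score each candidate by its distance to the nearest selected language,
        # then take the candidate whose nearest neighbour is farthest away.
        best = max(
            candidates,
            key=lambda lang: min(
                cosine_distance(features[lang], features[s]) for s in selected
            ),
        )
        selected.append(best)
    return selected


# Made-up vectors for placeholder languages; not real linguistic features.
TOY_FEATURES = {
    "lang_a": [1.0, 0.0, 0.0],
    "lang_b": [0.9, 0.1, 0.0],
    "lang_c": [0.0, 1.0, 0.0],
    "lang_d": [0.0, 0.0, 1.0],
    "lang_e": [0.0, 0.8, 0.2],
}

subset = select_diverse_languages(TOY_FEATURES, k=3, seed_language="lang_a")
print(subset)
```

On the toy data, the near-duplicates of already chosen languages (`lang_b`, `lang_e`) are skipped in favour of languages pointing in different directions of the feature space. A random baseline, as contrasted in the abstract, would instead sample `k` labels uniformly without looking at the vectors.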

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: Average performance of models instruction-tuned with various language subsets on natural language understanding and commonsense reasoning tasks, based on eight languages not present in any language subset (unseen).
  • Figure 2: Comparison of average zero-shot performance across model sizes, demonstrating the scaling law and performance trends by language subset selection.
  • Figure 3: Average performance of BLOOM-7B and mGPT models with confidence intervals trained with varying numbers of languages, based on a geographical feature vector using our language selection algorithm, across natural language understanding and commonsense reasoning tasks.
  • Figure 4: Effect of varying number of languages on different benchmarks for the mGPT model.
  • Figure 5: Effect of varying number of languages on different benchmarks for the BLOOM 7B model.
  • ...and 2 more figures