Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

Tannon Kew, Florian Schottmann, Rico Sennrich

TL;DR

This work investigates the minimal amount of multilinguality required during finetuning to elicit cross-lingual generalisation in English-centric LLMs and finds that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation.

Abstract

The vast majority of today's large language models (LLMs) are English-centric, having been pretrained predominantly on English text. Yet, in order to meet user expectations, models need to be able to respond appropriately in multiple languages once deployed in downstream applications. This requires strong cross-lingual transfer abilities. In this work, we investigate the minimal amount of multilinguality required during finetuning to elicit cross-lingual generalisation in English-centric LLMs. In experiments across four LLMs, we find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation, with the limiting factor being the degree to which a target language is seen during pretraining. Evaluations on five different tasks further reveal that multilingual instruction tuning is most beneficial for generative tasks that assume input/output language agreement, such as in chat settings, while being of less importance for highly structured classification-style tasks. Our code and data are available at https://github.com/ZurichNLP/multilingual-instruction-tuning.

Paper Structure (41 sections, 20 figures, 7 tables)

This paper contains 41 sections, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Input/output (IO) language agreement for English (en), German (de), Bulgarian (bg), and Icelandic (is) given English-only instruction tuning (Mono) or multilingual instruction tuning (Multi-Guanaco). Striped bars indicate that the target language is not seen during finetuning (i.e. the 0-shot setting). Error bars show a confidence interval of 95%.
  • Figure 2: Average helpfulness of single-turn dialogue responses from Llama 2 7b given incremental multilingual instruction tuning. Striped bars indicate a 0-shot setting and error bars show a confidence interval of 95%.
  • Figure 3: SARI weighted by IO language agreement for Llama 2 7b given incremental multilingual instruction tuning. Results are shown for both cross-lingual prompting (en:xx) and monolingual prompting (xx:xx). Striped bars indicate a 0-shot setting and error bars show a confidence interval of 95%.
  • Figure 4: XQuAD results for Llama 2 7b given incremental multilingual instruction tuning. Results are shown for both cross-lingual prompting (en:xx) and monolingual prompting (xx:xx). Striped bars indicate a 0-shot setting and error bars show a confidence interval of 95%.
  • Figure 5: X-CSQA results for Llama 2 7b given incremental multilingual instruction tuning. Striped bars indicate a 0-shot setting and error bars show a confidence interval of 95%.
  • ...and 15 more figures