Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Pinzhen Chen; Shaoxiong Ji; Nikolay Bogoychev; Andrey Kutuzov; Barry Haddow; Kenneth Heafield

TL;DR

This work investigates how to extend instruction tuning of large language models to multiple languages under a fixed computational budget. It compares monolingual and multilingual instruction tuning under two tuning paradigms, LoRA and full-parameter fine-tuning, using the Alpaca dataset and machine translations of it. The key findings are that multilingual tuning, particularly with LoRA, is often on par with or better than language-specific tuning, and that multilingual tuning on downsampled data remains robust on unseen languages. The results offer practical guidance for expanding language support in open-source LLM ecosystems and highlight the trade-offs across tuning strategies and model sizes.

Abstract

Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which is then used to tune LLMs through either low-rank adaptation or full-parameter training. Under a controlled computation budget, comparisons show that multilingual tuning is on par with or better than tuning a model for each language. Furthermore, multilingual tuning with downsampled data can be as powerful and more robust. Our findings serve as a guide for expanding language support through instruction tuning.
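As a rough illustration of what "downsampled" multilingual data could look like under a fixed budget, the sketch below draws a fixed total number of examples spread evenly across languages, so the mixed set is no larger than a single monolingual set. The function name `downsample_multilingual`, the even per-language split, and the independent per-language sampling are assumptions made for illustration; the abstract does not specify the exact sampling procedure.

```python
import random


def downsample_multilingual(data_by_lang, budget, seed=0):
    """Draw `budget` examples in total, split as evenly as possible across languages.

    `data_by_lang` maps a language code to a list of instruction examples.
    This is a hypothetical helper, not the paper's released code.
    """
    rng = random.Random(seed)
    langs = sorted(data_by_lang)  # fixed order keeps the split reproducible
    base, extra = divmod(budget, len(langs))
    mixed = []
    for i, lang in enumerate(langs):
        # The first `extra` languages receive one additional example.
        quota = base + (1 if i < extra else 0)
        examples = data_by_lang[lang]
        if quota > len(examples):
            raise ValueError(f"not enough examples for language {lang!r}")
        # Sample without replacement within each language.
        mixed.extend((lang, ex) for ex in rng.sample(examples, quota))
    rng.shuffle(mixed)  # interleave languages before training
    return mixed


if __name__ == "__main__":
    toy = {
        "en": [f"en-{i}" for i in range(10)],
        "es": [f"es-{i}" for i in range(10)],
        "zh": [f"zh-{i}" for i in range(10)],
    }
    sample = downsample_multilingual(toy, budget=10)
    print(len(sample), sorted({lang for lang, _ in sample}))
```

Holding `budget` equal to the size of one monolingual dataset keeps the number of training examples, and hence the tuning cost, comparable between the monolingual and the downsampled multilingual settings.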

Paper Structure (26 sections, 8 figures, 3 tables)

Introduction
Methodology
Instruction data
Budget-controlled instruction tuning
Evaluation setup
Test data
LLM-as-a-judge
Language (in)consistency
Human-LLM agreement
Performance and Discussions
Model sizes
Budget-efficient tuning
Unseen languages
Language robustness
Model families
...and 11 more sections

Figures (8)

Figure 1: LoRA with BLOOM at different sizes. Caption: language; y-axis: score; x-axis: model size (B).
Figure 2: FFT with BLOOM at different sizes. Caption: language; y-axis: score; x-axis: model size (B). Same legend as Figure 1.
Figure 3: LoRA and FFT with BLOOM at different sizes and tested on unseen languages. Caption: training method and language; y-axis: score; x-axis: model size (B).
Figure 4: Evaluation score change before and after language identification, averaged over six seen test languages, at different LLM sizes. Caption: training method and base model; y-axis: score difference (log scale); x-axis: model size (B).
Figure 5: LoRA fine-tuning on different 7B LLMs. Caption: language generated; y-axis: score; x-axis: model family.
...and 3 more figures
