Zero-shot cross-lingual transfer in instruction tuning of large language models

Nadezhda Chirkova; Vassilina Nikoulina

TL;DR

This work investigates zero-shot cross-lingual transfer in instruction tuning (IT) by training LLMs on English instruction data and testing them on prompts in other languages without target-language adaptation. It introduces a multi-facet evaluation framework combining human judgments and GPT-3.5 scoring across languages and criteria, and systematically analyzes how base model choice, IT data size, and hyperparameters influence cross-lingual transfer. The key finding is that cross-lingual transfer is feasible for English-centric IT when hyperparameter tuning accounts for multilinguality and the IT data is large enough, though non-English responses show lower factuality and occasional fluency errors. The study provides practical guidance on learning rate (LR) tuning and data requirements to enable multilingual instruction following in open-source LLMs, with implications for scalable, cost-effective multilingual NLP deployment.

Abstract

Instruction tuning (IT) is widely used to teach pretrained large language models (LLMs) to follow arbitrary instructions, but is under-studied in multilingual settings. In this work, we conduct a systematic study of zero-shot cross-lingual transfer in IT, when an LLM is instruction-tuned on English-only data and then tested on user prompts in other languages. We advocate for the importance of evaluating various aspects of model responses in multilingual instruction following and investigate the influence of different model configuration choices. We find that cross-lingual transfer does happen successfully in IT even if all stages of model training are English-centric, but only if multilinguality is taken into account in hyperparameter tuning and with large enough IT data. English-trained LLMs are capable of generating correct-language, comprehensive and helpful responses in other languages, but suffer from low factuality and may occasionally have fluency errors.

Paper Structure (27 sections, 5 figures, 3 tables)

Introduction
Related work
Our evaluation methodology
Experimental setup
Experimental results and discussion
Main evaluation
Additional experiment with task modifiers
Preliminary study based on surface metrics
Conclusion
Limitations and broader impact
Acknowledgments
Extended related work
Zero-shot cross-lingual transfer
Multilingual instruction following
Role of base LLM
...and 12 more sections

Figures (5)

Figure 1: Zero-shot cross-lingual transfer in instruction tuning: an LLM is instruction-tuned on English-only data and then tested on user prompts in other languages. Our study focuses on analyzing various aspects of generated outputs and model configuration choices.
Figure 2: Results of human evaluation (left) and evaluation with GPT-3.5 (right). All scores from 0 to 2, heatmap colors visualize written scores. Base models: LLaMA-2-7B/13B (English-centric) or Tower-7B (10 languages). Datasets: Dolly (15k) or LIMA (1k). Instruction tuning data strategies: En (English-only data) or DT (multilingual IT data obtained using data translation). Adaptation strategy: FT (full finetuning) or LoRA (low-rank adaptation).
Figure 3: Left: Results of evaluating surface features of the responses. Ticks denote the chosen LR for each configuration. Base models: LLaMA-2-7B/13B (English-centric) or Tower-7B (10 languages). Datasets: Dolly (15k) or LIMA (1k). Data strategies: En (English-only data) or DT (multilingual data obtained using data translation). Adaptation strategy: FT (full finetuning) or LoRA (low-rank adaptation). Right: Human-evaluated helpfulness of the default model broken down by task category.
Figure 4: Per-language results of human evaluation (left columns) and evaluation with GPT-3.5 (right column). All scores from 0 to 2. Heatmap colors visualize written scores.
Figure 5: Agreement statistics between human evaluation and GPT-3.5 evaluation. Each value at heatmap coordinates (X, Y) represents the percentage of responses that were given rating X by GPT-3.5 and rating Y by the human evaluator.
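
The agreement statistic described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes integer ratings on the 0 to 2 scale mentioned in the captions; the function names and the sample ratings are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter


def agreement_matrix(gpt_ratings, human_ratings, scale=(0, 1, 2)):
    """Map each (x, y) pair to the percentage of responses rated
    x by GPT-3.5 and y by the human evaluator."""
    if len(gpt_ratings) != len(human_ratings):
        raise ValueError("rating lists must have the same length")
    if not gpt_ratings:
        raise ValueError("at least one rated response is required")
    counts = Counter(zip(gpt_ratings, human_ratings))
    total = len(gpt_ratings)
    return {(x, y): 100.0 * counts[(x, y)] / total
            for x in scale for y in scale}


def exact_agreement(matrix):
    """Percentage of responses where both raters gave the same score
    (the diagonal of the heatmap)."""
    return sum(pct for (x, y), pct in matrix.items() if x == y)


# Hypothetical ratings for eight responses (illustration only).
gpt = [2, 2, 1, 0, 2, 1, 1, 2]
human = [2, 1, 1, 0, 2, 2, 1, 2]

matrix = agreement_matrix(gpt, human)
print(matrix[(2, 2)])           # share rated 2 by both raters
print(exact_agreement(matrix))  # share where the two ratings coincide
```

Cells away from the diagonal show the direction of disagreement: mass at (2, 1), for example, would mean GPT-3.5 rated those responses higher than the human evaluator did.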
