Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning

Janghoon Han; Changho Lee; Joongbo Shin; Stanley Jungkyu Choi; Honglak Lee; Kynghoon Bae

TL;DR

This work investigates cross-lingual zero-shot generalization in instruction tuning. It constructs KORANI, a Korean natural-instruction meta-dataset, and pairs it with the English P3 meta-dataset to study transfer in both directions using mT5 models. Cross-lingual instruction tuning with aligned cross-lingual templates improves performance in both Korean and English, often matching or surpassing monolingual tuning, which indicates that acquiring relevant data across languages matters more than linguistic congruence with the unseen tasks. The study also introduces cross-lingual template strategies and bilingual experiments, which suggest that task diversity and template alignment drive cross-task generalization across languages. These findings position cross-lingual instruction tuning as a viable alternative to monolingual approaches, especially for low-resource languages, and highlight the practical impact of data collection and template design in multilingual NLP. The work contributes both a resource (KORANI) and methodological guidance for future cross-lingual instruction-tuning research.

Abstract

Instruction tuning has emerged as a powerful technique, significantly boosting zero-shot performance on unseen tasks. While recent work has explored cross-lingual generalization by applying instruction tuning to multilingual models, previous studies have primarily focused on English, with limited exploration of non-English tasks. For an in-depth exploration of cross-lingual generalization in instruction tuning, we perform instruction tuning individually for two distinct language meta-datasets. Subsequently, we assess the performance on unseen tasks in a language different from the one used for training. To facilitate this investigation, we introduce a novel non-English meta-dataset named "KORANI" (Korean Natural Instruction), comprising 51 Korean benchmarks. Moreover, we design cross-lingual templates to mitigate discrepancies in the language and instruction format of the template between training and inference within the cross-lingual setting. Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean, outperforming the baseline by average scores of 20.7% and 13.6%, respectively. Remarkably, these enhancements are comparable to those achieved by monolingual instruction tuning and even surpass them in some tasks. The result underscores the significance of relevant data acquisition across languages over linguistic congruence with unseen tasks during instruction tuning.
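
The cross-lingual template idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `TEMPLATES` dictionary, the `render_cross_lingual` function, and both template strings are hypothetical and are not taken from KORANI or P3. The sketch only shows the general pattern of pairing a dataset example with a template drawn from either language, so that the template language seen in training need not match the language of the data.

```python
import random
from typing import Dict, List

# Hypothetical templates for a sentiment task; not taken from KORANI or P3.
TEMPLATES: Dict[str, List[str]] = {
    "en": ["Review: {text}\nIs this review positive or negative?"],
    "ko": ["리뷰: {text}\n이 리뷰는 긍정적인가요, 부정적인가요?"],
}


def render_cross_lingual(example: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Pair an example with a template drawn from either language (assumed scheme)."""
    # Pick the template language independently of the example's own language.
    lang = rng.choice(sorted(TEMPLATES))
    template = rng.choice(TEMPLATES[lang])
    return {"template_lang": lang, "prompt": template.format(text=example["text"])}
```

Passing a seeded `random.Random` keeps the pairing reproducible across runs, which matters when comparing training variants.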

Paper Structure (40 sections, 7 figures, 14 tables)

Introduction
Related Work
Instruction Tuning
Cross-lingual Task Generalization in Instruction Tuning
Measuring Cross-lingual Zero-shot Generalization
Dataset for Instruction Tuning
KORANI: KOReAn Natural Instructions
Benchmark Collection
Instruction Creation
Quality Control
English Instruction Tuning Benchmarks
Statistics of KORANI and P3
Addressing Templates Misalignment Challenges in Cross-Lingual Instruction Tuning Scenarios
Model
Experimental Setup
...and 25 more sections

Figures (7)

Figure 1: KORANI datasets and task taxonomy. Green datasets are NLG datasets. Yellow datasets are NLU datasets. We follow the task categorization from T0.
Figure 2: Comparison of model variants mT-En, mT-En-CT, and mT-En-CI on samples from Rotten Tomatoes, esNLI for P3 T0, and KLUE NLI for KORANI. The dashed line differentiates training and evaluation, while the solid line distinguishes monolingual and cross-lingual generalization. mT-En-CT pairs English datasets with either English or Korean templates during training, and mT-En-CI pairs Korean datasets with English templates during evaluation.
Figure 3: Performance of zero-shot and cross-lingual generalization. Scores are dataset averages for each task cluster. The first row shows KORANI unseen tasks, and the second row shows P3 unseen tasks. The Average chart averages the results of seven different tasks. Appendix \ref{sec:crosslingual_breakdown} breaks down the performance by dataset.
Figure 4: Bilingual instruction tuning performance in KORANI and P3. mT-Bi+-CT employs the CT training method for non-target language datasets only. Appendix \ref{sec:add-bi} covers additional experiments on the cross-lingual template, and Appendix \ref{sec:bilingual_breakdown} breaks down the performance by dataset.
Figure 5: Model performance vs. size. The random line represents the average score of a random choice from the options list for classification tasks, and the ROUGE-L score of a copy of the input for generation tasks. Appendix \ref{sec:scale_up_breakdown} breaks down the performance by dataset.
...and 2 more figures
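
The random baseline described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the whitespace tokenization, and the choice of the F-measure variant of ROUGE-L are all assumptions. For classification, the expected accuracy of a uniform random choice is the mean of 1 / (number of options) over examples. For generation, the baseline scores a copy of the input against the reference using the longest common subsequence (LCS).

```python
from typing import List, Sequence


def random_choice_accuracy(option_counts: Sequence[int]) -> float:
    """Expected accuracy of picking uniformly at random from each example's options."""
    return sum(1.0 / n for n in option_counts) / len(option_counts)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based ROUGE-L F-measure on whitespace tokens (assumed variant)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

To obtain the copy-of-input baseline for a generation example, call `rouge_l_f1(input_text, reference_text)` and average the result over the dataset.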
