PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning

Zhihan Zhang; Dong-Ho Lee; Yuwei Fang; Wenhao Yu; Mengzhao Jia; Meng Jiang; Francesco Barbieri

TL;DR

This work tackles the challenge of instruction tuning for low-resource languages by introducing PLUG, a pivot-language guided generation approach that uses a high-resource pivot language to structure responses in the target language. By training models to first formulate the instruction and response in the pivot language before producing the final output, PLUG leverages stronger proficiency in the pivot language to improve instruction-following in the target language. The authors also present X-AlpacaEval, a multilingual benchmark for open-ended instructions, and report that PLUG improves instruction-following by 29% on average over responding directly in the target language across Chinese, Korean, Italian, and Spanish, with notable gains in truthfulness and reasoning. They further show that pivot languages other than English can be used, that PLUG learns data-efficiently, and that it compares favorably to translation-based baselines, and they outline limitations and ethical considerations for broader deployment.

Abstract

Instruction tuning has remarkably advanced large language models (LLMs) in understanding and responding to diverse human instructions. Despite the success in high-resource languages, its application in lower-resource ones faces challenges due to the imbalanced foundational abilities of LLMs across different languages, stemming from the uneven language distribution in their pre-training data. To tackle this issue, we propose pivot language guided generation (PLUG), an approach that utilizes a high-resource language, primarily English, as the pivot to enhance instruction tuning in lower-resource languages. It trains the model to first process instructions in the pivot language, and then produce responses in the target language. To evaluate our approach, we introduce a benchmark, X-AlpacaEval, of instructions in 4 languages (Chinese, Korean, Italian, and Spanish), each annotated by professional translators. Our approach demonstrates a significant improvement in the instruction-following abilities of LLMs by 29% on average, compared to directly responding in the target language alone. Further experiments validate the versatility of our approach by employing alternative pivot languages beyond English to assist languages where LLMs exhibit lower proficiency. Our code and data are available at https://github.com/ytyz1307zzh/PLUG.
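The training format described above can be illustrated with a minimal sketch. It assumes each training example is available in both the pivot and the target language. The `ParallelExample` schema, the function names, and the `"English instruction:"`-style field labels are hypothetical choices made for illustration; the authors' actual templates are in their repository and may differ.

```python
from dataclasses import dataclass


@dataclass
class ParallelExample:
    """One instruction/response pair in both languages (hypothetical schema)."""
    target_instruction: str
    target_response: str
    pivot_instruction: str
    pivot_response: str


def build_monolingual_pair(ex: ParallelExample) -> tuple[str, str]:
    """Baseline: the model maps the target-language instruction
    directly to the target-language response."""
    return ex.target_instruction, ex.target_response


def build_plug_pair(
    ex: ParallelExample,
    pivot_name: str = "English",
    target_name: str = "Chinese",
) -> tuple[str, str]:
    """PLUG-style pair: the input is still the target-language instruction,
    but the training output first restates the instruction and drafts the
    response in the pivot language, then gives the target-language response.
    The field labels below are illustrative assumptions."""
    output = (
        f"{pivot_name} instruction: {ex.pivot_instruction}\n"
        f"{pivot_name} response: {ex.pivot_response}\n"
        f"{target_name} response: {ex.target_response}"
    )
    return ex.target_instruction, output


def extract_target_response(generation: str, target_name: str = "Chinese") -> str:
    """Keep only the text after the last target-language label.
    If the label is missing, the whole generation is returned."""
    marker = f"{target_name} response:"
    _, _, tail = generation.rpartition(marker)
    return tail.strip()
```

At inference time, under this sketch, only the text returned by `extract_target_response` would be shown to the user; the pivot-language portion serves as an intermediate draft.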

Paper Structure (35 sections, 1 equation, 11 figures, 16 tables)

Introduction
Related Work
Instruction Tuning
Multilingual LLMs
Pivot Language Guided Generation
Evaluation Settings
Benchmarks
X-AlpacaEval
Truthfulness & Reasoning Benchmarks
Model Settings
Methods to Compare
Results
Open-Ended Instructions
Study of Pivot Languages
Ablation Study
...and 20 more sections

Figures (11)

Figure 1: When humans struggle to learn a second language, they tend to comprehend the instruction and draft a response in their native language, before finally responding in the target language. With a similar philosophy, we train LLMs to utilize a high-resource language as the pivot language when responding to instructions in the target language.
Figure 2: The comparison between monolingual response training (top) and PLUG training (bottom). In this example, Chinese is the target language and English is the pivot. The monolingual response does not follow the review-writing instruction, while PLUG successfully generates a vivid and natural user review.
Figure 3: PLUG vs. monolingual response training on LLaMA-2: win-loss differential with different amounts of training data, on 200 instructions randomly sampled from X-AlpacaEval. The stars mark the comparisons in which both PLUG and the baseline use all 96k training examples.
Figure 4: TruthfulQA and SVAMP experiments on LLaMA-2. TruthfulQA scores are the percentage of generations that are both truthful and informative.
Figure 5: TruthfulQA experiments on PolyLM. TruthfulQA scores are the percentage of generations that are both truthful and informative.
...and 6 more figures
