
LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Reasoning

Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang

TL;DR

This work tackles the persistent reasoning gap for low-resource languages in multilingual LLMs, which stems from data imbalance in pre-training and from benchmarks that under-represent those languages. It introduces LinguaLIFT, a two-stage instruction-tuning framework: a language alignment layer is first learned through code-switched tuning, then frozen while the LLM is tuned on English-only instruction data, so that English reasoning transfers to low-resource languages without multilingual instruction data or parallel data. A new benchmark, MMWP, covers 48 languages across resource levels to evaluate multilingual mathematical reasoning beyond existing tests dominated by high-resource languages. Experimental results show that LinguaLIFT outperforms strong baselines on MMWP and related benchmarks and generalizes across LLMs and tasks. Further analysis examines cross-lingual transfer, code-switching effects, and alignment visualization.

Abstract

Large language models (LLMs) have exhibited impressive multilingual reasoning capabilities, driven by extensive multilingual pre-training corpora and instruction fine-tuning data. However, a performance gap exists between high- and low-resource language reasoning tasks due to the language imbalance in the pre-training corpus, which is exacerbated by evaluation bias in existing reasoning benchmarks that lack low-resource language coverage. To alleviate this issue, we propose LinguaLIFT, a two-stage instruction tuning framework for advancing low-resource language reasoning. LinguaLIFT employs a language alignment layer that captures multilingual alignment through code-switched tuning, without requiring multilingual instruction data or parallel data, and thereby transfers reasoning capabilities to low-resource languages using English-only instruction tuning data. To comprehensively evaluate multilingual reasoning capabilities, we introduce the Multilingual Math World Problem (MMWP) benchmark, which spans 21 low-resource, 17 medium-resource, and 10 high-resource languages. Experimental results show that LinguaLIFT outperforms several competitive baselines across MMWP and four widely used benchmarks.
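
As a rough illustration of what "code-switched" input can mean, the sketch below swaps some English tokens for entries from a word-level bilingual lexicon. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `code_switch`, the `ratio` and `seed` parameters, and the toy English-to-Swahili lexicon are all hypothetical and are not taken from the paper.

```python
import random


def code_switch(tokens, lexicon, ratio=0.5, seed=0):
    """Return a copy of `tokens` in which tokens found in `lexicon`
    are replaced by their translation with probability `ratio`.

    Hypothetical helper for illustration only; not the paper's procedure.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    switched = []
    for tok in tokens:
        translation = lexicon.get(tok.lower())
        # Only tokens with a lexicon entry are candidates for switching.
        if translation is not None and rng.random() < ratio:
            switched.append(translation)
        else:
            switched.append(tok)
    return switched


# Toy lexicon with illustrative entries (an assumption, not real training data).
toy_lexicon = {"has": "ana", "three": "tatu", "apples": "matofaa"}

sentence = "Mary has three apples".split()
print(code_switch(sentence, toy_lexicon, ratio=1.0))
```

With `ratio=1.0` every token that has a lexicon entry is replaced, and with `ratio=0.0` the sentence is returned unchanged; intermediate values give mixed-language inputs of the kind a code-switched tuning stage could consume.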

Paper Structure

This paper contains 58 sections, 3 equations, 15 figures, 25 tables.

Figures (15)

  • Figure 1: Examples from the MGSM dataset (Shi et al., 2023), where the mathematical problems share the same meaning across languages, but LLMs generate different answers. The red text marks the erroneous reasoning in the responses. Translations of the responses are provided in dashed boxes.
  • Figure 2: Overview of the proposed LinguaLIFT approach. Stage-I (Language Align): A language alignment layer is introduced into the LLM to adapt the pre-trained multilingual encoder, thereby enhancing multilingual alignment through code-switched tuning. Stage-II (Task Transfer): The LLM is fine-tuned on high-quality, English-only instruction data with the language alignment layer frozen, allowing the LLM to transfer reasoning capabilities learned from English to low-resource languages.
  • Figure 3: Ablation study of two-stage training on MGSM, showing average accuracy for low-resource (LR.), high-resource (HR.), and all languages (Avg.).
  • Figure 4: Ablation of trainable modules on MGSM. Abbreviations are the same as in Figure 3.
  • Figure 5: Accuracy (%) of LinguaLIFT across encoder scales and types on the low-resource languages of MGSM.
  • ...and 10 more figures