InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

Samuel Cahyawijaya, Holy Lovenia, Tiezheng Yu, Willy Chung, Pascale Fung

TL;DR

InstructAlign addresses two problems that arise when adapting instruction-tuned LLMs to underrepresented languages: limited language coverage and catastrophic forgetting. It combines crosslingual instruction-based alignment objectives (TLM, MT, XSS) with continual instruction tuning via experience replay, so the model learns low-resource languages without degrading its existing multitask abilities. Empirical results on Indonesian local languages show 5–10% gains in weighted F1 on L2 while preserving L1 performance. Larger models benefit more from the approach, and transfer to related L3 languages is also demonstrated (Pearson ≈ 0.96). The work advances language adaptation for instruction-tuned LLMs, offering a practical path toward broader, more inclusive multilingual NLP systems and forward transfer to unseen related languages.

Abstract

Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages. However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data. Additionally, directly adapting instruction-tuned LLMs to new languages can result in catastrophic forgetting, which leads to the loss of multitasking ability. To address this issue, we propose InstructAlign, which uses continual crosslingual instruction tuning to enable LLMs to align new, unseen languages with previously learned high-resource languages. Our results demonstrate the effectiveness of InstructAlign in enabling the model to understand low-resource languages with limited parallel data while preventing catastrophic forgetting. Our work contributes to the advancement of language adaptation methods, particularly for adapting instruction-tuned LLMs to underrepresented languages. Our code is released at https://github.com/HLTCHKUST/InstructAlign.
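To make the two ingredients named above concrete, the following is a minimal sketch of how alignment-style instruction examples (TLM, MT, XSS) might be built from a parallel sentence pair and then mixed with replayed examples from earlier instruction data. The prompt wording, the `<mask>` token, the masking rate, the `replay_ratio` parameter, and all function names are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # {"input": prompt text, "target": expected output}


def mt_example(src: str, tgt: str, src_lang: str, tgt_lang: str) -> Example:
    """Machine translation (MT): translate the source sentence into the target language."""
    prompt = (
        f"Translate the following text from {src_lang} to {tgt_lang}.\n"
        f"Text: {src}\nTranslation:"
    )
    return {"input": prompt, "target": tgt}


def tlm_example(src: str, tgt: str, src_lang: str, tgt_lang: str,
                rng: random.Random, mask_rate: float = 0.3,
                mask_token: str = "<mask>") -> Example:
    """Bilingual denoising (TLM): recover a masked sentence given its translation."""
    # Hypothetical word-level masking; the real noising scheme may differ.
    masked = [mask_token if rng.random() < mask_rate else w for w in tgt.split()]
    prompt = (
        f"Recover the masked {tgt_lang} sentence using its {src_lang} translation.\n"
        f"{src_lang}: {src}\n{tgt_lang}: {' '.join(masked)}\nRecovered:"
    )
    return {"input": prompt, "target": tgt}


def xss_example(sent_a: str, sent_b: str, is_parallel: bool) -> Example:
    """Crosslingual semantic similarity (XSS): do two sentences share a meaning?"""
    prompt = (
        "Do these two sentences have the same meaning?\n"
        f"A: {sent_a}\nB: {sent_b}\nAnswer:"
    )
    return {"input": prompt, "target": "yes" if is_parallel else "no"}


def mix_with_replay(new_examples: List[Example], replay_pool: List[Example],
                    replay_ratio: float, rng: random.Random) -> List[Example]:
    """Experience replay: add samples of earlier instruction data to the new data.

    replay_ratio is the assumed number of replayed examples per new example.
    """
    n_replay = min(len(replay_pool), round(replay_ratio * len(new_examples)))
    mixed = list(new_examples) + rng.sample(replay_pool, n_replay)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy parallel pair, used only to exercise the functions.
    src, tgt = "Selamat pagi", "Wilujeng enjing"
    new_data = [
        mt_example(src, tgt, "Indonesian", "Sundanese"),
        tlm_example(src, tgt, "Indonesian", "Sundanese", rng),
        xss_example(src, tgt, is_parallel=True),
    ]
    old_data = [{"input": f"old task {i}", "target": f"answer {i}"} for i in range(10)]
    for ex in mix_with_replay(new_data, old_data, replay_ratio=1.0, rng=rng):
        print(ex["input"].splitlines()[0], "->", ex["target"])
```

The replayed examples are what keep the earlier tasks in the training mix. How many to replay, and from which tasks, is a design choice that this sketch leaves as a single ratio.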

Paper Structure (32 sections, 6 figures, 19 tables)

This paper contains 32 sections, 6 figures, 19 tables.

Figures (6)

  • Figure 1: The number of languages supported by existing LLMs (green region) per language family. Existing LLMs only support a fraction of languages around the globe. Most of them are within the Indo-European language family, while most other language families are underrepresented or even unexplored.
  • Figure 2: Example of the alignment-based crosslingual instruction prompts, i.e., bilingual denoising (TLM), machine translation (MT), and crosslingual semantic similarity (XSS) in comparison to the monolingual denoising (MLM).
  • Figure 3: Average performance of various models across different model scales on the L1 and L2 language subsets of the NT-S and NX-S datasets.
  • Figure 4: $\Delta$ weighted F1 of InstructAlign-tuned BLOOMZ-560M with (left) TLM and (right) XSS objectives under various continual instruction-tuning approaches, compared to the original BLOOMZ-560M baseline. Negative scores indicate that the model performs worse than the baseline.
  • Figure 5: Correlation of $\Delta$ weighted F1 from the InstructAlign tuned models to the corresponding BLOOMZ backbone models on novel and unseen languages. $R$ denotes the Pearson correlation coefficient.
  • ...and 1 more figure