
Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Yufeng He, Kaikai An, Baobao Chang

TL;DR

The paper tackles language-level performance disparity in multilingual pretrained language models caused by uneven multilingual pretraining data. It proposes ALSACE, a two-stage method that uses unlabeled multilingual data to select task-relevant teacher languages and then performs cross-lingual self-distillation within the same model. ALSACE reduces cross-lingual transfer gaps and achieves competitive multilingual NLU performance on XNLI, PAWS-X, XCOPA, and GeoMLAMA for models such as XLM-R and mT5, and it remains effective in limited-resource settings with as little as 500-shot unlabeled data. By automatically identifying adaptive teachers and distilling knowledge among languages, the approach transfers both language-specific and language-agnostic knowledge, with practical implications for equitable multilingual NLP.

Abstract

Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervised fine-tuning of the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparity across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full-resource to limited-resource settings. The code for our approach is available at https://github.com/pkunlp-icler/ALSACE.

Paper Structure (24 sections, 5 equations, 5 figures, 13 tables)

This paper contains 24 sections, 5 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: ALSACE can reduce language-level performance disparity by mitigating knowledge disparity across languages on the GeoMLAMA benchmark (Yin et al., 2022).
  • Figure 2: Results of ALSACE on XLM-R-large on the GeoMLAMA dataset. The results show that ALSACE uses the teacher languages to guide other languages and generally improves their language-specific knowledge.
  • Figure 3: Accurately Answered Questions across All Languages in XNLI Baseline.
  • Figure 4: Performance of different Ensemble Methods.
  • Figure 5: Comparison of ALSACE Performance with and without Language Selection on the XNLI dataset. All results are based on XLM-R-large.