Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong

TL;DR

This work tackles the underrepresentation of extremely low-resource languages, such as Tibetan, Uyghur, Kazakh, and Mongolian, in multilingual NLP by adapting a pretrained encoder to generation tasks through a shared-weight framework. It introduces XLM-SWCM, which reuses encoder weights in the decoder, builds the decoder from two layer types (Normal and Custom) with Normal layers inserted at a fixed frequency among the Custom layers, and pretrains on MC2 with denoising auto-encoding and machine translation objectives. Across text summarization, machine reading comprehension, and machine translation, XLM-SWCM outperforms much larger baselines (including MC2-LLaMA-13B), demonstrating strong cross-lingual transfer and data efficiency. The results suggest the framework can extend multilingual generation to languages with scarce data and resources, with practical implications for linguistic equity and AI accessibility.

Abstract

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
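The weight reuse described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary layout, the function name `init_decoder_layer_from_encoder`, and the choice to seed cross-attention from the encoder's self-attention are all assumptions made for illustration only.

```python
import copy


def init_decoder_layer_from_encoder(encoder_layer):
    """Build decoder-layer weights by copying them from an encoder layer.

    Hypothetical scheme: the summary only states that encoder weights are
    reused in the decoder, not which component initializes which.
    """
    return {
        # Decoder self-attention starts from the encoder's self-attention.
        "self_attn": copy.deepcopy(encoder_layer["self_attn"]),
        # Assumption: cross-attention is also seeded from self-attention.
        "cross_attn": copy.deepcopy(encoder_layer["self_attn"]),
        # The feed-forward block is copied as-is.
        "ffn": copy.deepcopy(encoder_layer["ffn"]),
    }


# Toy "weights" stand in for real tensors.
encoder_layer = {"self_attn": [0.1, 0.2], "ffn": [0.3, 0.4]}
decoder_layer = init_decoder_layer_from_encoder(encoder_layer)
```

Copying rather than aliasing the weights lets the decoder start in the encoder's learned semantic space while still being updated independently during pretraining. Whether the real model copies or ties its weights is not stated here.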

Paper Structure

This paper contains 36 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The relationship between population size and dataset size in OSCAR (y-axis, in MB) for various high-, middle-, and low-resource languages. Kazakh (kk), Mongolian (mn), Tibetan (bo), and Uyghur (ug) represent the languages we studied in this work.
  • Figure 2: An overview of the shared weight framework for efficiently adapting multilingual encoders to text generation in low-resource languages.
  • Figure 3: The weight initialization schemes for the CustomDecoderLayer. The colored arrows indicate the initialization of weights between the different components.
  • Figure 4: ROUGE-L scores on Tibetan summarization for different X-values (insertion frequency of normal layers). The three lines correspond to different dataset sizes.
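The X-value in Figure 4 (insertion frequency of normal layers) can be pictured with a short sketch. It assumes that one Normal layer is inserted after every X Custom layers. The function name `build_decoder_layout` and the string labels are hypothetical, and the paper's exact placement rule may differ.

```python
def build_decoder_layout(num_custom_layers, x):
    """Return the decoder layer order as a list of labels.

    Assumption: one "normal" layer follows every `x` "custom" layers.
    """
    if x < 1:
        raise ValueError("x must be a positive integer")
    layout = []
    for i in range(1, num_custom_layers + 1):
        layout.append("custom")
        # After every x-th custom layer, insert one normal layer.
        if i % x == 0:
            layout.append("normal")
    return layout


layout = build_decoder_layout(num_custom_layers=6, x=3)
```

Under this assumption, a smaller X yields more Normal layers for the same number of Custom layers, which is the trade-off Figure 4 examines across dataset sizes.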