LangSAMP: Language-Script Aware Multilingual Pretraining
Yihong Liu, Haotian Ye, Chunlan Ma, Mingyang Wang, Hinrich Schütze
TL;DR
LangSAMP (Language-Script Aware Multilingual Pretraining) addresses language neutrality in multilingual encoders by introducing separate language and script embeddings that are added to the Transformer output before the MLM head. This design offloads language- and script-specific information from the backbone, so the backbone can serve as a universal encoder during fine-tuning. Applied to pretraining on a corpus of 500+ languages, it improves zero-shot crosslingual transfer. Experiments on sentence retrieval, text classification, and sequence labeling show consistent gains, especially for tail languages and sequence-level tasks. Analyses indicate improved language neutrality, measured as higher cross-language similarity, and show that the embeddings are informative for selecting donor languages. The work demonstrates that auxiliary embeddings can be used during pretraining without increasing the number of downstream model parameters, offering practical benefits for crosslingual NLP and guiding future exploration of language-aware representations in multilingual settings.
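The mechanism summarized above can be illustrated with a toy sketch. It assumes the language and script vectors are simply summed element-wise with each token's final hidden state, and that they are left out at fine-tuning time. The function name `add_language_script` and all numeric values are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
from typing import List

Vector = List[float]


def add_language_script(
    hidden_states: List[Vector], lang_emb: Vector, script_emb: Vector
) -> List[Vector]:
    """Add the same language and script vectors to every token's hidden state."""
    return [
        [h + l + s for h, l, s in zip(token, lang_emb, script_emb)]
        for token in hidden_states
    ]


# Toy backbone output (hypothetical values): two tokens, hidden size 3.
hidden_states = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
lang_emb = [0.25, 0.25, 0.0]   # stand-in for one language's learned vector
script_emb = [0.0, 0.5, -0.5]  # stand-in for one script's learned vector

# Pretraining (assumed): the MLM head sees backbone output plus both embeddings.
mlm_head_input = add_language_script(hidden_states, lang_emb, script_emb)

# Fine-tuning (assumed): the auxiliary embeddings are dropped, so the task head
# sees the backbone output directly and no extra parameters are needed.
task_head_input = hidden_states
```

Because the embeddings enter only after the last Transformer block, removing them leaves the backbone's computation unchanged, which is what allows it to be reused as a single encoder downstream.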
Abstract
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings, i.e., learnable vectors assigned to individual languages. However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we add these embeddings to the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks. Extensive analysis reveals that the language and script embeddings capture language- and script-specific nuances, which in turn makes the token representations more language-neutral, as evidenced by improved pairwise cosine similarity. In a case study, we also show that the language and script embeddings can be used to select better source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
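One way to picture source-language selection from learned language embeddings is to rank candidate source languages by cosine similarity to the target language's embedding. The sketch below assumes this ranking criterion, which the abstract does not spell out. The helper names `cosine` and `rank_donors`, the placeholder language keys, and the two-dimensional vectors are all hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_donors(target: List[float], candidates: Dict[str, List[float]]) -> List[str]:
    """Return candidate names sorted from most to least similar to the target."""
    return sorted(
        candidates,
        key=lambda name: cosine(target, candidates[name]),
        reverse=True,
    )


# Hypothetical language embeddings for three candidate source languages.
language_embeddings = {
    "src_a": [1.0, 0.0],
    "src_b": [0.0, 1.0],
    "src_c": [1.0, 1.0],
}
target_embedding = [2.0, 1.0]  # hypothetical embedding of the target language

ranking = rank_donors(target_embedding, language_embeddings)
best_donor = ranking[0]  # the candidate whose embedding is closest in direction
```

In practice the vectors would come from the pretrained language (or script) embedding table, and the top-ranked candidates would be tried as fine-tuning source languages.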
