LangSAMP: Language-Script Aware Multilingual Pretraining

Yihong Liu; Haotian Ye; Chunlan Ma; Mingyang Wang; Hinrich Schütze

TL;DR

LangSAMP (Language-Script Aware Multilingual Pretraining) addresses language neutrality in multilingual encoders by introducing separate language and script embeddings that are added to the Transformer output before the MLM head. This architectural adjustment moves language- and script-specific information out of the backbone, so the encoder can be used as a universal encoder during fine-tuning. Applied to the continued pretraining of XLM-R on a corpus covering more than 500 languages, it improves zero-shot crosslingual transfer. Experiments across sentence retrieval, text classification, and sequence labeling show consistent gains, especially for tail languages and sequence-level tasks. Analyses indicate improved language neutrality, measured as higher pairwise cosine similarity across languages, and show that the learned embeddings are informative for selecting donor languages. The work demonstrates that auxiliary embeddings can be leveraged without increasing downstream model parameters, offering practical benefits for crosslingual NLP and guiding future exploration of language-aware representations in multilingual settings.

Abstract

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings (learnable vectors assigned to individual languages). However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks. Extensive analysis reveals that language and script embeddings capture language- and script-specific nuances, which in turn yields more language-neutral representations, as evidenced by improved pairwise cosine similarity. In our case study, we also show that language and script embeddings can be used to select better source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
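
The core step described in the abstract, adding language and script embeddings to the Transformer output before the language modeling head, can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the authors' implementation: the function name `add_lang_script`, the `use_aux` flag, the toy hidden size, and all example values are hypothetical. The `use_aux=False` branch reflects the statement in the TL;DR that the backbone serves as a universal encoder during fine-tuning.

```python
from typing import List

Vector = List[float]


def add_lang_script(
    hidden: List[Vector],   # Transformer output, one vector per token
    lang_emb: Vector,       # embedding of the input's language (toy values)
    script_emb: Vector,     # embedding of the input's script (toy values)
    use_aux: bool = True,   # assumed: True in MLM pretraining, False in fine-tuning
) -> List[Vector]:
    """Return the representations that would be fed to the LM head."""
    if not use_aux:
        # Without auxiliary embeddings, the encoder output is passed on unchanged.
        return [list(h) for h in hidden]
    # Add the same language and script vectors to every token position.
    return [
        [h_k + l_k + s_k for h_k, l_k, s_k in zip(h, lang_emb, script_emb)]
        for h in hidden
    ]


# Toy example: 2 tokens, hidden size 3.
hidden = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
lang_emb = [0.5, 0.0, -0.5]
script_emb = [0.25, 0.25, 0.0]

pretrain_repr = add_lang_script(hidden, lang_emb, script_emb)
finetune_repr = add_lang_script(hidden, lang_emb, script_emb, use_aux=False)
```

In this sketch the auxiliary vectors only affect what the prediction head sees, so dropping them later leaves the encoder's parameter count unchanged.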

Paper Structure (42 sections, 1 equation, 5 figures, 19 tables)

Introduction
Related Work
Multilingual Pretrained Language Models
Language Embeddings
Methodology
Language and Script Embeddings
Language-Script Aware Modeling
Fine-tuning on Downstream tasks
Experiments
Setups
Training Corpora and Tokenizer
Continued pretraining
Baseline
Downstream Tasks
Sentence Retrieval.
...and 27 more sections

Figures (5)

Figure 1: An illustration of LangSAMP for a single batch. Each text may come from different languages and different scripts. Language and script embeddings are added to the transformer output before feeding into the language modeling head. This setup improves the language neutrality of the representations as the auxiliary embeddings share the burden by encoding some language- and script-specific information useful for decoding specific tokens in masked language modeling.
Figure 2: Illustration of LangSAMP applied to a German sentence (left) and a Ukrainian sentence (right), both meaning "I like the cute cat". Language and script embeddings are added to the outputs from the transformer block. The resulting representation is used to predict the original tokens at the [mask] positions in MLM training.
Figure 3: PCA visualizations of head language embeddings (left) and script embeddings (right). We see that some related languages and scripts are close to each other, indicating that they encode language- and script-specific information. Data imbalance may have caused some languages/scripts with limited data to appear as outliers.
Figure 4: Similarity improvement (by percentage) from baseline to LangSAMP in terms of the pairwise cosine similarity. Similarity is increased for each pair, indicating better language neutrality of the representations.
Figure 5: Comparison between baseline (left) and LangSAMP (right) in terms of the pairwise cosine similarity. LangSAMP achieves better similarity for each pair, indicating improved language neutrality of the representations.
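
The pairwise cosine similarity reported in Figures 4 and 5 can be illustrated with a short sketch. It assumes each language is summarized by a single non-zero representation vector; how the paper obtains those vectors is not stated on this page, so the toy vectors, the language codes used as keys, and the helper names `cosine` and `pairwise_similarity` are assumptions for illustration only.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity(
    reprs: Dict[str, List[float]],
) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of languages."""
    return {
        (a, b): cosine(reprs[a], reprs[b])
        for a, b in combinations(sorted(reprs), 2)
    }


# Toy per-language representations (hypothetical values).
reprs = {"deu": [1.0, 0.0], "ukr": [0.0, 1.0], "eng": [1.0, 1.0]}
sims = pairwise_similarity(reprs)
```

Under this reading, higher values for a language pair mean the two languages' representations point in more similar directions, which is how the captions interpret improved language neutrality.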
