Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages

Aloka Fernando; Surangika Ranathunga

TL;DR

The paper tackles suboptimal cross-lingual representations in multilingual pre-trained language models for low-resource languages by introducing Linguistic Entity Masking (LEM), a masking strategy that masks only a single token within a linguistic entity span (named entities (NEs), nouns, or verbs) during continual pre-training. It uses a two-stage process, first with monolingual data ($LEM_{mono}$) and then with parallel data ($LEM_{para}$), aiming to preserve contextual integrity while enhancing cross-lingual signals. Across three tasks (bitext mining, parallel data curation, and code-mixed sentiment analysis), LEM consistently outperforms the MLM+TLM baseline and other masking strategies, with NE masking providing the largest gains. The method is robust to noisy data and shows practical potential for improving cross-lingual capabilities in low-resource language settings, particularly when the monolingual data is drawn from the parallel corpus itself (dependent monolingual data).

Abstract

Multilingual Pre-trained Language Models (multiPLMs) trained on the Masked Language Modelling (MLM) objective are commonly used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to pre-train it further; this is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM), to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. LEM differs from MLM and TLM in two ways. First, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Second, LEM limits masking to a single token within the linguistic entity span, thus keeping more context, whereas in MLM and TLM tokens are masked randomly. We evaluate the effectiveness of LEM on three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis, using three low-resource language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil. Experiment results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM for all three tasks.
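The single-token rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lem_mask`, the `[MASK]` string, the half-open `(start, end)` span format, and the choice to pick the masked position uniformly at random inside each span are all assumptions made for illustration. The sketch also assumes entity spans (nouns, verbs, named entities) have already been identified by some external tagger, and it masks one token in every span it is given; whether the paper masks every entity or only a sampled subset is not stated in this summary.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; a real tokenizer defines its own


def lem_mask(tokens, entity_spans, rng=None, mask_token=MASK):
    """Mask exactly one token inside each linguistic entity span.

    tokens: list of token strings for one input sequence.
    entity_spans: list of half-open (start, end) index pairs, assumed to be
        non-overlapping and produced by an external POS/NER tagger.
    Returns (masked_tokens, labels), where labels holds the original token
    at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in entity_spans:
        # Assumption: the masked position is drawn uniformly within the span.
        idx = rng.randrange(start, end)
        labels[idx] = tokens[idx]
        masked[idx] = mask_token
    return masked, labels


tokens = ["Colombo", "Fort", "is", "a", "busy", "railway", "station"]
spans = [(0, 2), (5, 7)]  # a two-token named entity and a two-token noun phrase
masked, labels = lem_mask(tokens, spans, rng=random.Random(0))
```

Because only one position per span is replaced, the remaining tokens of a multi-token entity stay visible to the model, which is the context-preserving property the abstract contrasts with random masking.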

Paper Structure (38 sections, 5 equations, 4 figures, 18 tables)

Introduction
Related Work
MLM and TLM Objectives
Different Masking Strategies
Methodology
Theoretical Framework for Linguistic Entity Masking (LEM)
Experiments
Impact of the type of monolingual data in $LEM_{mono}$
Evaluation of Different Masking Strategies
Evaluation of LEM Strategy and Ablation Study
Evaluation Tasks
Bitext Mining
Parallel Data Curation
Code-Mixed Sentiment Analysis
Experiment Setup
...and 23 more sections

Figures (4)

Figure 1: Self-attention weights among the words of an English sentence and its corresponding Sinhala sentence. The darker the colour, the stronger the relationship (i.e., the self-attention weight) between the two words.
Figure 2: A comparison of the existing masking strategies, using an example from the English-Sinhala language pair. Sub-word masking, whole-word masking, span masking, and $LEM_{mono}$ consider only monolingual sentences during masking. TLM and $LEM_{para}$ apply masking to concatenated parallel sentences. In $LEM_{mono}$ and $LEM_{para}$, only a single token from the linguistic entity is masked.
Figure 3: The LEM continual pre-training process. As the multiPLM, we select an existing multilingual pre-trained language model. The first step, $LEM_{mono}$, continually pre-trains the model with stacked monolingual sentences, meaning that the source-language monolingual data is passed first, followed by the target-language monolingual data. In the second continual pre-training step, $LEM_{para}$, the LEM strategy is applied to the concatenated parallel data.
Figure 4: Bitext mining recall scores when using independent monolingual data (MADLAD-400) versus dependent monolingual data obtained from the parallel corpus (SiTa-Trilingual parallel Corpus).
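
The two input layouts named in the Figure 3 caption, stacked monolingual data for $LEM_{mono}$ and concatenated parallel sentences for $LEM_{para}$, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the helper names `stack_monolingual` and `concat_parallel`, the `</s>` separator string, and the half-open `(start, end)` span format are hypothetical. The only behaviour taken from the caption is the ordering (source first, then target) and the concatenation of each parallel pair into one sequence.

```python
def stack_monolingual(src_sentences, tgt_sentences):
    """LEM_mono layout: all source-language sentences, then all target-language ones."""
    return list(src_sentences) + list(tgt_sentences)


def concat_parallel(src_tokens, src_spans, tgt_tokens, tgt_spans, sep="</s>"):
    """LEM_para layout: one sequence holding the source and target sentence.

    Entity spans found on the target side are shifted so that they index into
    the concatenated sequence. The separator string is an assumption.
    """
    offset = len(src_tokens) + 1  # +1 accounts for the separator token
    tokens = list(src_tokens) + [sep] + list(tgt_tokens)
    spans = list(src_spans) + [(s + offset, e + offset) for s, e in tgt_spans]
    return tokens, spans


stacked = stack_monolingual(["en sentence 1", "en sentence 2"], ["si sentence 1"])
pair_tokens, pair_spans = concat_parallel(
    ["Kamal", "runs"], [(0, 1), (1, 2)],
    ["T1", "T2", "T3"], [(0, 2)],  # placeholder target-side tokens
)
```

Shifting the target-side spans keeps a single list of entity spans per concatenated sequence, so the same single-token masking routine can be applied in both stages.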
