mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models

Peiqin Lin, Chengzhi Hu, Zheyu Zhang, André F. T. Martins, Hinrich Schütze

TL;DR

The paper introduces mPLM-Sim, a language similarity measure derived from multilingual pretrained language models using multi-parallel corpora to quantify cross-language relations. It conducts a comprehensive analysis across 11 mPLMs, three corpora, and 32 languages to examine how similarity signals vary by layer, model, and data source, and it evaluates the method on zero-shot cross-lingual transfer tasks. The results show that mPLM-Sim correlates with lexical and certain high-level linguistic features and generally yields 1–2% improvements over traditional linguistic similarity measures in source-language selection for cross-lingual transfer. These findings demonstrate that language subspaces learned during pretraining can be leveraged to enhance multilingual transfer performance and offer practical guidance for selecting source languages in low-resource settings.

Abstract

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
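
To make the idea in the abstract concrete, the following is a minimal sketch of one plausible way to induce language similarity from sentence embeddings of a multi-parallel corpus and then pick a source language for transfer. It is not the authors' implementation. The pooling into one vector per sentence, the averaging of cosine similarity over aligned sentences, the function names (`language_similarity`, `similarity_matrix`, `pick_source`), and the toy language codes and random embeddings are all assumptions made for illustration; in practice the embeddings would come from a chosen layer of an mPLM.

```python
import numpy as np

def language_similarity(emb_a, emb_b):
    """Mean cosine similarity over aligned (parallel) sentence embeddings.

    emb_a, emb_b: arrays of shape (n_sentences, dim); row i of each array
    is assumed to embed the same sentence in two different languages.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def similarity_matrix(embeddings):
    """Pairwise similarity for a dict mapping language code -> embeddings."""
    langs = sorted(embeddings)
    sim = np.array([
        [language_similarity(embeddings[x], embeddings[y]) for y in langs]
        for x in langs
    ])
    return langs, sim

def pick_source(target, candidates, embeddings):
    """Return the candidate source language most similar to the target."""
    return max(
        candidates,
        key=lambda c: language_similarity(embeddings[target], embeddings[c]),
    )

# Toy stand-in for embeddings taken from one layer of an mPLM (hypothetical data).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 16))
toy = {
    "aaa": base,                                        # target language
    "bbb": base + 0.1 * rng.normal(size=base.shape),    # close to "aaa"
    "ccc": base + 2.0 * rng.normal(size=base.shape),    # far from "aaa"
}

langs, sim = similarity_matrix(toy)
best = pick_source("aaa", ["bbb", "ccc"], toy)
```

Here `best` is the candidate whose embeddings sit closest to those of the target language, which is the role a similarity measure plays in source-language selection. Repeating the computation with embeddings from different layers would expose the layer-wise variation the abstract describes.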

Paper Structure (26 sections, 7 figures, 21 tables)

Figures (7)

  • Figure 1: Comparison across layers: Pearson correlation (MEAN) between mPLM-Sim and linguistic similarity measures across layers for Glot500 and Flores on 32 languages. Correlation between mPLM-Sim and LEX peaks in the first layer and decreases, while the correlation with GEN, GEO, and SYN slightly increases in the low layers before reaching its peak.
  • Figure 2: Macro average results (averaged over target languages) on cross-lingual transfer for baselines and for mPLM-Sim in all layers of Glot500. ENG represents using English as the source language. LEX, GEN, GEO, and FEA indicate using the most similar languages based on the corresponding similarity measures as the source language. The red dots of mPLM-Sim highlight the layer with the highest score.
  • Figure 3: Dendrograms illustrating hierarchical clustering results at layer 0, 4, 8, and 12 for Glot500 and Flores across 32 languages.
  • Figure 4: Heatmaps of cosine similarity results at layer 0 for Glot500 and Flores across 32 languages.
  • Figure 5: Heatmaps of cosine similarity results at layer 4 for Glot500 and Flores across 32 languages.
  • ...and 2 more figures
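
The Figure 1 caption reports a Pearson correlation (MEAN) between mPLM-Sim and linguistic similarity measures. The sketch below shows one way such a score could be computed; it is an assumption, not the paper's stated procedure. It reads "MEAN" as averaging a per-language Pearson correlation over all languages, with each language's self-similarity excluded. The function name `mean_rowwise_pearson` and the two small matrices are hypothetical toy values, not results from the paper.

```python
import numpy as np

def mean_rowwise_pearson(sim_a, sim_b):
    """Average, over languages, of the Pearson correlation between the
    similarity profiles that two measures assign to the same language.

    sim_a, sim_b: square arrays of shape (n_langs, n_langs) with matching
    language order. The diagonal (self-similarity) is skipped.
    """
    n = sim_a.shape[0]
    scores = []
    for i in range(n):
        mask = np.arange(n) != i
        r = np.corrcoef(sim_a[i, mask], sim_b[i, mask])[0, 1]
        scores.append(r)
    return float(np.mean(scores))

# Toy similarity matrices for four hypothetical languages.
model_sim = np.array([
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
# A second measure that ranks languages identically (positive affine map).
ling_sim = 0.5 * model_sim + 0.25

score = mean_rowwise_pearson(model_sim, ling_sim)
inverse_score = mean_rowwise_pearson(model_sim, -model_sim)
```

Because `ling_sim` is a positive affine transform of `model_sim`, every per-language correlation is 1, so `score` is 1 up to floating-point error, while negating the matrix gives -1. With real data, computing this score once per layer would yield a curve of the kind the caption describes.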