Analyzing the Evaluation of Cross-Lingual Knowledge Transfer in Multilingual Language Models

Sara Rajaee, Christof Monz

TL;DR

The paper challenges the assumption that high zero-shot cross-lingual performance in multilingual transformers reflects true cross-lingual linguistic knowledge. It introduces across-language evaluation by constructing tasks where inputs combine multiple languages across NLI, Paraphrase Identification, and Question Answering, and complements this with control tasks to probe reliance on linguistic cues. The authors find that performance gains largely stem from dataset artifacts and shallow features, with especially strong effects for low-resource languages, and that even across-language fine-tuning does not guarantee robust transfer. They propose new evaluation designs, including secondary baselines and data-/task-independent assessments, to more accurately quantify cross-lingual abilities and guide future model development.

Abstract

Recent advances in training multilingual language models on large datasets seem to have shown promising results in knowledge transfer across languages and achieve high performance on downstream tasks. However, we question to what extent the current evaluation benchmarks and setups accurately measure zero-shot cross-lingual knowledge transfer. In this work, we challenge the assumption that high zero-shot performance on target tasks reflects high cross-lingual ability by introducing more challenging setups involving instances with multiple languages. Through extensive experiments and analysis, we show that the observed high performance of multilingual models can be largely attributed to factors not requiring the transfer of actual linguistic knowledge, such as task- and surface-level knowledge. More specifically, we observe what has been transferred across languages is mostly data artifacts and biases, especially for low-resource languages. Our findings highlight the overlooked drawbacks of existing cross-lingual test data and evaluation setups, calling for a more nuanced understanding of the cross-lingual capabilities of multilingual models.

Paper Structure (19 sections, 12 figures, 14 tables)

Figures (12)

  • Figure 1: Average distance between the question words that occur in the context and the center of the answer span, for the top $20\%$ easiest and hardest instances, with mBERT fine-tuned on SQuAD and evaluated on every language. As the average distance increases, the model's performance drops.
  • Figure 2: Language-pair accuracy scores for mBERT on the multilingual NLI task.
  • Figure 3: Language-pair accuracy scores for XLM-R on the multilingual NLI task.
  • Figure 4: Language-pair accuracy scores for InfoXLM on the multilingual NLI task.
  • Figure 5: Language-pair accuracy scores for mBERT on the multilingual PI task.
  • ...and 7 more figures