Tracing the Roots of Facts in Multilingual Language Models: Independent, Shared, and Transferred Knowledge

Xin Zhao, Naoki Yoshinaga, Daisuke Oba

TL;DR

The paper investigates how multilingual language models acquire and represent factual knowledge across languages, and identifies three distinct types of representation: language-independent, cross-lingual shared, and cross-lingual transferred. It combines mLAMA-based probing of encoder models (mBERT and XLM-R), neuron-level analysis via PROBELESS, and tracing of facts back to Wikipedia data to characterize how facts are learned and transferred across languages. Key findings are that probing performance varies with training data size and with tokenization, that knowledge sharing is localized rather than global, and that cross-lingual fact representations exist but are incomplete, with predictions of facts absent from the training data often relying on simple cues. These insights highlight the challenge of maintaining consistent factual knowledge across languages and motivate targeted improvements in fact representation learning and richer multilingual probing datasets.

Abstract

Acquiring factual knowledge for language models (LMs) in low-resource languages poses a serious challenge, which motivates resorting to cross-lingual transfer in multilingual LMs (ML-LMs). In this study, we ask how ML-LMs acquire and represent factual knowledge. Using the multilingual factual knowledge probing dataset, mLAMA, we first conducted a neuron investigation of ML-LMs (specifically, multilingual BERT). We then traced the roots of facts back to the knowledge source (Wikipedia) to identify the ways in which ML-LMs acquire specific facts. Finally, we identified three patterns of acquiring and representing facts in ML-LMs (language-independent, cross-lingual shared, and cross-lingual transferred) and devised methods for differentiating them. Our findings highlight the challenge of maintaining consistent factual knowledge across languages, underscoring the need for better fact representation learning in ML-LMs.

Paper Structure (35 sections, 2 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Three types of fact representation in ML-LMs. Facts are a) represented with distinct neurons across languages (language-independent), b) shared using the same neurons (cross-lingual shared), and c) transferred across languages (cross-lingual transferred).
  • Figure 2: Probing P@1 on mLAMA for full- and partial-match methods with mBERT and XLM-R.
  • Figure 3: Size of Wikipedia abstract data vs. factual probing P@1 on mLAMA with mBERT in 53 languages.
  • Figure 4: Jaccard similarity matrix of shared factual knowledge across languages with mBERT.
  • Figure 5: Neuron activity with mBERT in four languages, English, German, Indonesian, and Malay, in response to the query "William Pitt the Younger used to work in [MASK]." Color intensity indicates neuron activity; neurons in each Transformer layer are grouped into 16 bins. Distinct activation patterns in the English-German and Indonesian-Malay pairs indicate cross-lingual knowledge neurons, while differences between the pairs indicate language-independent representations.
  • ...and 3 more figures
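
To make the Jaccard similarity matrix of Figure 4 concrete, the following is a minimal sketch of how such a matrix could be computed. It assumes that the knowledge a model holds in a language is summarized as the set of fact identifiers it predicts correctly in that language; this representation, the names `jaccard`, `jaccard_matrix`, and `correct_by_lang`, and the toy fact IDs are all illustrative assumptions, not the authors' code or data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(correct_by_lang: dict) -> dict:
    """Pairwise Jaccard similarity between per-language sets of correctly predicted facts."""
    langs = sorted(correct_by_lang)
    return {
        x: {y: jaccard(correct_by_lang[x], correct_by_lang[y]) for y in langs}
        for x in langs
    }


# Toy input (hypothetical fact IDs, not taken from mLAMA).
correct_by_lang = {
    "en": {"f1", "f2", "f3"},
    "de": {"f2", "f3", "f4"},
    "id": {"f5"},
}

matrix = jaccard_matrix(correct_by_lang)
for lang, row in matrix.items():
    print(lang, {other: round(score, 2) for other, score in row.items()})
```

In this toy input, English and German share two of four distinct facts, giving a similarity of 0.5, while Indonesian shares none with either. A high score for a language pair means the model tends to answer the same facts correctly in both languages, which is the kind of overlap a matrix like Figure 4 visualizes.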