Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities

Xiaoyu Luo, Yiyi Chen, Johannes Bjerva, Qiongxiu Li

TL;DR

This work addresses memorization in multilingual LLMs and challenges the notion that data volume alone explains memorization. It introduces a language similarity graph framework and a graph-based correlation metric to analyze cross-lingual memorization across 95 languages and multiple model architectures. The study finds that among similar languages, those with fewer training tokens can exhibit higher memorization, a pattern that emerges only when language relationships are modeled. The findings highlight the importance of language-aware memorization audits and have broad implications for cross-lingual leakage risks and multilingual NLP practices.

Abstract

We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that the conventional focus on monolingual settings, effectively treating languages in isolation, may obscure the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains Memorization in MLLMs and underpins Cross-lingual Transferability, with broad implications for multilingual NLP.
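
To make the idea of a similarity-aware correlation concrete, here is a minimal Python sketch. It is not the paper's metric: the abstract does not define the graph-based correlation, so the thresholding rule, the use of connected components, and the per-component Spearman correlation are all assumptions chosen for illustration. The language names (`a` to `f`), similarity scores, token counts, and memorization rates are made-up toy values, not results from the study.

```python
from itertools import combinations

def build_graph(langs, sim, theta):
    """Assumed rule: link two languages when their similarity is >= theta."""
    adj = {lang: set() for lang in langs}
    for a, b in combinations(langs, 2):
        score = sim.get((a, b), sim.get((b, a), 0.0))
        if score >= theta:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected_components(adj):
    """Group languages into connected subgraphs (depth-first search)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5

# Toy, made-up inputs (hypothetical languages and values).
langs = ["a", "b", "c", "d", "e", "f"]
sim = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.85, ("d", "e"): 0.7}
tokens = {"a": 100, "b": 50, "c": 10, "d": 80, "e": 20, "f": 5}
mem_rate = {"a": 0.01, "b": 0.02, "c": 0.05, "d": 0.03, "e": 0.04, "f": 0.01}

comps = connected_components(build_graph(langs, sim, theta=0.6))
subgraphs = [c for c in comps if len(c) > 1]
singletons = [c for c in comps if len(c) == 1]

# Correlation over all languages, ignoring similarity.
pooled = spearman([tokens[l] for l in langs], [mem_rate[l] for l in langs])

# Correlation inside each group of similar languages.
per_component = {
    tuple(c): spearman([tokens[l] for l in c], [mem_rate[l] for l in c])
    for c in subgraphs
}
print(len(subgraphs), len(singletons), pooled, per_component)
```

In this toy setup the pooled correlation between token count and memorization rate is only weakly negative, while the correlation inside each group of similar languages is strongly negative. This mirrors the qualitative pattern the abstract describes, but only because the toy values were chosen to show it.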

Paper Structure

This paper contains 37 sections, 9 equations, 10 figures, 14 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of our Framework for Analyzing Memorization in MLLMs using Language Similarity Graph-based Correlation Analysis.
  • Figure 2: Example graphs considering Intra-Topology and Cross-Topology.
  • Figure 3: Graph Construction at Different Thresholds $\theta$.
  • Figure 4: Intra-Topology and Cross-Topology Correlation Coefficients ($\rho_G$) across varying thresholds $\theta$. Top: Memorization Rates across Thresholds. Bottom: Topology graph information via subgraph and singleton counts at varying thresholds ($x$-axis), from 6 to 20 language groups ($y$-axis), with a total of 95 languages. Takeaway: Cross-lingual transferability among similar languages impacts memorization.
  • Figure 5: Layer-wise trend for Lang2Vec (Syntax).
  • ...and 5 more figures