The Model's Language Matters: A Comparative Privacy Analysis of LLMs

Abhishek K. Mishra, Antoine Boutet, Lucas Magnana

TL;DR

The paper tackles the problem of privacy leakage in multilingual LLM deployments, arguing that language structure meaningfully affects memorization and inference risks. It analyzes English, Spanish, French, and Italian medical corpora by fine-tuning encoder-only and decoder-only models and evaluating three attacks: extraction, counterfactual memorization, and membership inference, while quantifying six linguistic indicators $M$, $S$, $R$, $T$, $C$, and $D$ to relate structure to leakage. The findings show that leakage scales with linguistic redundancy and tokenization granularity, with Italian exhibiting the strongest leakage and English the strongest membership separability; French and Spanish demonstrate more resilience due to morphological complexity. These results provide the first quantitative evidence that language matters for privacy leakage, motivating language-aware privacy defenses and tailored mitigation strategies for multilingual NLP deployments.

Abstract

Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. These risks have so far been evaluated mostly for English; this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.

Paper Structure

This paper contains 27 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Number of unique extractions across languages and prompt sizes: longer prompts increase extraction risk in general.
  • Figure 2: Cumulative distribution of text lengths for all versus extracted samples.
  • Figure 3: Distribution of counterfactual memorization scores across languages. Most points lie near zero; EN and IT display extended positive tails, FR shows rare high outliers, and ES remains the most compact.
  • Figure 4: Label distributions used for memorization scoring: balanced bins across languages confirm that score variations are not due to class imbalance.
  • Figure 5: Separability of "in" vs. "out" samples at epoch 30 under MIAs: larger gaps indicate higher risk. English exhibits the most distinct separation between training and test data, while French shows the greatest overlap, indicating stronger generalization.
  • ...and 2 more figures
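
The "in" vs. "out" separability described in the Figure 5 caption can be illustrated with a small sketch. This summary does not state the paper's exact attack or metric, so the following assumes a simple loss-based membership inference setting, in which training ("in") samples tend to receive lower loss than held-out ("out") samples. The function names and the loss values are hypothetical and serve only as illustration.

```python
from typing import Sequence


def membership_auc(in_losses: Sequence[float], out_losses: Sequence[float]) -> float:
    """Probability that a random "in" sample has lower loss than a random
    "out" sample, counting ties as half (a pairwise AUC)."""
    if not in_losses or not out_losses:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for loss_in in in_losses:
        for loss_out in out_losses:
            if loss_in < loss_out:
                wins += 1.0
            elif loss_in == loss_out:
                wins += 0.5
    return wins / (len(in_losses) * len(out_losses))


def mean_loss_gap(in_losses: Sequence[float], out_losses: Sequence[float]) -> float:
    """Mean "out" loss minus mean "in" loss; a larger gap suggests easier separation."""
    return sum(out_losses) / len(out_losses) - sum(in_losses) / len(in_losses)


if __name__ == "__main__":
    # Hypothetical per-sample losses, not values from the paper.
    in_losses = [0.2, 0.3, 0.4]
    out_losses = [0.5, 0.6, 0.35]
    print(f"AUC: {membership_auc(in_losses, out_losses):.3f}")
    print(f"Mean loss gap: {mean_loss_gap(in_losses, out_losses):.3f}")
```

Under this assumption, an AUC near 0.5 corresponds to heavily overlapping "in" and "out" distributions (the pattern the caption attributes to French), while an AUC approaching 1.0 corresponds to a distinct separation (the pattern attributed to English).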