Layer-Targeted Multilingual Knowledge Erasure in Large Language Models

Taoran Li, Varun Chandrasekaran, Zhiyuan Yu

TL;DR

MUTE (Multilingual Unlearning via Targeted Erasure) uses Centered Kernel Alignment (CKA) and the Linguistic Regions Development Score (LRDS) to identify intermediate, language-agnostic layers where cross-lingual representations converge. Restricting unlearning updates to these layers achieves robust multilingual knowledge erasure while optimizing on only a small set of source languages.

Abstract

Recent work has demonstrated that machine unlearning in Large Language Models (LLMs) fails to generalize across languages: knowledge erased in one language frequently remains accessible through others. However, the underlying cause of this failure and a principled solution remain open. In this work, we identify intervention depth as the key factor determining multilingual generalization. Through systematic layer-wise experiments, we characterize two distinct failure modes: shallow-layer interventions achieve erasure but collapse multilingual capabilities in held-out languages, while deep-layer interventions preserve utility but fail to erase target knowledge even in source languages. These findings reveal that the choice of intervention layer is not a free parameter; it fundamentally determines whether multilingual unlearning succeeds. We propose MUTE (Multilingual Unlearning via Targeted Erasure), a framework that uses Centered Kernel Alignment (CKA) and Linguistic Regions Development Score (LRDS) to identify intermediate, language-agnostic layers where cross-lingual representations converge. By restricting unlearning updates to these layers, MUTE achieves robust multilingual knowledge erasure while optimizing on only a small set of source languages. Extensive experiments across three LLM architectures and three unlearning algorithms validate our approach, with mechanistic analysis via Logit Lens probing confirming genuine knowledge removal rather than output-level suppression.
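To make the CKA component concrete, the following is a minimal sketch of standard linear CKA between two activation matrices, for example hidden states of the same layer for parallel sentences in two languages. It is not the authors' implementation: the function names, the choice of the linear kernel, and the averaging over language pairs in `mean_pairwise_cka` are assumptions made for illustration, and LRDS is not shown because its definition is not given here.

```python
from itertools import combinations

import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, dim).

    Rows of x and y must correspond to the same inputs (e.g. parallel sentences).
    """
    # Center each feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_pairwise_cka(acts_by_lang: dict[str, np.ndarray]) -> float:
    """Hypothetical multilingual score: mean linear CKA over all language pairs."""
    pairs = list(combinations(sorted(acts_by_lang), 2))
    return float(np.mean([linear_cka(acts_by_lang[a], acts_by_lang[b]) for a, b in pairs]))
```

Computing `mean_pairwise_cka` once per layer would give a per-layer alignment curve; layers where it is high are candidates for the language-agnostic region.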

Paper Structure

This paper contains 41 sections, 13 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Multilingual unlearning requires targeting the right depth. We evaluate unlearning interventions at different layer depths, training on 3 source languages (EN, ES, PT) and evaluating on 7 languages including 4 held-out languages (DE, FR, HI, IT). Left: Shallow layer interventions effectively erase knowledge but catastrophically collapse multilingual capabilities. Held-out languages lose nearly all utility. Middle: Deep layer interventions preserve utility but fail to erase knowledge. The target information remains accessible. Right: Our method MUTE targets a language-agnostic region identified via multilingual representation analysis, achieving effective erasure while preserving multilingual utility.
  • Figure 2: Identifying the language-agnostic region. CKA-LRDS analysis for Llama-3.1. Blue line: multilingual CKA alignment (higher is better). Red line: LRDS (lower is more language-agnostic). Green shaded area: language-agnostic region $\Lambda$, satisfying both $\text{CKA} > \tau_{\text{align}}$ and $\text{LRDS} < \tau_{\text{spec}}$, where thresholds are computed via Equation \ref{eq:threshold}. We select Layer 9 for parameter-based unlearning and Layer 20 for activation-based unlearning.
  • Figure 3: Identifying the language-agnostic region for Qwen-2.5-7B. Blue line: multilingual CKA alignment (higher is better). Red line: LRDS (lower is more language-agnostic). Green shaded area: language-agnostic region $\Lambda$, satisfying both $\text{CKA} > \tau_{\text{align}}$ and $\text{LRDS} < \tau_{\text{spec}}$, where thresholds are computed via Equation \ref{eq:threshold}. We select Layer 19 as the optimal intervention point.
  • Figure 4: Identifying the language-agnostic region for BLOOM-7b1. Blue line: multilingual CKA alignment (higher is better). Red line: LRDS (lower is more language-agnostic). Green shaded area: language-agnostic region $\Lambda$, satisfying both $\text{CKA} > \tau_{\text{align}}$ and $\text{LRDS} < \tau_{\text{spec}}$, where thresholds are computed via Equation \ref{eq:threshold}. We select Layer 5 as the optimal intervention point.
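The region criterion in the captions above, $\text{CKA} > \tau_{\text{align}}$ and $\text{LRDS} < \tau_{\text{spec}}$, can be sketched as a simple per-layer filter. This is an illustrative sketch only: the function name is hypothetical, the thresholds are taken as inputs because the threshold equation is not reproduced here, and the numbers in the usage example are made up rather than taken from the paper.

```python
import numpy as np


def language_agnostic_region(cka, lrds, tau_align: float, tau_spec: float) -> list[int]:
    """Return indices of layers with CKA above tau_align and LRDS below tau_spec."""
    cka = np.asarray(cka, dtype=float)
    lrds = np.asarray(lrds, dtype=float)
    if cka.shape != lrds.shape:
        raise ValueError("cka and lrds must have one score per layer")
    # A layer qualifies only if it satisfies both conditions.
    mask = (cka > tau_align) & (lrds < tau_spec)
    return np.flatnonzero(mask).tolist()


# Made-up per-layer scores for a 4-layer toy model (not results from the paper).
toy_cka = [0.2, 0.8, 0.9, 0.5]
toy_lrds = [0.9, 0.3, 0.2, 0.1]
region = language_agnostic_region(toy_cka, toy_lrds, tau_align=0.6, tau_spec=0.5)
print(region)
```

How a single intervention layer is then chosen within the region (for example Layer 9 versus Layer 20 for Llama-3.1) is not specified by this sketch.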